ottokart / punctuator2

A bidirectional recurrent neural network model with attention mechanism for restoring missing punctuation in unsegmented text
http://bark.phon.ioc.ee/punctuator
MIT License
659 stars 195 forks

Training data to support a different language #20

Open 350d opened 6 years ago

350d commented 6 years ago

Hello! I'm trying to add support for a different language here. I have training data with about 100,000 sentences and can increase it to 1M or so. How many sentences do I need to start training, and how should I update the ./run.sh file in my case (the input file name is already updated)? I've tried using both the total number of lines and half of it, and got these errors:

...
Step 1/3
Step 2/3
[nltk_data] Downloading package punkt to /Users/350d/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
Skipped 11383 lines
Step 3/3
head: illegal line count -- -50000
head: illegal line count -- -25000
Cleaning up...
...

Update: OK, the OSX versions of head and tail don't accept negative values; I fixed this by installing coreutils. I will let you know about my progress here...
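
For anyone who hits the same `head: illegal line count` error and would rather not install coreutils: GNU `head -n -N` means "all but the last N lines", and BSD `head` on macOS rejects the negative count. The sketch below reproduces that split in Python. The function name `split_holdout` and the sample data are hypothetical; this is not taken from the repository's `run.sh`, only an illustration of the same operation.

```python
from typing import List, Tuple


def split_holdout(lines: List[str], holdout: int) -> Tuple[List[str], List[str]]:
    """Return (all but the last `holdout` lines, the last `holdout` lines).

    Mirrors GNU `head -n -N` (first element) and `tail -n N` (second element).
    """
    if holdout < 0:
        raise ValueError("holdout must be non-negative")
    if holdout == 0:
        # lines[:-0] would be empty, so handle zero explicitly.
        return list(lines), []
    return lines[:-holdout], lines[-holdout:]


if __name__ == "__main__":
    # Hypothetical four-line corpus; in practice read the lines from a file.
    sample = ["one\n", "two\n", "three\n", "four\n"]
    train, dev = split_holdout(sample, 1)
    print(len(train), len(dev))
```

The zero case is handled separately because `lines[:-0]` is an empty slice in Python, which would silently drop the whole training set.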

350d commented 6 years ago

Hello! I've successfully collected training data and created a model for my language. Now I have a problem: the network can't handle simple language rules, such as a comma before a specific word or a colon between other words. How can I predefine custom rules for this model? Thank you!

ottokart commented 6 years ago

Hi! If you have a sufficient amount of decent-quality training data (10M - 40M words), then the model should be able to learn most of the rules on its own (although it will of course make some mistakes). This toolkit currently does not support manually created custom rules, so you would have to write this extension yourself. You can also try training some other probabilistic model(s) and interpolating their probabilities with the output of this model (this would also be a custom extension).

Best, Ottokar
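
The interpolation idea can be sketched as a weighted average of two probability distributions over punctuation labels for a single slot between words. Everything here is an assumption for illustration: the label names, the numbers, the `interpolate` and `best_label` helpers, and the premise that per-slot probabilities from the network are available to your own code. The second distribution could come from any other model, including one that encodes a hand-written rule as a strong preference for a comma.

```python
from typing import Dict

Distribution = Dict[str, float]


def interpolate(p_model: Distribution, p_other: Distribution, weight: float) -> Distribution:
    """Linear interpolation: weight * p_model + (1 - weight) * p_other."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be in [0, 1]")
    # Labels missing from one distribution are treated as probability 0.
    labels = set(p_model) | set(p_other)
    return {
        label: weight * p_model.get(label, 0.0)
        + (1.0 - weight) * p_other.get(label, 0.0)
        for label in labels
    }


def best_label(dist: Distribution) -> str:
    """Pick the label with the highest interpolated probability."""
    return max(dist, key=lambda label: dist[label])


if __name__ == "__main__":
    # Hypothetical probabilities for one slot between two words.
    p_network = {"NONE": 0.6, "COMMA": 0.3, "PERIOD": 0.1}
    # Hypothetical second model that strongly prefers a comma here.
    p_rule = {"NONE": 0.05, "COMMA": 0.9, "PERIOD": 0.05}
    combined = interpolate(p_network, p_rule, weight=0.5)
    print(best_label(combined))
```

The weight would normally be tuned on held-out data; 0.5 is only a placeholder.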

wltz commented 5 years ago

Is there an upper limit for the training data, or is it simply the more the better?