ottokart / punctuator2

A bidirectional recurrent neural network model with attention mechanism for restoring missing punctuation in unsegmented text
http://bark.phon.ioc.ee/punctuator
MIT License
657 stars, 195 forks

How to distribute the data to ep.train.txt, ep.dev.txt, and ep.test.txt? What's the purpose of these files? #38

Open wltz opened 5 years ago

wltz commented 5 years ago

head -n -400000 step2.txt > ./out/ep.train.txt
tail -n 400000 step2.txt > step3.txt
head -n -200000 step3.txt > ./out/ep.dev.txt
tail -n 200000 step3.txt > ./out/ep.test.txt

Hi ottokart, could you elaborate on how to distribute the data from the corpus across these three files, and what each file is for? I have a small corpus file: 65k lines and about 3M words, so I need to know how I should divide it among these files. Thanks!

ottokart commented 5 years ago

Hi!

That's quite a small dataset. I think I would split it into 80% training, 10% dev, and 10% test data. The training file is used for training the parameters of the model. The dev set is used for finding good hyperparameters (hidden layer size, learning rate, etc.), and the training script also uses the score on the dev set to decide when to stop training to prevent overfitting. The test set is used for final evaluation and should not be touched during the training and development of the model.
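
A minimal sketch of such an 80/10/10 split by line count, for illustration only. The commands quoted at the top of this issue reserve a fixed 400,000 lines for dev and test, which is more than a 65k-line corpus contains, so with GNU head the training file would come out empty. The function name split_corpus is hypothetical and not part of punctuator2; the output file names are the ones from the question, and the split is assumed to be contiguous (first 80% of lines for training, then 10% dev, then the rest for test), as in the original commands.

```shell
# Hypothetical helper, not part of punctuator2: split a corpus 80/10/10 by lines.
# Usage: split_corpus <input file> <output directory>
split_corpus() {
    input="$1"
    outdir="$2"
    mkdir -p "$outdir" || return 1

    # Count lines, then derive the sizes of the training and dev parts.
    total=$(wc -l < "$input") || return 1
    n_train=$((total * 80 / 100))
    n_dev=$((total * 10 / 100))

    # First 80% of the lines: training data.
    head -n "$n_train" "$input" > "$outdir/ep.train.txt"
    # Next 10%: dev data (tail -n +K starts output at line K).
    tail -n "+$((n_train + 1))" "$input" | head -n "$n_dev" > "$outdir/ep.dev.txt"
    # Everything that remains: test data.
    tail -n "+$((n_train + n_dev + 1))" "$input" > "$outdir/ep.test.txt"
}
```

Calling split_corpus step2.txt ./out on a 65,000-line file would give 52,000 training lines, 6,500 dev lines, and 6,500 test lines. Because the test part takes whatever remains, any rounding leftovers end up there and no line is dropped.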

AbdallahQoutbAli commented 4 years ago

Where can I find the dataset? Also, the code that uses sys.arg[0] raises an error in all files.