ottokart / punctuator2

A bidirectional recurrent neural network model with attention mechanism for restoring missing punctuation in unsegmented text
http://bark.phon.ioc.ee/punctuator
MIT License

Where to obtain datasets for training? #62

Open · chrisspen opened this issue 4 years ago

chrisspen commented 4 years ago

In your README, you say you trained your model on the TED and Europarl datasets. Where did you obtain these? I can't find any public download links for anything matching those names.

I'd like to train my own model, using those as a starting point, but these datasets don't seem to exist anywhere.

ottokart commented 4 years ago

Hi,

Europarl can be downloaded from here: http://hltshare.fbk.eu/IWSLT2012/training-monolingual-europarl.tgz

The TED dataset was preprocessed by the authors of http://www.lrec-conf.org/proceedings/lrec2016/pdf/103_Paper.pdf and the resulting dataset is shared here: https://drive.google.com/file/d/0B13Cc1a7ebTuMElFWGlYcUlVZ0k/view

I used this simple script to convert the format of the files: https://drive.google.com/open?id=1sW23C4kqRJ6rDSBurco8_0lJ3VZJIkta
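In case that Drive link ever goes stale, the conversion is roughly the following. This is only a minimal sketch of the idea, not the exact converter.py: it assumes each line of the LREC TED files holds a word plus a punctuation label (label names like COMMA, PERIOD, QUESTION are an assumption), and it writes running text with the ,COMMA / .PERIOD / ?QUESTIONMARK tokens that punctuator2's data.py expects.

```python
# convert_lrec_sketch.py -- hypothetical stand-in for the linked converter.py.
# Assumed input: one "<word> <label>" pair per line (label "O" = no punctuation).
# Output: space-separated running text with punctuator2-style punctuation tokens.
import sys

# Assumed mapping from LREC-style labels to punctuator2 punctuation tokens.
LABEL_MAP = {
    "COMMA": ",COMMA",
    "PERIOD": ".PERIOD",
    "QUESTION": "?QUESTIONMARK",
}

def convert(in_path, out_path):
    tokens = []
    with open(in_path, encoding="utf-8") as fin:
        for line in fin:
            parts = line.split()
            if not parts:
                continue
            word = parts[0]
            label = parts[1] if len(parts) > 1 else "O"
            tokens.append(word)
            if label in LABEL_MAP:
                tokens.append(LABEL_MAP[label])
    with open(out_path, "w", encoding="utf-8") as fout:
        fout.write(" ".join(tokens) + "\n")

if __name__ == "__main__":
    convert(sys.argv[1], sys.argv[2])
```

Usage would be something like `python convert_lrec_sketch.py train2012 ted.train.txt`, but check a few lines of the real files first, since the actual label set and column order may differ.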

chrisspen commented 4 years ago

Thanks. However, how do you use that converter.py script on those archives? Each archive contains multiple files.

For example, the LREC archive contains files dev2012, test2011, test2011asr, and train2012. I'm not sure what the difference is between test2011 and test2011asr. The readme just says it's "for ASR output", which tells us nothing. Do I need to convert all of these files?

How do I combine this with the Europarl file? There appears to be only one, europarl-v7.en, and it seems to be in a very different format from the LREC files: it contains full sentences, whereas the LREC files appear to contain pairs of tokens.

chrisspen commented 4 years ago

Never mind; I went through the scripts in ./examples and figured out how to preprocess the raw datasets.

I put the train/dev/test files for both the TED and Europarl datasets in the same directory so that data.py would include them all. Is that copacetic?
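For reference, the layout looks roughly like this (the directory and file names are just what I picked; my understanding from the README and data.py is that only the .train.txt / .dev.txt / .test.txt suffixes matter, but correct me if that's wrong):

```
data/
  ted.train.txt
  ted.dev.txt
  ted.test.txt
  europarl.train.txt
  europarl.dev.txt
  europarl.test.txt
```

and then something like `python data.py data/` to produce the processed train/dev/test data before running main.py.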

I'm now training a model with the recommended `time python main.py mymodel.pcl 256 0.02`, on a system without a GPU. Do you know roughly how long that should take?