ottokart / punctuator2

A bidirectional recurrent neural network model with attention mechanism for restoring missing punctuation in unsegmented text
http://bark.phon.ioc.ee/punctuator
MIT License

Questions / clarifications #2

Closed vince62s closed 7 years ago

vince62s commented 7 years ago

Hi Otto, I have a few questions. It's unclear to me whether the data.py script requires "pre-processed" data, e.g. with ",COMMA" annotations, or whether this is the script that will annotate the data. Case 1: if it requires annotated data, do you intend to provide the preparation script? Case 2: if it will annotate the data, what kind of data is required? Tokenized? Normalized for unusual punctuation / number formatting?

Can you give an idea of the training time, depending on the amount of data and whether a CPU or GPU is used?

Last question: if the data were properly tokenized, why does it require this specific annotation? Couldn't we use "," and "." directly?

Thanks for your insight, and great work by the way.

ottokart commented 7 years ago

Hi Vince, The data.py script requires pre-processed data (case 1). I've used this model mostly in speech recognition, and the training data was the same as what I used for language model training (the one and only difference being that the LM training data did not have punctuation tokens).
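For concreteness, a pre-processed training line in that annotation style would look something like this (the sentence is made up; the ,COMMA/.PERIOD labels are the ones used in the demo):

```
this is the first sentence .PERIOD and here ,COMMA after a short pause ,COMMA comes the second one .PERIOD
```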

I have not planned to provide the preparation scripts, since there is no single right way to do this processing that would work for all languages and datasets. There is no special requirement for number formatting (in the Europarl demo, for example, I mapped all numeric tokens to NUM). Punctuation should be normalized (no more than one punctuation symbol per inter-word slot). This model does not currently restore intra-word punctuation (e.g. the period in 19.5 or A.D.), but I believe this toolkit can be adapted to any sequence labelling task with relatively small modifications.
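A minimal sketch of that kind of preprocessing, assuming whitespace-tokenized input and the ,COMMA/.PERIOD annotation style from the demo (the regex, token names, and normalization rule here are illustrative, not the author's actual pipeline):

```python
import re

# Illustrative mapping from raw punctuation to annotation tokens;
# the actual choice of labels is up to the user (see the data.py header).
PUNCT_TOKENS = {",": ",COMMA", ".": ".PERIOD", "?": "?QUESTIONMARK"}

NUM_RE = re.compile(r"^\d+([.,]\d+)*$")  # crude numeric-token detector

def preprocess(line):
    """Map numbers to NUM, replace punctuation with annotation tokens,
    and keep at most one punctuation token per inter-word slot."""
    out = []
    for tok in line.lower().split():
        if tok in PUNCT_TOKENS:
            # Normalize: drop this symbol if the slot already has punctuation.
            if out and out[-1] in PUNCT_TOKENS.values():
                continue
            out.append(PUNCT_TOKENS[tok])
        elif NUM_RE.match(tok):
            out.append("NUM")
        else:
            out.append(tok)
    return " ".join(out)

print(preprocess("The vote , , held on 19.5 , passed ."))
# -> "the vote ,COMMA held on NUM ,COMMA passed .PERIOD"
```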

I'll add the training time estimates to the README at some point. Thanks for the suggestion! Currently I can say that the model processes about 1500 samples per second on a Tesla K20 GPU, and the best validation cost is usually reached around epoch 4-5. So, for example, a 50M-word training set would take up to 50 hours (roughly 50M / 1500 ≈ 9.3 hours per epoch, times 4-5 epochs).

The ,COMMA, .PERIOD etc. annotation is not actually necessary, as long as the data is properly tokenized and the punctuation symbols are separated from the surrounding words with whitespace ("hi , how are you ?" is also fine, for example). You can use any annotation tokens you wish, as long as you define them in the data.py script header via the PUNCTUATION_VOCABULARY, PUNCTUATION_MAPPING and EOS_TOKENS constants.
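As an illustration of that last point, if the data kept plain "," and "." tokens, the header constants might be redefined along these lines (these values are a sketch under that assumption; the shipped defaults in data.py use ,COMMA-style labels instead):

```python
# Hypothetical data.py header values for plain punctuation tokens.
SPACE = "_SPACE"  # label for an inter-word slot with no punctuation

PUNCTUATION_VOCABULARY = [SPACE, ",", ".", "?", "!", ":", ";"]

# Optional remapping of surface forms onto the labels above,
# e.g. folding semicolons into commas.
PUNCTUATION_MAPPING = {";": ","}

# Labels that terminate a sentence (used when slicing the corpus
# into training sequences).
EOS_TOKENS = {".", "?", "!"}
```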

Best, Otto

vince62s commented 7 years ago

OK, thanks. A simple tokenization should work then; I'll try it.

ottokart commented 7 years ago

There's now an example in ./example that should work well. The training set is 40M words, and training should take about 15h on a good GPU.