ottokart / punctuator2

A bidirectional recurrent neural network model with attention mechanism for restoring missing punctuation in unsegmented text
http://bark.phon.ioc.ee/punctuator
MIT License

Automatic capitalization #21

Open fatullayev opened 6 years ago

fatullayev commented 6 years ago

Hi Ottokar, I have trained the system for German language with Europarl data, and the result was great, thank you! I would like the model to predict capitalization also, what do you think of following scenario:

I would add new labels to the training data. For example, "US Präsident Trump angekündigt:" would become:

`us =ALLUPPERCASE präsident \TITLE trump \TITLE angekündigt :COLON`

And I would assign new labels to punctuation symbols that come after capitalized words, so "US, Frankreich, Deutschland" would become:

`us ,ALLUPPERCASE_COMMA frankreich ,TITLE_COMMA deutschland \TITLE`

But a data sparsity problem can arise, because we would use three different labels instead of the single ,COMMA label: ,COMMA, ,ALLUPPERCASE_COMMA, and ,TITLE_COMMA. A second approach could be to train two separate models for punctuation and capitalization, then apply punctuation prediction first and capitalization prediction afterwards. Do you think the first approach is sufficient if there's enough data?
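The first approach above could be sketched as a preprocessing step. This is a minimal sketch, not punctuator2's actual data pipeline: the label names follow the examples in this thread, and the punctuation-name mapping and fallback label "O" are assumptions.

```python
def encode(tokens):
    """Map cased, punctuated tokens to (lowercased token, combined label) pairs.

    Capitalization and trailing punctuation are merged into one label per
    token, as in the first approach proposed above.
    """
    punct_names = {",": "COMMA", ":": "COLON", ".": "PERIOD",
                   "?": "QUESTIONMARK", "!": "EXCLAMATIONMARK"}
    pairs = []
    for tok in tokens:
        # Split one trailing punctuation symbol off the word, if present.
        punct = ""
        if tok and tok[-1] in punct_names:
            tok, punct = tok[:-1], tok[-1]
        # Decide the capitalization part of the label.
        if tok.isupper() and len(tok) > 1:
            cap = "ALLUPPERCASE"
        elif tok[:1].isupper():
            cap = "TITLE"
        else:
            cap = ""
        # Combine the two parts; "O" (no label) is a hypothetical fallback name.
        label = "_".join(p for p in (cap, punct_names.get(punct, "")) if p) or "O"
        pairs.append((tok.lower(), label))
    return pairs

# encode("US, Frankreich, Deutschland".split()) yields labels
# ALLUPPERCASE_COMMA, TITLE_COMMA, TITLE for the three tokens.
```

The sparsity concern is visible directly in the label space: every capitalization class multiplies every punctuation class, so rare combinations get few training examples.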

I am also concerned about vocabulary size: do you think 100K is enough for production systems?

ottokart commented 6 years ago

Hi!

How about adding a second output layer so a joint model predicts two classes of labels? This approach has worked well for me in the past.
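The two-output-layer idea can be sketched as follows. This is a minimal NumPy illustration with hypothetical sizes, not the actual punctuator2 architecture: one shared hidden state (standing in for the BiRNN-with-attention output) feeds two independent softmax layers, one per label class.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
hidden_dim, n_punct, n_cap = 256, 5, 3  # hypothetical sizes

# Two independent output layers on top of the same hidden state.
W_punct = rng.normal(size=(hidden_dim, n_punct)) * 0.01
W_cap = rng.normal(size=(hidden_dim, n_cap)) * 0.01

# Stand-in for the shared encoder output at one time step.
h_t = rng.normal(size=(1, hidden_dim))

p_punct = softmax(h_t @ W_punct)  # distribution over punctuation labels
p_cap = softmax(h_t @ W_cap)      # distribution over capitalization labels
```

During training, the loss would be the sum of the two cross-entropies, so gradients from both tasks flow into the shared encoder; this is what makes it a joint model rather than two separate ones.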

A 100K vocabulary, with all numeric tokens mapped to a shared token (I used an arbitrary "> 40% of characters are digits" rule to decide whether a token is numeric or not), should be good enough in most cases. The demo model (http://bark.phon.ioc.ee/punctuator) has only a 62K-token vocabulary. I also experimented with a character-based component to get more meaningful representations for words outside that 100K vocabulary and saw very small gains. It would be great to know if someone has had a different experience.
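The numeric-token rule mentioned above is easy to write down directly. The 40% threshold is from the comment; the `<NUM>` placeholder name is an assumption for illustration.

```python
def is_numeric_token(token, threshold=0.4):
    """True if more than `threshold` of the token's characters are digits."""
    if not token:
        return False
    return sum(c.isdigit() for c in token) / len(token) > threshold

def map_token(token):
    # Collapse all numeric tokens onto one shared vocabulary entry.
    # "<NUM>" is a hypothetical placeholder name.
    return "<NUM>" if is_numeric_token(token) else token
```

Collapsing numbers this way frees up vocabulary slots that would otherwise be wasted on near-unique tokens like years, prices, and times.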

Best, Ottokar