ottokart / punctuator2

A bidirectional recurrent neural network model with attention mechanism for restoring missing punctuation in unsegmented text
http://bark.phon.ioc.ee/punctuator
MIT License

Automatic capitalization #21

Open fatullayev opened 6 years ago

fatullayev commented 6 years ago

Hi Ottokar, I have trained the system for German language with Europarl data, and the result was great, thank you! I would like the model to predict capitalization also, what do you think of following scenario:

I would add new labels to the training data. For example, "US Präsident Trump angekündigt:" would become:

`us =ALLUPPERCASE präsident \TITLE trump \TITLE angekündigt :COLON`

And I would assign new labels to punctuation symbols that come after capitalized words, so "US, Frankreich, Deutschland" would become:

`us ,ALLUPPERCASE_COMMA frankreich ,TITLE_COMMA deutschland \TITLE`

But a data sparsity problem can arise, because we would use three different labels instead of the single ,COMMA label: ,COMMA, ,ALLUPPERCASE_COMMA, and ,TITLE_COMMA. A second approach could be to train two separate models for punctuation and capitalization, then apply punctuation prediction first and capitalization prediction afterwards. Do you think the first approach is sufficient if there's enough data?
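The first approach above could be sketched as a preprocessing step. This is a minimal sketch, not punctuator2's actual data pipeline: the label names follow the examples in this thread, and the punctuation-name mapping and fallback label "O" are assumptions.

```python
def encode(tokens):
    """Map cased, punctuated tokens to (lowercased token, combined label) pairs.

    Capitalization and trailing punctuation are merged into one label per
    token, as in the first approach proposed above.
    """
    punct_names = {",": "COMMA", ":": "COLON", ".": "PERIOD",
                   "?": "QUESTIONMARK", "!": "EXCLAMATIONMARK"}
    pairs = []
    for tok in tokens:
        # Split one trailing punctuation symbol off the word, if present.
        punct = ""
        if tok and tok[-1] in punct_names:
            tok, punct = tok[:-1], tok[-1]
        # Decide the capitalization part of the label.
        if tok.isupper() and len(tok) > 1:
            cap = "ALLUPPERCASE"
        elif tok[:1].isupper():
            cap = "TITLE"
        else:
            cap = ""
        # Combine the two parts; "O" (no label) is a hypothetical fallback name.
        label = "_".join(p for p in (cap, punct_names.get(punct, "")) if p) or "O"
        pairs.append((tok.lower(), label))
    return pairs

# encode("US, Frankreich, Deutschland".split()) yields labels
# ALLUPPERCASE_COMMA, TITLE_COMMA, TITLE for the three tokens.
```

The sparsity concern is visible directly in the label space: every capitalization class multiplies every punctuation class, so rare combinations get few training examples.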

I am also concerned about vocabulary size: do you think 100K is enough for production systems?

ottokart commented 6 years ago

Hi!

How about adding a second output layer so a joint model predicts two classes of labels? This approach has worked well for me in the past.
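The two-output-layer idea can be sketched as follows. This is a minimal NumPy illustration with hypothetical sizes, not the actual punctuator2 architecture: one shared hidden state (standing in for the BiRNN-with-attention output) feeds two independent softmax layers, one per label class.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
hidden_dim, n_punct, n_cap = 256, 5, 3  # hypothetical sizes

# Two independent output layers on top of the same hidden state.
W_punct = rng.normal(size=(hidden_dim, n_punct)) * 0.01
W_cap = rng.normal(size=(hidden_dim, n_cap)) * 0.01

# Stand-in for the shared encoder output at one time step.
h_t = rng.normal(size=(1, hidden_dim))

p_punct = softmax(h_t @ W_punct)  # distribution over punctuation labels
p_cap = softmax(h_t @ W_cap)      # distribution over capitalization labels
```

During training, the loss would be the sum of the two cross-entropies, so gradients from both tasks flow into the shared encoder; this is what makes it a joint model rather than two separate ones.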

A 100K vocabulary, with all numeric tokens mapped to a shared token (I used an arbitrary "> 40% of characters are digits" rule to decide whether a token is numeric or not), should be good enough in most cases. The demo model (http://bark.phon.ioc.ee/punctuator) has only a 62K-token vocabulary. I also experimented with a character-based component to get more meaningful representations for words outside that 100K vocabulary and saw very small gains. It would be great to know if someone has had a different experience.
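The numeric-token rule mentioned above is easy to write down directly. The 40% threshold is from the comment; the `<NUM>` placeholder name is an assumption for illustration.

```python
def is_numeric_token(token, threshold=0.4):
    """True if more than `threshold` of the token's characters are digits."""
    if not token:
        return False
    return sum(c.isdigit() for c in token) / len(token) > threshold

def map_token(token):
    # Collapse all numeric tokens onto one shared vocabulary entry.
    # "<NUM>" is a hypothetical placeholder name.
    return "<NUM>" if is_numeric_token(token) else token
```

Collapsing numbers this way frees up vocabulary slots that would otherwise be wasted on near-unique tokens like years, prices, and times.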

Best, Ottokar