fatullayev opened this issue 6 years ago
Hi!
How about adding a second output layer so a joint model predicts two classes of labels? This approach has worked well for me in the past.
100K vocabulary, with all numeric tokens (I used an arbitrary "> 40% of characters are numbers" rule to determine whether a token is numeric) mapped to a shared placeholder token.
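The numeric-token rule above could be sketched like this (a minimal illustration; `NUM_TOKEN` and the function names are assumptions, not taken from the actual codebase):

```python
# Sketch of the ">40% of characters are digits" rule described above.
# NUM_TOKEN and the function names are illustrative assumptions.
NUM_TOKEN = "<NUM>"  # assumed shared placeholder for all numeric tokens

def is_numeric(token):
    """True if more than 40% of the token's characters are digits."""
    if not token:
        return False
    return sum(ch.isdigit() for ch in token) / len(token) > 0.4

def normalize(tokens):
    """Map numeric tokens to the shared placeholder to keep the vocabulary small."""
    return [NUM_TOKEN if is_numeric(t) else t for t in tokens]

print(normalize(["sitzung", "am", "12.03.2019", "um", "14", "uhr"]))
# → ['sitzung', 'am', '<NUM>', 'um', '<NUM>', 'uhr']
```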
Best, Ottokar
Hi Ottokar, I have trained the system for German with Europarl data, and the results were great, thank you! I would also like the model to predict capitalization. What do you think of the following scenario?
I add new capitalization labels to the training data, so that

US Präsident Trump angekündigt:

becomes

us =ALLUPPERCASE präsident \TITLE trump \TITLE angekündigt :COLON
Then I assign new labels to punctuation symbols that come after capitalized words, so that

US, Frankreich, Deutschland

becomes

us ,ALLUPPERCASE_COMMA frankreich ,TITLE_COMMA deutschland \TITLE
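A rough sketch of how training data could be relabeled under this scheme (the label names follow the examples above; the tokenization and exact encoding are assumptions):

```python
# Hypothetical converter from cased tokens to lowercase tokens plus
# capitalization labels, following the label names in the examples above.
def label_capitalization(tokens):
    out = []
    for tok in tokens:
        if len(tok) > 1 and tok.isupper():
            cap = "=ALLUPPERCASE"   # e.g. "US"
        elif tok[:1].isupper():
            cap = "\\TITLE"         # e.g. "Präsident"
        else:
            cap = None              # already lowercase, no label
        out.append(tok.lower())
        if cap:
            out.append(cap)
    return out

print(" ".join(label_capitalization(["US", "Präsident", "Trump", "angekündigt"])))
# → us =ALLUPPERCASE präsident \TITLE trump \TITLE angekündigt
```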
But a data-sparsity problem can arise, because instead of the single label ,COMMA we now use three different labels: ,COMMA, ,ALLUPPERCASE_COMMA and ,TITLE_COMMA.

The second approach would be to train two separate models, one for punctuation and one for capitalization, and then apply punctuation prediction first, followed by capitalization prediction. Do you think the first approach is sufficient if there is enough data? I am also concerned about vocabulary size: do you think 100K is enough for production systems?
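The sparsity concern can be made concrete: the joint label set is the cross product of the punctuation labels and the capitalization states, so the training examples for each punctuation label get split across its capitalization variants. A small sketch (label names follow the examples above; the exact inventories are assumptions):

```python
from itertools import product

# Assumed inventories; a real system may use more punctuation labels.
PUNCT = [("", "O"), (",", "COMMA"), (".", "PERIOD"), (":", "COLON")]
CAPS = ["", "TITLE", "ALLUPPERCASE"]

def joint_label(symbol, punct_name, cap):
    # e.g. (",", "COMMA") with cap "ALLUPPERCASE" -> ",ALLUPPERCASE_COMMA"
    return symbol + (f"{cap}_{punct_name}" if cap else punct_name)

joint = [joint_label(s, p, c) for (s, p), c in product(PUNCT, CAPS)]
print(len(joint))  # 4 punctuation labels x 3 cap states = 12 joint labels
```

With a fixed amount of training data, each of those 12 joint labels sees roughly a third as many examples as the corresponding punctuation-only label would.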