Build a BPE dictionary for Thai and English from the OpenSubtitles v2018 dataset without text normalization. Training will use the SentencePiece trainer (Unigram model).
For example:
A word containing an uppercase character, such as "Charin", will not be normalized to its lowercase form (i.e., "charin").
Digits will not be replaced with "0". For example, "2450" should not be converted to "0000" as BPEmb does (see an example of Thai BPE by BPEmb).
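
A minimal sketch of how the requirements above might be expressed, assuming the SentencePiece Python wrapper is used. The corpus path, model prefix, and vocabulary size are hypothetical placeholders, not values from this task. The helper `lowercase_and_zero_digits` is also hypothetical and only illustrates the kind of normalization that should be avoided. Setting `normalization_rule_name` to `identity` asks SentencePiece to leave the input text unnormalized.

```python
import re


def lowercase_and_zero_digits(text):
    """Hypothetical helper showing the normalization this task must NOT apply:
    lowercasing plus replacing every digit with "0"."""
    return re.sub(r"\d", "0", text.lower())


def build_trainer_args(corpus_path, model_prefix, vocab_size):
    """Collect SentencePiece trainer options for training without normalization.

    All three arguments are placeholders chosen by the caller.
    """
    return {
        "input": corpus_path,
        "model_prefix": model_prefix,
        "vocab_size": vocab_size,
        "model_type": "unigram",                # Unigram model, as stated in the task
        "normalization_rule_name": "identity",  # keep case and digits untouched
    }


def train(corpus_path, model_prefix, vocab_size):
    """Run the trainer; requires the `sentencepiece` package to be installed."""
    import sentencepiece as spm  # imported lazily so the helpers above work without it

    spm.SentencePieceTrainer.train(
        **build_trainer_args(corpus_path, model_prefix, vocab_size)
    )


if __name__ == "__main__":
    # What the avoided normalization would do to the examples from the task.
    print(lowercase_and_zero_digits("Charin 2450"))
    # Hypothetical file name and vocabulary size, for illustration only.
    print(build_trainer_args("opensubtitles_th_en.txt", "th_en_raw", 25000))
```

Calling `train(...)` with a real corpus file would write `<model_prefix>.model` and `<model_prefix>.vocab`; the vocabulary should then keep "Charin" and "2450" distinct from "charin" and "0000".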
Due date:
Results:
Explanation:
The token "▁You" means "You" placed at the start of a sentence. On the other hand, the token "▁you" means "you" preceded by a space.
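
The following toy sketch illustrates where the "▁" marker comes from. It assumes SentencePiece's default behavior of replacing each space with "▁" (U+2581) and adding one at the start of the input. The function names are hypothetical, and the split is word-level only; a trained model would break the text into subword pieces instead.

```python
WORD_BOUNDARY = "\u2581"  # the "▁" character SentencePiece uses for whitespace


def escape_whitespace(text, add_dummy_prefix=True):
    """Replace spaces with the marker and optionally prepend one at the start."""
    escaped = text.replace(" ", WORD_BOUNDARY)
    return WORD_BOUNDARY + escaped if add_dummy_prefix else escaped


def split_on_marker(escaped):
    """Toy word-level split: start a new piece at every marker."""
    pieces = []
    for ch in escaped:
        if ch == WORD_BOUNDARY or not pieces:
            pieces.append(ch)
        else:
            pieces[-1] += ch
    return pieces


if __name__ == "__main__":
    escaped = escape_whitespace("You know you")
    print(escaped)
    print(split_on_marker(escaped))
```

Because no lowercasing is applied, the sentence-initial "▁You" and the mid-sentence "▁you" remain two separate vocabulary entries.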