mozilla / firefox-translations-training

Training pipelines for Firefox Translations neural machine translation models
https://mozilla.github.io/firefox-translations-training/
Mozilla Public License 2.0

Investigate issues with SentencePiece vocabulary for CJK #745

Open eu9ene opened 1 month ago

eu9ene commented 1 month ago

See comments from Jaume:

https://github.com/mozilla/firefox-translations-training/issues/45#issuecomment-1036191497

https://github.com/mozilla/firefox-translations-training/issues/45#issuecomment-1036198055

eu9ene commented 1 month ago

Also adjust character coverage here:

https://github.com/mozilla/firefox-translations-training/blob/2027f4e99b78d45ce73e44ed454c8527e03718f7/pipeline/train/spm-vocab.sh#L86

The recommended character coverage for CJK is 0.9995.
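A rough sketch of how the coverage could be chosen per language pair. The `CJK_LANGS` set and the helper names are made up for illustration and are not taken from `spm-vocab.sh`; the `spm_train` flags used (`--input`, `--model_prefix`, `--vocab_size`, `--character_coverage`) are the standard SentencePiece ones. It assumes 1.0 stays the value for non-CJK pairs.

```python
from typing import List

# Hypothetical set of language codes that get the reduced coverage.
CJK_LANGS = {"zh", "ja", "ko"}


def character_coverage(lang: str) -> float:
    """Return 0.9995 for CJK languages and 1.0 otherwise."""
    return 0.9995 if lang in CJK_LANGS else 1.0


def spm_train_args(corpus: str, model_prefix: str, vocab_size: int,
                   src: str, trg: str) -> List[str]:
    """Build an spm_train command line for a joint source/target vocabulary."""
    # With a shared vocabulary, use the lower coverage if either side is CJK.
    coverage = min(character_coverage(src), character_coverage(trg))
    return [
        "spm_train",
        f"--input={corpus}",
        f"--model_prefix={model_prefix}",
        f"--vocab_size={vocab_size}",
        f"--character_coverage={coverage}",
    ]
```

For example, `spm_train_args("corpus.txt", "vocab", 32000, "en", "ja")` ends with `--character_coverage=0.9995`, while an `en`/`de` pair keeps `1.0`.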

gregtatum commented 1 month ago

As part of this work, we should make sure we use the "normalization tables" in SentencePiece. These can augment the default Unicode normalization: a table maps code points to other code points before the text is tokenized. That way we don't need Gecko-side mitigations in the future, like the one I did for soft hyphens.

I'm not aware of specific cases that need this yet, but we should add it to our list of things to check.
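For reference, a minimal sketch of what generating such a table might look like. SentencePiece reads custom rules from a TSV file passed via `--normalization_rule_tsv`, where each row is the source code points in hex, a tab, then the target code points in hex. The two rules below are hypothetical placeholders to show the format, not mappings we have decided on. My understanding is that a custom TSV is used in place of the built-in rule set rather than merged with it, so rows like these would probably need to be appended to a copy of the default table; that should be verified.

```python
from typing import Dict


def to_hex(text: str) -> str:
    """Encode a string as space-separated uppercase hex code points."""
    return " ".join(f"{ord(ch):X}" for ch in text)


def build_rule_tsv(rules: Dict[str, str]) -> str:
    """Render source -> target mappings as normalization rule TSV rows."""
    return "".join(f"{to_hex(src)}\t{to_hex(tgt)}\n" for src, tgt in rules.items())


# Hypothetical rules, for illustration only.
RULES = {
    "\uFF5E": "~",  # FULLWIDTH TILDE -> TILDE
    "\u301C": "~",  # WAVE DASH -> TILDE
}

if __name__ == "__main__":
    # Write the rows to a file that could be passed to spm_train.
    with open("custom_rules.tsv", "w", encoding="utf-8") as f:
        f.write(build_rule_tsv(RULES))
```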