Open eu9ene opened 1 month ago
Also adjust character coverage here:
For CJK languages, the recommended character coverage is 0.9995.
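A rough sketch of how the coverage could be picked per language pair. The helper names, the language set, and the 1.0 default for non-CJK languages are assumptions for illustration, not what the pipeline does today. The resulting dict uses real SentencePiece trainer option names and could be passed to `sentencepiece.SentencePieceTrainer.train(**kwargs)`.

```python
# Hypothetical helpers: names and the language set are assumptions.
# "ko" is included here as part of CJK; adjust if that does not apply.
CJK_LANGS = {"zh", "ja", "ko"}


def character_coverage(lang: str) -> float:
    """Return 0.9995 for CJK languages, 1.0 otherwise (assumed default)."""
    return 0.9995 if lang in CJK_LANGS else 1.0


def trainer_kwargs(corpus: str, model_prefix: str, src: str, trg: str,
                   vocab_size: int = 32000) -> dict:
    """Build SentencePiece trainer options for a joint src/trg vocabulary."""
    # With a shared vocabulary, use the lower coverage if either side is CJK.
    coverage = min(character_coverage(src), character_coverage(trg))
    return {
        "input": corpus,
        "model_prefix": model_prefix,
        "vocab_size": vocab_size,
        "character_coverage": coverage,
    }


kwargs = trainer_kwargs("corpus.txt", "vocab", "en", "ja")
print(kwargs["character_coverage"])
```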
As part of this work, we should make sure we use the "normalization tables" in SentencePiece. These can augment the default Unicode normalization, so we won't need Gecko-side mitigations in the future like the one I did for soft hyphens. The tables map codepoints to other codepoints before the text is tokenized.
I'm not aware of any specific cases that need this yet, but we should add it to our list of things to check.
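For reference, a minimal sketch of generating such a table. SentencePiece reads custom rules from a TSV file (trainer option `normalization_rule_tsv`) where each line is the source codepoints in hex, a tab, then the target codepoints in hex. The helper names and the example rules below are hypothetical. Two things to verify before relying on this: that an empty target is treated as a deletion rule, and whether a custom TSV replaces the built-in rule set rather than adding to it (in which case we would append our rules to a copy of the default table).

```python
# Hypothetical helpers for writing a SentencePiece normalization rule TSV.
# Format assumed: "<src hex codepoints>\t<tgt hex codepoints>", space-separated.

def rule_line(source: str, target: str) -> str:
    """Encode one mapping as a TSV line of hex codepoints."""
    src = " ".join(f"{ord(ch):X}" for ch in source)
    tgt = " ".join(f"{ord(ch):X}" for ch in target)
    return f"{src}\t{tgt}"


def build_rules(mapping: dict) -> str:
    """Join all mappings into the text of a rule file."""
    return "\n".join(rule_line(s, t) for s, t in mapping.items()) + "\n"


# Example rules (assumptions, for illustration only).
EXTRA_RULES = {
    "\u00AD": "",    # soft hyphen -> removed (assumes empty target = deletion)
    "\uFF21": "A",   # fullwidth A -> ASCII A
}

tsv_text = build_rules(EXTRA_RULES)
print(tsv_text)
```

The resulting text could be written to a file and its path passed as `normalization_rule_tsv` when training the vocabulary.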
See comments from Jaume:
https://github.com/mozilla/firefox-translations-training/issues/45#issuecomment-1036191497
https://github.com/mozilla/firefox-translations-training/issues/45#issuecomment-1036198055