eu9ene opened 3 months ago
Also adjust character coverage here:
CJK is recommended to have 0.9995
With this work we should make sure we utilize the "normalization tables" in SentencePiece. These can augment the default Unicode normalization, so we don't have to do Gecko mitigations in the future, like I did with the soft hyphens. These tables take codepoints and map them to other codepoints before tokenization.
I'm not aware of specific things here, but we should add it to our list to check.
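As a rough sketch of what this could look like (the corpus path, vocab size, and the soft-hyphen rule are placeholders, not anything we've decided on): SentencePiece accepts a custom normalization table as a TSV of codepoint mappings at training time, where each line maps space-separated source codepoints in hex to target codepoints, and an empty target deletes the character.

```python
import sentencepiece as spm

# Hypothetical rule: strip soft hyphens (U+00AD) before tokenization,
# so we wouldn't need a separate Gecko-side mitigation.
with open("custom_rules.tsv", "w", encoding="utf-8") as f:
    f.write("AD\t\n")  # source codepoint(s) <TAB> target codepoint(s); empty target = delete

# Placeholder training call; in practice the custom TSV is usually built on top of
# the nfkc.tsv rules shipped with SentencePiece rather than written from scratch.
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="vocab",
    vocab_size=32000,
    normalization_rule_tsv="custom_rules.tsv",
)
```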
Regarding those old comments of mine, I think byte fallback solves most of those issues.
@ZJaume I couldn't tell from your comment whether you agree with the character coverage part. Could you clarify this comment in OpusTrainer:
CJK is recommended to have 0.9995
They're also using --byte_fallback.
This recommendation is also present in the SentencePiece readme.
So, I don't completely understand how character coverage works and how it interacts with byte_fallback. Do we still need 0.9995 if we use byte_fallback?
From what I understand of how SentencePiece works, increasing the character coverage, or setting it to 1.0, just forces the SentencePiece model to include all the characters in the vocab before learning any pieces. That's probably why 1.0 is not recommended for CJK: those languages have many more characters to cover, so you may end up with many more forced characters as pieces and fewer pieces learnt from the training corpus.
My take here is that the default coverage (I think it is 0.9995) with byte fallback enabled should be enough for both CJK and non-CJK languages, because all the characters that are not included in the vocab, either because they fall in the 0.0005 that is left out or because they are not in the SP training sample, will be handled by the byte fallback pieces. Furthermore, we also need a reasonable number of cases where a sentence is tokenized using byte fallback tokens, because all the byte fallback token embeddings need to be trained. Otherwise, if a sentence at inference time contains a character tokenized into poorly trained byte fallback pieces, the model will likely hallucinate and produce garbage output.
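To illustrate the interaction (paths, vocab size, and the example character are placeholders, and the exact coverage value is whatever the trainer defaults to): with byte_fallback enabled, a character that didn't make it into the vocab is split into its UTF-8 byte pieces instead of becoming an unknown token, which is why coverage doesn't need to be pushed up just to avoid <unk>.

```python
import sentencepiece as spm

# Train with byte fallback enabled; character_coverage is left at the library default,
# so rare characters may fall outside the learned vocab.
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="vocab",
    vocab_size=32000,
    byte_fallback=True,  # out-of-vocab characters become <0xNN> byte pieces, not <unk>
)

sp = spm.SentencePieceProcessor(model_file="vocab.model")
# If '☃' was not covered, it is encoded as its UTF-8 bytes,
# something like ['▁', '<0xE2>', '<0x98>', '<0x83>'].
print(sp.encode("☃", out_type=str))
```

This also shows why the byte piece embeddings need enough training examples: every rare character at inference time turns into sequences drawn from the same 256 byte pieces.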
See comments from Jaume:
https://github.com/mozilla/firefox-translations-training/issues/45#issuecomment-1036191497 https://github.com/mozilla/firefox-translations-training/issues/45#issuecomment-1036198055