eu9ene opened 3 months ago
Also adjust character coverage here:
CJK is recommended to have 0.9995
With this work we should make sure we utilize the "normalization tables" in SentencePiece. These can augment the default Unicode normalization, so we don't have to do Gecko mitigations in the future, like I did with the soft hyphens. These tables take codepoints and map them to other codepoints before tokenization.
I'm not aware of specific things here, but we should add it to our list to check.
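As a rough sketch of what this could look like (the corpus path, vocab size, and the soft-hyphen rule are placeholders, not anything we've decided on): SentencePiece accepts a custom normalization table as a TSV of codepoint mappings at training time, where each line maps space-separated source codepoints in hex to target codepoints, and an empty target deletes the character.

```python
import sentencepiece as spm

# Hypothetical rule: strip soft hyphens (U+00AD) before tokenization,
# so we wouldn't need a separate Gecko-side mitigation.
with open("custom_rules.tsv", "w", encoding="utf-8") as f:
    f.write("AD\t\n")  # source codepoint(s) <TAB> target codepoint(s); empty target = delete

# Placeholder training call; in practice the custom TSV is usually built on top of
# the nfkc.tsv rules shipped with SentencePiece rather than written from scratch.
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="vocab",
    vocab_size=32000,
    normalization_rule_tsv="custom_rules.tsv",
)
```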
Regarding those old comments of mine, I think byte fallback solves most of those issues.
@ZJaume I couldn't tell from your comment whether you agree with the character coverage part. Could you clarify this comment in OpusTrainer:
CJK is recommended to have 0.9995
They're also using --byte_fallback.
This recommendation is also present in the SentencePiece readme.
So, I don't completely understand how character coverage works and how it interacts with byte_fallback. Do we still need 0.9995 if we use byte_fallback?
From what I understand of how SentencePiece works, increasing the character coverage, or setting it to 1.0, just forces the SentencePiece model to include all the characters in the vocab before learning any pieces. That's probably why 1.0 is not recommended for CJK: those languages have many more characters to cover, so you may end up with many more forced characters as pieces and fewer pieces learnt from the training corpus.
My take here is that the default coverage (I think it is 0.9995) with byte fallback enabled should be enough for both CJK and non-CJK languages, because all the characters that are not included in the vocab, either because they fall in the 0.0005 that is left out or because they are not in the SP training sample, will be handled by the byte fallback pieces. Furthermore, we also need a reasonable number of cases where a sentence is tokenized using byte fallback tokens, because all the byte fallback token embeddings need to be trained. Otherwise, if a sentence at inference time contains a character tokenized into poorly trained byte fallback pieces, the model will likely hallucinate and produce garbage output.
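To illustrate the interaction (paths, vocab size, and the example character are placeholders, and the exact coverage value is whatever the trainer defaults to): with byte_fallback enabled, a character that didn't make it into the vocab is split into its UTF-8 byte pieces instead of becoming an unknown token, which is why coverage doesn't need to be pushed up just to avoid <unk>.

```python
import sentencepiece as spm

# Train with byte fallback enabled; character_coverage is left at the library default,
# so rare characters may fall outside the learned vocab.
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="vocab",
    vocab_size=32000,
    byte_fallback=True,  # out-of-vocab characters become <0xNN> byte pieces, not <unk>
)

sp = spm.SentencePieceProcessor(model_file="vocab.model")
# If '☃' was not covered, it is encoded as its UTF-8 bytes,
# something like ['▁', '<0xE2>', '<0x98>', '<0x83>'].
print(sp.encode("☃", out_type=str))
```

This also shows why the byte piece embeddings need enough training examples: every rare character at inference time turns into sequences drawn from the same 256 byte pieces.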
See comments from Jaume:
https://github.com/mozilla/firefox-translations-training/issues/45#issuecomment-1036191497 https://github.com/mozilla/firefox-translations-training/issues/45#issuecomment-1036198055