Open ZJaume opened 1 month ago
Sentence/paragraph separator
This could be really nice when working on the inference engine. @nordzilla and I were looking at the strategy for how we chunk up a page for translation, and I think we would benefit from sending in larger chunks of text for translation at the same time so that they have more context on what's happening on a page. After the translation you would need to retain these separators to reconstruct the DOM.
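As a rough sketch of what retaining the separators could look like on the inference side: the snippet below joins the text of several nodes with a separator token and splits the translation back afterwards. The `__sep__` token name follows the suggestion in this issue; the helper names and the fallback to chunk-by-chunk translation are assumptions for illustration, not how the inference engine currently works.

```python
SEP = "__sep__"  # separator token; assumed to exist in the model vocabulary


def join_chunks(chunks):
    """Join the text of several DOM nodes into one translation request."""
    return f" {SEP} ".join(chunk.strip() for chunk in chunks)


def split_chunks(translated, expected):
    """Split a translated string back into per-node chunks.

    Returns None when the number of pieces does not match the number of
    nodes, so the caller can fall back to translating each chunk alone.
    """
    parts = [part.strip() for part in translated.split(SEP)]
    return parts if len(parts) == expected else None


def translate_nodes(chunks, translate):
    """Translate chunks together, falling back to one-by-one translation."""
    parts = split_chunks(translate(join_chunks(chunks)), len(chunks))
    if parts is None:
        # The model dropped or duplicated a separator: translate separately.
        parts = [translate(chunk) for chunk in chunks]
    return parts
```

Here `translate` is any callable that maps a source string to a target string. The fallback matters because nothing guarantees the model copies every separator to the output.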
Split vocabularies
Languages that do not share a script (or even languages with the same script that are very distant) will benefit from separate vocabularies. Maybe using 64k, as mentioned in https://github.com/mozilla/firefox-translations-training/issues/747, has the same effect, but I have not experimented with that.
We've been struggling with Baltic and Slavic languages. I wonder whether using a shared vocab for the languages in Cyrillic is at play here.
Most LLM vocabs use BPE, and I remember that back when SentencePiece was becoming established, some papers argued that SP was worse than BPE for NMT. So an experiment with `spm_train --model_type bpe` is probably worth considering.
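A minimal sketch of how such a comparison could be set up with the SentencePiece Python bindings. The corpus path, the `vocab.<type>` prefixes and the 32k vocabulary size are placeholders, not values from the pipeline.

```python
def experiment_configs(corpus, vocab_size=32000):
    """Build one SentencePiece training config per model type."""
    return [
        {
            "input": corpus,
            "model_prefix": f"vocab.{model_type}",  # placeholder output name
            "vocab_size": vocab_size,
            "model_type": model_type,  # "unigram" is the SentencePiece default
        }
        for model_type in ("unigram", "bpe")
    ]


def train_all(configs):
    """Train every config; requires the `sentencepiece` package."""
    import sentencepiece as spm  # lazy import: configs can be built without it

    for config in configs:
        spm.SentencePieceTrainer.train(**config)
```

Calling `train_all(experiment_configs("corpus.txt"))` would produce two vocabularies that differ only in the segmentation algorithm, so any downstream quality difference can be attributed to it.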
I was going to comment on #745, but I think this translates to a more general discussion about vocabulary building, although I don't know if this would be considered a meta issue.
Character coverage
I don't think there is a need to force 100% coverage when training SentencePiece (see https://github.com/mozilla/firefox-translations-training/blob/2027f4e99b78d45ce73e44ed454c8527e03718f7/pipeline/train/spm-vocab.sh#L86). In fact, when byte fallback is enabled, the default character coverage should work better because it increases the number of training instances that use the byte fallback tokens. That decreases the chances of a byte fallback token being poorly trained, and of the model hallucinating when that token appears in the input during inference. Also related to coverage is the size of the training data for the vocabulary. It doesn't need to be very large to cover most of the MT model training data. I think it only needs to be a representative sample.
So, for character coverage I think the default option is enough, and for the training size 1 or 2 million sentence pairs (a random sample) should be enough. That way we increase the chances of having strong byte fallback training.
I think this applies to all languages, including CJK.
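To make the coverage argument concrete, here is a small sketch that approximates how a coverage threshold divides characters into those that get their own vocabulary entry and those left to byte fallback. It is a simplified imitation for illustration, not SentencePiece's actual selection code; 0.9995 is the library's default `character_coverage`, and the function names are made up.

```python
from collections import Counter


def split_by_coverage(text, coverage=0.9995):
    """Return (kept, rare) character sets for a given coverage threshold.

    Characters are taken from most to least frequent until the requested
    share of all character occurrences is covered. Whitespace is ignored
    here for simplicity.
    """
    counts = Counter(ch for ch in text if not ch.isspace())
    total = sum(counts.values())
    kept, covered = set(), 0
    for ch, n in counts.most_common():
        if covered >= coverage * total:
            break
        kept.add(ch)
        covered += n
    return kept, set(counts) - kept


def byte_fallback_pieces(ch):
    """Byte-level pieces a character outside the vocabulary decomposes into."""
    return [f"<0x{b:02X}>" for b in ch.encode("utf-8")]
```

With 100% coverage every character seen in the sample gets its own entry, so the byte tokens almost never occur in training. With the default, the rare tail is encoded through them, which is what gives those tokens training signal.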
Numbers
I would recommend using the `split_digits` option to free up all those vocabulary slots occupied by numbers that may only be common in the training set. I've been using this lately with good results.
User-defined tokens
Misc
It might be useful to add a few more auxiliary user-defined tokens like `__misc1__`, `__misc2__`, etc., just in case there's a future need to implement new logic that requires special tokens, so there's no need to retrain the whole model. Just use the auxiliary tokens in a fine-tuning manner.
Sentence/paragraph separator
Also add a token like `__sep__` or something similar, so that if we want to explore paragraph-level or document-level translation in the future, we can encode the newlines.
Backtranslation tagging
I've been using a BT special token to tag backtranslated data, but I have my doubts about whether this is useful or not.
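The suggestions on numbers and user-defined tokens above could be combined roughly as follows. This is a sketch under assumptions: `__bt__` is a hypothetical name for the BT token, the vocabulary size and file names are placeholders, and the two helper functions only illustrate how the tokens might be used in the data.

```python
# "__misc1__", "__misc2__" and "__sep__" come from this issue;
# "__bt__" is a hypothetical name for the backtranslation tag.
AUX_TOKENS = ["__misc1__", "__misc2__", "__sep__", "__bt__"]


def vocab_options(corpus, model_prefix="vocab", vocab_size=32000):
    """Keyword arguments for sentencepiece.SentencePieceTrainer.train."""
    return {
        "input": corpus,
        "model_prefix": model_prefix,
        "vocab_size": vocab_size,
        "split_digits": True,  # every digit becomes its own piece
        "user_defined_symbols": AUX_TOKENS,  # reserved, never split
    }


def encode_newlines(paragraphs):
    """Replace line breaks between non-empty lines with the separator token."""
    lines = [line.strip() for line in paragraphs.splitlines() if line.strip()]
    return " __sep__ ".join(lines)


def tag_backtranslated(source):
    """Prefix a backtranslated source sentence with the BT token."""
    return f"__bt__ {source}"
```

Training would then be `spm.SentencePieceTrainer.train(**vocab_options("corpus.txt"))` with the `sentencepiece` package installed. Reserving the tokens up front costs only a few vocabulary slots.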
Split vocabularies
Languages that do not share a script (or even languages with the same script that are very distant) will benefit from separate vocabularies. Maybe using 64k, as mentioned in #747, has the same effect, but I have not experimented with that.
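For illustration, the difference between the two setups comes down to how many SentencePiece models are trained and on which data. The file naming (`corpus.<lang>`, `vocab.<lang>`) and the default size below are assumptions, not the pipeline's conventions.

```python
def vocab_plan(src, trg, split, vocab_size=32000):
    """Return the SentencePiece training configs for a language pair.

    split=True  -> one vocabulary per language, each trained on its own side.
    split=False -> a single shared vocabulary trained on both sides.
    """
    if split:
        return [
            {
                "input": f"corpus.{lang}",
                "model_prefix": f"vocab.{lang}",
                "vocab_size": vocab_size,
            }
            for lang in (src, trg)
        ]
    return [
        {
            "input": [f"corpus.{src}", f"corpus.{trg}"],
            "model_prefix": f"vocab.{src}{trg}",
            "vocab_size": vocab_size,
        }
    ]
```

Each returned dictionary can be passed as keyword arguments to `sentencepiece.SentencePieceTrainer.train`. A larger shared vocabulary (for example `vocab_plan("en", "ru", split=False, vocab_size=64000)`) would be the alternative to compare against.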
NFKC Normalization
There's one thing that's been annoying me a lot, especially when dealing with technical in-domain data, which is superscripts and subscripts being normalized away (for example, `²` becoming `2`).
So, I have a custom normalization file, built from the original, that omits this kind of mapping.
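The effect can be reproduced with Python's standard `unicodedata` module. The sketch below is not the custom normalization file mentioned above. It only illustrates the idea of applying NFKC everywhere except to characters whose compatibility decomposition is tagged `<super>` or `<sub>`. In SentencePiece itself, custom rules are loaded through the `normalization_rule_tsv` training option.

```python
import unicodedata
from itertools import groupby


def is_script_variant(ch):
    """True for characters with a <super> or <sub> compatibility mapping."""
    return unicodedata.decomposition(ch).startswith(("<super>", "<sub>"))


def nfkc_keep_scripts(text):
    """Apply NFKC to the text but leave superscripts and subscripts intact."""
    return "".join(
        "".join(run) if protected
        else unicodedata.normalize("NFKC", "".join(run))
        for protected, run in groupby(text, key=is_script_variant)
    )
```

Plain `unicodedata.normalize("NFKC", "x² H₂O")` flattens the string to `x2 H2O`, whereas `nfkc_keep_scripts` leaves it unchanged while still normalizing ligatures and full-width forms.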