Vocabulary construction

ZJaume commented 1 week ago

I was going to comment on #745, but I think this turns into a more general discussion about vocabulary building, although I don't know if this would be considered a meta issue.

Character coverage

I don't think there is any need to force 100% character coverage when training SentencePiece: https://github.com/mozilla/firefox-translations-training/blob/2027f4e99b78d45ce73e44ed454c8527e03718f7/pipeline/train/spm-vocab.sh#L86 In fact, when byte fallback is enabled, the default character coverage should work better, because it increases the number of training instances that use the byte fallback tokens. That lowers the chance of a byte fallback token being poorly trained, and of the model hallucinating when that token shows up in the input at inference time. Related to coverage is the size of the training data for the vocabulary. It doesn't need to be very large to cover most of the MT model training data; I think it only needs to be a representative sample.

So, for character coverage I think the default option is enough, and for the training size 1 or 2 million sentence pairs (a random sample) should be enough. That way we increase the chances of the byte fallback tokens being well trained.

I think this applies to all languages, including CJK.
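A minimal sketch of the settings suggested above, assuming the vocabulary is trained with the spm_train command-line tool. The helper name, file paths, and vocabulary size are placeholders, not values taken from the pipeline. The flags are standard SentencePiece options, and leaving out --character_coverage keeps the library default (0.9995).

```python
def build_spm_train_args(corpus_path, model_prefix, vocab_size=32000,
                         sample_size=2_000_000):
    """Return an spm_train command line as a list (hypothetical helper).

    sample_size counts lines of the input, so for a corpus that concatenates
    source and target, 1M sentence pairs correspond to 2M lines.
    """
    return [
        "spm_train",
        f"--input={corpus_path}",
        f"--model_prefix={model_prefix}",
        f"--vocab_size={vocab_size}",
        # No --character_coverage flag: the SentencePiece default is used.
        "--byte_fallback=true",
        # Train on a random sample instead of the full corpus.
        f"--input_sentence_size={sample_size}",
        "--shuffle_input_sentence=true",
    ]


args = build_spm_train_args("corpus.txt", "vocab")
print(" ".join(args))
```

The resulting list can be passed to subprocess.run, or the same options can be given as keyword arguments to the SentencePiece Python trainer.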

Numbers

I would recommend using the split_digits option to free up all those vocabulary slots occupied by numbers that may only be common in the training set. I've been using this lately with good results.
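To illustrate the effect, here is a rough preview of what splitting digits does to the input before pieces are learned. This is not SentencePiece's implementation, only a stand-in with a hypothetical function name: once every digit stands alone, multi-digit numbers such as years can no longer take up their own vocabulary slots.

```python
import re


def split_digits_preview(text):
    """Split text so that every digit becomes its own segment."""
    # \d matches a single digit, \D+ matches a run of non-digit characters.
    return re.findall(r"\d|\D+", text)


print(split_digits_preview("In 2024"))  # ['In ', '2', '0', '2', '4']
```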

User-defined tokens

Misc

It might be useful to add a few more auxiliary user-defined tokens like __misc1__, __misc2__, etc., in case we later need to implement new logic that requires special tokens. That way there is no need to retrain the whole model; the auxiliary tokens can just be put to use through fine-tuning.
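A small sketch of how such reserved tokens could be generated and passed to spm_train, assuming the standard --user_defined_symbols option (a comma-separated list). The helper names and the number of reserved tokens are illustrative choices.

```python
def reserved_misc_symbols(count=3):
    """Build placeholder tokens __misc1__ ... __miscN__ (count is arbitrary)."""
    return [f"__misc{i}__" for i in range(1, count + 1)]


def user_defined_symbols_flag(symbols):
    """Format the symbols as a single spm_train flag."""
    return "--user_defined_symbols=" + ",".join(symbols)


print(user_defined_symbols_flag(reserved_misc_symbols()))
```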

Sentence/paragraph separator

Also add a token like __sep__ or something similar, so that if we want to explore paragraph-level or document-level translation in the future, we can encode the newlines.
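One way the newline encoding could look, as a sketch only. It assumes the token is literally named __sep__ and has been registered as a user-defined symbol when training the vocabulary; the function names are made up for the example. Whitespace around each line is stripped, so the round trip is only exact for text without leading or trailing spaces on its lines.

```python
SEP = "__sep__"  # assumed token name


def encode_newlines(text):
    """Replace newlines with the separator token so a paragraph fits on one line."""
    return f" {SEP} ".join(line.strip() for line in text.split("\n"))


def decode_newlines(text):
    """Turn separator tokens back into newlines."""
    return "\n".join(part.strip() for part in text.split(SEP))


encoded = encode_newlines("First line.\nSecond line.")
print(encoded)  # First line. __sep__ Second line.
```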

Backtranslation tagging

I've been using a BT special token to tag backtranslated data, but I have my doubts about whether this is useful.
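For reference, the tagging amounts to something like the sketch below. The token string __bt__ and the function name are assumptions for illustration, and the tag would need to be a user-defined symbol so it is never split into pieces.

```python
BT_TAG = "__bt__"  # hypothetical token string


def tag_source(source, is_backtranslated):
    """Prefix the source side of a pair with the BT tag when it is synthetic."""
    return f"{BT_TAG} {source}" if is_backtranslated else source


print(tag_source("Hola mundo", True))   # __bt__ Hola mundo
print(tag_source("Hola mundo", False))  # Hola mundo
```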

Split vocabularies

Languages that do not share a script (or even languages with the same script that are very distant) will benefit from separate vocabularies. Maybe using 64k, as mentioned in #747, has the same effect, but I have not experimented with that.

NFKC Normalization

One thing that's been annoying me a lot, especially with technical in-domain data, is that superscripts and subscripts get normalized:

...
2074    34  # ⁴ => 4
2075    35  # ⁵ => 5
2076    36  # ⁶ => 6
2077    37  # ⁷ => 7
2078    38  # ⁸ => 8
2079    39  # ⁹ => 9
207A    2B  # ⁺ => +
207B    2212    # ⁻ => −
207C    3D  # ⁼ => =
207C 338    2260    # ⁼̸ => ≠
207D    28  # ⁽ => (
207E    29  # ⁾ => )
207F    6E  # ⁿ => n
...

So I use a custom normalization file, built from the original, that omits these rules.
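A sketch of how such a file could be derived, not the exact script I use. It assumes the rule file is tab-separated with the source code points (hex, space-separated) in the first field, as in the excerpt above, and that the rules to drop are those whose source contains a character from the Unicode Superscripts and Subscripts block (U+2070 to U+209F) or one of the Latin-1 superscripts ¹ ² ³. The function names are illustrative.

```python
SUPER_SUB_BLOCK = range(0x2070, 0x20A0)         # Superscripts and Subscripts block
LATIN1_SUPERSCRIPTS = {0x00B2, 0x00B3, 0x00B9}  # ² ³ ¹


def keeps_rule(line):
    """Return False for rules whose source contains a super/subscript."""
    if line.startswith("#"):
        return True  # keep comment lines untouched
    source_field = line.split("\t", 1)[0]
    codepoints = [int(cp, 16) for cp in source_field.split()]
    return not any(
        cp in SUPER_SUB_BLOCK or cp in LATIN1_SUPERSCRIPTS for cp in codepoints
    )


def filter_rules(lines):
    """Keep only the normalization rules that leave super/subscripts alone."""
    return [line for line in lines if keeps_rule(line)]


rules = [
    "2074\t34\t# superscript four",
    "207C 338\t2260\t# superscript equals + combining slash",
    "FF21\t41\t# fullwidth A",
]
print(filter_rules(rules))  # only the fullwidth A rule remains
```

The filtered file can then be given to spm_train through the --normalization_rule_tsv option.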

gregtatum commented 1 week ago

Sentence/paragraph separator

This could be really nice when working on the inference engine. @nordzilla and I were looking at the strategy for how we chunk up a page for translation, and I think we would benefit from sending in larger chunks of text for translation at the same time so that they have more context on what's happening on a page. After the translation you would need to retain these separators to reconstruct the DOM.
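Roughly, the round trip could look like the sketch below. It assumes a __sep__ token exists in the vocabulary and that the model copies it through to the output; the function names are hypothetical, and the translation step itself is left out. If the separator count does not survive translation, the caller would have to fall back to translating chunk by chunk.

```python
SEP = "__sep__"  # assumed token name


def join_chunks(chunks):
    """Combine the text of several DOM nodes into one translation input."""
    return f" {SEP} ".join(chunks)


def split_translation(translated, expected_count):
    """Recover one translated string per DOM node, or None on a mismatch."""
    parts = [part.strip() for part in translated.split(SEP)]
    if len(parts) != expected_count:
        return None  # separators were lost or duplicated
    return parts


chunks = ["Title", "First paragraph.", "Second paragraph."]
joined = join_chunks(chunks)
print(split_translation(joined, len(chunks)))
```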

eu9ene commented 1 week ago

Split vocabularies

> The languages that do not share scripts (or even languages with the same script that are very distant) will benefit from separated vocabularies. Maybe using 64k, like mentioned in https://github.com/mozilla/firefox-translations-training/issues/747, does the same effect, but have not experimented with that.

We've been struggling with Baltic and Slavic languages. I wonder whether using a shared vocab for the languages written in Cyrillic is a factor here.

ZJaume commented 1 day ago

Most LLM vocabs use BPE, and I remember that back when SentencePiece was becoming established, some papers argued that SP was worse than BPE for NMT. So an experiment with spm_train --model_type bpe is probably worth considering.