mozilla / translations

The code, training pipeline, and models that power Firefox Translations
https://mozilla.github.io/translations/

Investigate OpusTrainer compatibility for CJK #750

Open eu9ene opened 3 months ago

eu9ene commented 3 months ago

As far as I understand, some modifiers are not needed (UpperCase, TitleCase), but some can still be used:

gregtatum commented 3 months ago

I'm pretty concerned with this one, especially since OpusTrainer relies so heavily on Python's whitespace splitting. We might have to rely on a fork here if we want to use these robustness features. I could see replacing all of the instances with a more robust splitter, for instance based on the ICU segmenter, which is fully internationalized for our current language list.
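For illustration, here is a minimal sketch of what such an ICU-based splitter could look like in Python using the PyICU package (the function name and locale handling are illustrative, not code from OpusTrainer or the pipeline):

```python
# Hypothetical sketch: ICU word segmentation as a drop-in replacement for
# whitespace splitting. Requires the PyICU package (`pip install PyICU`).
from icu import BreakIterator, Locale

def icu_word_split(text: str, locale: str = "zh") -> list[str]:
    """Split text on ICU word boundaries, dropping pure-whitespace pieces."""
    bi = BreakIterator.createWordInstance(Locale(locale))
    bi.setText(text)
    tokens, start = [], 0
    for end in bi:  # iterating a PyICU BreakIterator yields boundary offsets
        piece = text[start:end]
        if piece.strip():
            tokens.append(piece)
        start = end
    return tokens

# "The weather is very nice today" -> e.g. ['今天', '天气', '很', '好']
print(icu_word_split("今天天气很好"))
```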

eu9ene commented 3 months ago

Yeah, it basically expects the text to be already "tokenized". There's support for detokenization/retokenization, so we could tokenize (segment) the Chinese text before feeding it to OpusTrainer and then detokenize it before forwarding it to Marian. Another issue is that we currently also tokenize the text with Moses when producing the alignments, so we might want to reuse that tokenization.
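As an illustration of that flow (not the pipeline's actual code), a Moses tokenize/detokenize round trip with the sacremoses package is the kind of step that would sit before OpusTrainer and before Marian; the language flag and the example sentence here are placeholders:

```python
# Illustration only: Moses tokenization before OpusTrainer and Moses
# detokenization before Marian, using the sacremoses package.
from sacremoses import MosesTokenizer, MosesDetokenizer

mt = MosesTokenizer(lang="en")
md = MosesDetokenizer(lang="en")

line = 'She said: "It works!"'
tokens = mt.tokenize(line, escape=False)  # e.g. ['She', 'said', ':', '"', 'It', 'works', '!', '"']
detok = md.detokenize(tokens)             # back to something close to the original line

# The round trip is not guaranteed to be byte-identical (whitespace and
# punctuation spacing can change), which is the information loss discussed
# later in this thread.
print(tokens)
print(detok)
```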

Overall I'm quite confident that it's all possible, because HPLT have trained Chinese models and OpusTrainer even has unit tests written for Chinese.

eu9ene commented 1 week ago

@ZJaume here's the context of our discussion with Jelmer: https://github.com/hplt-project/OpusTrainer/issues/38.

So far the decision has been that OpusTrainer supports either whitespace-tokenized alignments or Moses-tokenized ones with detok directives in the config. See this code: https://github.com/hplt-project/OpusTrainer/blob/7be3b4dcc711d740b4b1a48d91a9b0a13d4ea276/src/opustrainer/modifiers/placeholders.py#L260

We used to do Moses tokenization, then train alignments, and then remap them back to whitespace-based tokenization to be compatible with OpusTrainer. We feed those alignments and the original untokenized corpus to OpusTrainer. We also use guided alignments for the student model, based on the bergamot recipe, so for that one we specify the sentencepiece vocab in the training config and OpusTrainer then remaps the alignments to spm-based tokenization and feeds them to Marian. Kind of crazy, but it works :)
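For reference, here is a rough sketch of that remapping step (my reconstruction for illustration, not the pipeline's code; it assumes every whitespace token is a concatenation of consecutive Moses tokens, which Moses character escaping can break):

```python
# Rough reconstruction of the old remapping step, not the pipeline's actual code.
def fine_to_coarse(fine_tokens: list[str], coarse_tokens: list[str]) -> list[int]:
    """Map each fine (Moses) token index to the whitespace token containing it."""
    mapping, coarse_i, consumed = [], 0, 0
    for tok in fine_tokens:
        mapping.append(coarse_i)
        consumed += len(tok)
        if coarse_i < len(coarse_tokens) and consumed >= len(coarse_tokens[coarse_i]):
            coarse_i += 1
            consumed = 0
    return mapping

def remap_alignment(pairs, src_moses, src_ws, trg_moses, trg_ws):
    """Rewrite (src, trg) Moses-token alignment pairs as whitespace-token pairs."""
    src_map = fine_to_coarse(src_moses, src_ws)
    trg_map = fine_to_coarse(trg_moses, trg_ws)
    return sorted({(src_map[i], trg_map[j]) for i, j in pairs})

# Example: "it works!" -> Moses ['it', 'works', '!'] vs whitespace ['it', 'works!']
print(remap_alignment([(0, 0), (1, 1), (2, 1)],
                      ['it', 'works', '!'], ['it', 'works!'],
                      ['es', 'funktioniert', '!'], ['es', 'funktioniert!']))
# -> [(0, 0), (1, 1)]
```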

Now, with CJK where whitespace-based word splitting doesn't make sense, I changed all that to the following:

  1. Same as before, first we run Moses tokenization
  2. Train alignments on the tokenized corpus
  3. Feed the tokenized text and alignments to OpusTrainer and use custom_detok_src, custom_detok_trg directives (config example)
  4. OpusTrainer inserts inline noise
  5. OpusTrainer detokenizes text with Moses detokenizer before feeding to Marian
  6. For the student model, OpusTrainer retokenizes the alignments with the sentencepiece vocab tokenizer (we use the same vocab for Marian, so the tokens and their alignments are supposed to match; see the sketch below).
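As an illustration of step 6 (not OpusTrainer's actual code; `vocab.zhen.spm` is a placeholder path), encoding the detokenized line with the shared SentencePiece model yields the same pieces Marian trains on, which is why alignments remapped onto those pieces line up with the model input:

```python
# Illustration of step 6, not OpusTrainer's actual code.
# "vocab.zhen.spm" is a placeholder path for the shared SentencePiece model.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="vocab.zhen.spm")

line = "今天天气很好"
pieces = sp.encode(line, out_type=str)  # subword pieces, e.g. ['▁今天', '天气', '很', '好']
# Marian loads the same .spm vocab, so these pieces are exactly the tokens it
# trains on, and alignments expressed over them match the model's input.
print(pieces)
```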

I do this for all languages and for both the source and target parts of the corpus, to simplify the implementation. The benefit is that we don't have to remap anything on our side; it's all handled by OpusTrainer.

The question is whether this might somehow affect quality negatively. We lose some information due to tokenization/detokenization of the corpus. Maybe it's fine because we use the normalize_whitespace filter in OpusCleaner anyway: https://github.com/mozilla/firefox-translations-training/blob/9956ef28e27d051489d7fceaaf78542be2b9d55d/pipeline/clean/opuscleaner/configs/default.filters.json#L12

ZJaume commented 2 days ago

I was hoping there was a way to get rid of tokenization, but after reading all this, I don't see another way of doing what OpusTrainer does without it.

> We lose some information due to tokenization/detokenization of the corpus. Maybe it's fine because we use the normalize_whitespace filter in OpusCleaner anyway:

and SentencePiece also does whitespace normalization.