eu9ene opened this issue 3 months ago
I'm pretty concerned with this one, especially since OpusTrainer relies so heavily on Python's whitespace splitting. We might have to rely on a fork here if we want to use these robustness features. I could see replacing all of the whitespace-splitting call sites with a more robust splitter, for instance one based on the ICU segmenter, which is fully internationalized for our current language list.
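For illustration, a rough sketch of what such a splitter could look like with the PyICU bindings (just a sketch of the idea, not pipeline code; the package choice and exact wiring into OpusTrainer are assumptions):

```python
# Sketch: word segmentation with ICU's BreakIterator (via the PyICU package).
# ICU uses a dictionary-based segmenter for Chinese/Japanese, so it also works
# where plain whitespace splitting doesn't.
from icu import BreakIterator, Locale

def icu_words(text: str, lang: str = "zh") -> list[str]:
    bi = BreakIterator.createWordInstance(Locale(lang))
    bi.setText(text)
    tokens, start = [], bi.first()
    for end in bi:  # iterating yields successive break positions
        piece = text[start:end]
        start = end
        if piece.strip():  # skip pure-whitespace pieces
            tokens.append(piece)
    return tokens

print(icu_words("这是一个测试。"))         # e.g. ['这是', '一个', '测试', '。']
print(icu_words("Hello, world!", "en"))  # ['Hello', ',', 'world', '!']
```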
Yeah, it basically expects the text to already be "tokenized". There's support for detokenization/retokenization, so we could tokenize (segment) the Chinese text before feeding it to OpusTrainer and then detokenize before forwarding it to Marian. Another issue is that we currently also tokenize the text with Moses when producing the alignments, so we might want to reuse that tokenization.
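In other words, something like this round trip (using sacremoses here purely as a stand-in for whatever Moses tokenizer the pipeline actually runs, so treat it as a sketch):

```python
# Sketch of the round trip: tokenize before alignment / OpusTrainer,
# detokenize again before handing the text to Marian.
from sacremoses import MosesTokenizer, MosesDetokenizer

mt = MosesTokenizer(lang="en")
md = MosesDetokenizer(lang="en")

tokens = mt.tokenize("Hello, world! This is a test.")
print(tokens)                 # ['Hello', ',', 'world', '!', 'This', 'is', 'a', 'test', '.']
print(md.detokenize(tokens))  # 'Hello, world! This is a test.'
```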
Overall I'm quite confident that it's all possible: HPLT have trained Chinese models, and some of the OpusTrainer unit tests are even written for Chinese.
@ZJaume here's the context of our discussion with Jelmer: https://github.com/hplt-project/OpusTrainer/issues/38.
So far, the decision has been that OpusTrainer supports either whitespace-tokenized alignments or Moses-tokenized ones combined with detok directives in the config. See this code: https://github.com/hplt-project/OpusTrainer/blob/7be3b4dcc711d740b4b1a48d91a9b0a13d4ea276/src/opustrainer/modifiers/placeholders.py#L260
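For context, the alignments themselves are just Pharaoh-style token-index pairs, so the indices only mean something relative to a particular tokenization, which is why the detok directives matter. A tiny made-up illustration:

```python
# Pharaoh-format alignments: "src_idx-trg_idx ..." over token positions.
# With whitespace-tokenized input the indices refer to whitespace tokens;
# with Moses-tokenized input they refer to Moses tokens.
src = "Hello , world !"
trg = "Bonjour , le monde !"
alignment = "0-0 1-1 2-3 3-4"

src_toks, trg_toks = src.split(), trg.split()
for pair in alignment.split():
    s, t = map(int, pair.split("-"))
    print(f"{src_toks[s]} -> {trg_toks[t]}")
```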
We used to run Moses tokenization, then train the alignments, and then remap them back to whitespace-based tokenization to be compatible with OpusTrainer. We feed those alignments and the original untokenized corpus to OpusTrainer. We also use guided alignment for the student model, based on the bergamot recipe: for that we specify the SentencePiece vocab in the training config, and OpusTrainer then remaps the alignments to SPM-based tokenization and feeds them to Marian. Kind of crazy, but it works :)
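The remapping step is conceptually something like this (an illustration of the idea via character offsets, not the pipeline's actual code, and it ignores Moses escaping and other edge cases):

```python
# Illustration: map Moses-token indices back to whitespace-token indices
# by locating each token's character span in the original (untokenized) line.

def char_spans(tokens, text):
    """(start, end) character offsets of each token, scanning left to right."""
    spans, pos = [], 0
    for tok in tokens:
        start = text.index(tok, pos)
        spans.append((start, start + len(tok)))
        pos = start + len(tok)
    return spans

def moses_to_ws(text, moses_tokens):
    """For each Moses token, find the whitespace token whose span contains its start."""
    ws_spans = char_spans(text.split(), text)
    mapping = {}
    for i, (start, _) in enumerate(char_spans(moses_tokens, text)):
        for j, (ws_start, ws_end) in enumerate(ws_spans):
            if ws_start <= start < ws_end:
                mapping[i] = j
                break
    return mapping

text = "Hello, world!"
print(moses_to_ws(text, ["Hello", ",", "world", "!"]))  # {0: 0, 1: 0, 2: 1, 3: 1}
# Applying the source- and target-side maps to each "src-trg" pair (and deduplicating)
# yields whitespace-based alignments that OpusTrainer can consume.
```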
Now, with CJK where whitespace-based word splitting doesn't make sense, I changed all that to the following:
I do this for all languages, and for both the source and target parts of the corpus, to simplify the implementation. The benefit is that we don't have to remap anything on our side; it's all handled by OpusTrainer.
The question is whether this might negatively affect quality. We lose some information due to the tokenization/detokenization of the corpus. Maybe it's fine, because we already use the normalize_whitespace filter in OpusCleaner anyway: https://github.com/mozilla/firefox-translations-training/blob/9956ef28e27d051489d7fceaaf78542be2b9d55d/pipeline/clean/opuscleaner/configs/default.filters.json#L12
I was hoping there was a way to get rid of tokenization, but after reading all this, I don't see another way of doing what OpusTrainer does without it.
> We lose some information due to the tokenization/detokenization of the corpus. Maybe it's fine, because we already use the normalize_whitespace filter in OpusCleaner anyway.

and SP (SentencePiece) is also doing whitespace normalization.
As far as I understand, some modifiers are not needed for CJK (UpperCase, TitleCase), but some can still be used: