Update training to support CJK

eu9ene commented 6 days ago

- Output Moses-tokenized text from the alignments step (we used to remap alignments to whitespace-based tokenization to match the text)
- Use detok OpusTrainer modifiers to detokenize the text back after inline noise is added
- Add specific OpusTrainer configs
- Test Chinese in training tests
- Reduce chunk size for alignments as I had problems with it for CJK
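
For context, the remapping mentioned in the first item (projecting alignments computed on fine-grained tokens back onto whitespace-separated tokens) can be sketched roughly as below. This is an illustrative sketch only, not the pipeline's actual code: the function names `map_fine_to_whitespace` and `remap_alignment` are hypothetical, and it assumes the fine-grained tokens concatenate back to the whitespace tokens exactly (no Moses escaping such as `&apos;`).

```python
from typing import List


def map_fine_to_whitespace(whitespace_tokens: List[str], fine_tokens: List[str]) -> List[int]:
    """For each fine-grained token, return the index of the whitespace token that contains it.

    Assumption: concatenating the fine tokens reproduces the whitespace
    tokens exactly, with no escaping or normalization applied.
    """
    mapping = []
    ws_idx = 0
    consumed = 0  # characters of the current whitespace token already covered
    for tok in fine_tokens:
        if ws_idx >= len(whitespace_tokens):
            raise ValueError("more fine-grained text than whitespace text")
        word = whitespace_tokens[ws_idx]
        if not word.startswith(tok, consumed):
            raise ValueError(f"token {tok!r} does not match {word!r}")
        mapping.append(ws_idx)
        consumed += len(tok)
        if consumed == len(word):
            ws_idx += 1
            consumed = 0
    if ws_idx != len(whitespace_tokens):
        raise ValueError("fine-grained tokens do not cover the whole text")
    return mapping


def remap_alignment(alignment: str, src_map: List[int], trg_map: List[int]) -> str:
    """Project 'i-j' alignment pairs onto whitespace token indices, dropping duplicates."""
    pairs = set()
    for pair in alignment.split():
        src, trg = pair.split("-")
        pairs.add((src_map[int(src)], trg_map[int(trg)]))
    return " ".join(f"{s}-{t}" for s, t in sorted(pairs))


src_map = map_fine_to_whitespace("Hello, world!".split(), ["Hello", ",", "world", "!"])
trg_map = map_fine_to_whitespace("Hallo, Welt!".split(), ["Hallo", ",", "Welt", "!"])
remapped = remap_alignment("0-0 1-1 2-2 3-3", src_map, trg_map)
```

A projection like this only works when the text contains whitespace to project onto. For a Chinese sentence written without spaces, every alignment pair would collapse onto a single token, which is one reason to keep the tokenized text and detokenize later instead.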

closes #747 #751

eu9ene commented 6 days ago

I realized that when running the experiment I forgot to change the finetune-student Taskcluster kind. As a result, raw text was fed to OpusTrainer instead of tokenized text. That likely messed things up, and it would explain why I got a bad model. Maybe not only because of that :)

eu9ene commented 2 days ago

> This is not changing the tokenization for OpusTrainer, right? Moses has poor support for CJK. I think a specific tokenizer should be used for those languages, like jieba or MeCab. I have no experience working with CJK tokenizers, so I'm not really sure which of them should be used.

@ZJaume great feedback! I didn't see this issue.

Yes, this uses the default OpusTrainer tokenization method, which is Moses. I suggest leaving tokenizer replacement out of scope for this already big PR. We have a separate issue about this: #860. The idea is to use the ICU segmenter for everything, including inference in Firefox. The problem is that OpusTrainer doesn't support it. So the plan is to investigate whether we can use it, then add support in OpusTrainer, and only then switch to it in our pipeline.

I think the current tokenization method might be sufficient for the first experiments with CJK.

eu9ene commented 2 days ago

Maybe what I can do here is abstract away from the tokenization method a bit, for example by calling the artifacts ".tokenized" instead of ".moses". I can also add some TODOs.

mozilla / translations

Update training to support CJK #904