mozilla / translations

The code, training pipeline, and models that power Firefox Translations
https://mozilla.github.io/translations/
Mozilla Public License 2.0

Update training to support CJK #904

Open eu9ene opened 6 days ago

eu9ene commented 6 days ago

closes #747 #751

eu9ene commented 6 days ago

I realized that when running the experiment I forgot to change the finetune-student Taskcluster kind. As a result, raw text was fed to OpusTrainer instead of tokenized text. That likely messed things up and would explain the bad model, though it may not be the only cause :)

eu9ene commented 2 days ago

This is not changing the tokenization for OpusTrainer, right? Moses has poor support for CJK. I think a specific tokenizer for those languages should be used, such as jieba or MeCab. I have no experience working with CJK tokenizers, so I'm not really sure which of them should be used.

@ZJaume great feedback! I didn't see this issue.

Yes, this is using the default OpusTrainer tokenization method, which is Moses. I suggest leaving tokenizer replacement out of the scope of this already big PR. We have a separate issue about this: #860. The idea is to use the ICU segmenter for everything, including inference in Firefox. The problem is that OpusTrainer doesn't support it. So the plan is to investigate whether we can use it, then add support in OpusTrainer, and only then switch to it in our pipeline.

I think the current tokenization method might be sufficient for the first experiments with CJK.
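To illustrate the CJK concern raised above, here is a minimal sketch using only the Python standard library. `naive_tokenize` is a hypothetical stand-in for a whitespace-oriented tokenizer; it is not Moses and not what OpusTrainer runs. It only shows why splitting on spaces and punctuation leaves unsegmented Chinese text as one long token.

```python
import re

# Hypothetical stand-in for a whitespace-oriented tokenizer.
# This is NOT Moses; it only mimics "split on spaces and punctuation".
_TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def naive_tokenize(text: str) -> list[str]:
    """Return runs of word characters and single punctuation marks."""
    return _TOKEN_RE.findall(text)


# English words are separated by spaces, so the split is meaningful.
english = naive_tokenize("Hello, world!")

# Chinese has no spaces between words, so the sentence stays one token
# followed by the final punctuation mark.
chinese = naive_tokenize("我喜欢机器翻译。")

print(english)
print(chinese)
```

A dictionary- or model-based segmenter (jieba, MeCab, or ICU, as discussed in this thread) would split the Chinese sentence into several words instead.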

eu9ene commented 2 days ago

Maybe what I can do here is abstract away from the tokenization method a bit, for example by calling artifacts ".tokenized" instead of ".moses". I can also add some TODOs.
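A rough sketch of what that abstraction could look like. The helper name, the constant, and the `corpus.<lang>.tokenized` file layout are assumptions for illustration only and do not reflect the pipeline's actual naming.

```python
from pathlib import Path

# Hypothetical constant: a tokenizer-agnostic suffix, so artifacts do not
# need renaming if Moses is later replaced (for example by ICU).
TOKENIZED_SUFFIX = ".tokenized"


def tokenized_artifact(corpus: Path, lang: str) -> Path:
    """Build the path of the tokenized artifact for one language side.

    The naming scheme here is an assumption, not the pipeline's real layout.
    """
    # TODO: the tokenizer itself is still Moses; only the name is generic.
    return corpus.with_name(f"{corpus.name}.{lang}{TOKENIZED_SUFFIX}")


print(tokenized_artifact(Path("data/corpus"), "zh"))
```

Keeping the suffix in one place means a later tokenizer switch touches the tokenization step, not every task that reads the artifact.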