Open eu9ene opened 6 days ago
I realized that when running the experiment I forgot to change the finetune-student Taskcluster kind. As a result, raw text was fed to OpusTrainer instead of tokenized text. That likely messed things up, which is why I got a bad model. Maybe not only because of that :)
This is not changing the tokenization for OpusTrainer, right? Moses has poor support for CJK. I think a dedicated tokenizer should be used for those languages, like jieba or MeCab. I have no experience working with CJK tokenizers, so I'm not really sure which of them should be used.
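To illustrate the concern, here is a minimal sketch of why tokenization that relies on whitespace and punctuation struggles with CJK text. The `simple_tokenize` function is a hypothetical, simplified stand-in written for this example; it is not Moses and not what OpusTrainer actually runs.

```python
import re


def simple_tokenize(text: str) -> list[str]:
    """Hypothetical stand-in for whitespace-and-punctuation tokenization.

    Splits into runs of word characters and single punctuation marks.
    This is NOT the Moses tokenizer, only an illustration of the idea.
    """
    return re.findall(r"\w+|[^\w\s]", text)


# English has spaces between words, so each word becomes its own token.
english = simple_tokenize("I like machine translation.")

# Chinese has no spaces between words, so the whole sentence
# stays as one token (plus the final punctuation mark).
chinese = simple_tokenize("我喜欢机器翻译。")

print(english)
print(chinese)
```

A word segmenter for the language in question would split the Chinese sentence into separate words instead of leaving it as a single unit.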
@ZJaume great feedback! I didn't see this issue.
Yes, this is using the default OpusTrainer tokenization method, which is Moses. I suggest leaving tokenizer replacement out of the scope of this already big PR. We have a separate issue about this: #860. The idea is to use the ICU segmenter for everything, including inference in Firefox. The problem is that OpusTrainer doesn't support it. So the plan is to investigate whether we can use it, then add support in OpusTrainer, and only then will we be able to switch to it in our pipeline.
I think the current tokenization method might be sufficient for the first experiments with CJK.
Maybe what I can do here is abstract away from the tokenization method a bit, for example by calling artifacts ".tokenized" instead of ".moses". I can also add some TODOs.
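A minimal sketch of that renaming idea, assuming a hypothetical helper `tokenized_artifact_name` and hypothetical file names; none of this is taken from the actual pipeline code.

```python
# Hypothetical suffix: names the artifact by what it contains (tokenized text)
# rather than by the tool that produced it.
TOKENIZED_SUFFIX = ".tokenized"


def tokenized_artifact_name(stem: str, lang: str) -> str:
    """Build a tokenizer-agnostic artifact name (hypothetical naming scheme).

    TODO: the tokenizer behind this artifact is currently Moses;
    see #860 for the planned tokenizer replacement.
    """
    return f"{stem}{TOKENIZED_SUFFIX}.{lang}"


# The name stays the same no matter which tokenizer produced the file.
print(tokenized_artifact_name("corpus", "zh"))
```

With a name like this, downstream tasks would not need renaming if the tokenizer is swapped later.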
closes #747 #751