Open eu9ene opened 3 months ago
Two things could be helpful here. The first is the ICU segmenter (equivalent to the powerful Intl.Segmenter in JavaScript); I believe there are ICU bindings available in Python.
Tutorial: https://www.bedrick.org/notes/python-icu-bindings/
The other package is unicodedata, which gives you more information about individual codepoints.
https://docs.python.org/3/library/unicodedata.html
You can also write regexes using character properties.
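A minimal sketch of working with codepoint properties from the standard library. The `has_category` helper is a hypothetical name for illustration; note that the stdlib `re` module does not support property classes like `\p{Han}`, but the third-party `regex` package does.

```python
import unicodedata

def has_category(text: str, prefix: str) -> bool:
    """Check whether any codepoint's Unicode general category starts with
    the given prefix, e.g. 'L' for letters or 'P' for punctuation."""
    return any(unicodedata.category(ch).startswith(prefix) for ch in text)

# unicodedata also exposes names and categories per codepoint:
print(unicodedata.name("中"))      # CJK UNIFIED IDEOGRAPH-4E2D
print(unicodedata.category("中"))  # Lo (Letter, other)
```

With the third-party `regex` package the same check can be written as a property regex, e.g. `regex.search(r"\p{Han}", text)`.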
It's also important to run Unicode normalization on the text (so that codepoints are all represented consistently), which I don't think the OPUS tools have applied consistently.
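To illustrate why normalization matters before filtering, here is a small sketch using the stdlib: the same visible string can be encoded as different codepoint sequences, and NFC makes them consistent.

```python
import unicodedata

# The same visible text can use different codepoint sequences:
composed = "caf\u00e9"        # precomposed U+00E9
decomposed = "cafe\u0301"     # 'e' + combining acute accent
# The strings compare unequal despite looking identical:
print(composed == decomposed)  # False

# NFC normalization collapses them to one consistent representation:
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```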
I think we should lean into our OpusCleaner fork for fixes here, and once we are done, see if upstream wants to accept any of our fixes.
We should investigate the ICU segmenter separately from this issue. I filed #860.
Nikolay: Support for Chinese script should be added. In general we can use Unicode ranges to do so, but they are somewhat complicated: https://stackoverflow.com/questions/43418812/check-whether-a-string-contains-japanese-chinese-characters In the past I have used something like u'[\u4e00-\u9fff]', but this could be improved.
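A sketch of the range-based check mentioned above, using the U+4E00-U+9FFF block from the comment. As the linked Stack Overflow answer notes, this range is not exhaustive (extension blocks such as U+3400-U+4DBF exist, and Japanese kanji fall in the same range), so treat it as a heuristic.

```python
import re

# CJK Unified Ideographs block; covers most modern Chinese text, but
# note that Japanese kanji also fall in this range.
CJK_RE = re.compile(r"[\u4e00-\u9fff]")

def contains_chinese(text: str) -> bool:
    """Heuristic: does the string contain at least one CJK ideograph?"""
    return CJK_RE.search(text) is not None
```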
Length filtering. Since Chinese sentences normally come as one continuous string of characters, traditional length filtering doesn't work. Furthermore, since one word can be made of 1-4 Chinese characters, we can't have a hard-and-fast conversion rule. What people normally do is use a Chinese tokenizer (like jieba https://github.com/fxsjy/jieba#jieba-1 ) to split the Chinese text into words. We can then safely apply the filtering here:
https://github.com/mozilla/firefox-translations-training/blob/3b3f33bf2581238d325f05015123fc0a026c394e/pipeline/clean/tools/clean_parallel.py#L93
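A tokenizer-agnostic sketch of the word-count filter this would feed into. The function name and the 150-word threshold are hypothetical, not taken from clean_parallel.py; the point is that for Chinese you would pass jieba's list-returning tokenizer (`jieba.lcut`) instead of whitespace splitting.

```python
from typing import Callable, List

def too_long(line: str, tokenize: Callable[[str], List[str]],
             max_words: int = 150) -> bool:
    """Word-count length filter; `tokenize` maps a line to a word list.
    For English, str.split suffices; for Chinese, pass jieba.lcut
    (assuming jieba is installed)."""
    return len(tokenize(line)) > max_words

# English: whitespace tokenization
print(too_long("a short sentence", str.split))  # False
# Chinese would look like:
#   import jieba
#   too_long(zh_line, jieba.lcut)
```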
Most papers recommend discarding lines where the ratio of English to Chinese (or Chinese to English) words is more than 1.3.
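The ratio check could be sketched like this, once both sides are tokenized into word lists (function name is hypothetical):

```python
from typing import List

def ratio_ok(src_words: List[str], trg_words: List[str],
             max_ratio: float = 1.3) -> bool:
    """Keep a sentence pair only if the word-count ratio in either
    direction is at most max_ratio; empty sides are always dropped."""
    a, b = len(src_words), len(trg_words)
    if a == 0 or b == 0:
        return False
    return max(a, b) / min(a, b) <= max_ratio
```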
Afterwards, the text should be de-segmented again and prepared for training.
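Since Chinese is written without spaces, de-segmentation is just concatenating the tokens back without a separator; a one-line sketch (the function name is illustrative):

```python
from typing import List

def desegment_zh(tokens: List[str]) -> str:
    """Undo tokenization for Chinese: rejoin words with no separator,
    restoring the original unspaced text."""
    return "".join(tokens)

# e.g. ["我们", "在", "这里"] -> "我们在这里"
```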
A Japanese tokenizer should be used in place of jieba for Japanese.