Open eu9ene opened 3 months ago
Two things could be helpful here. The first is the ICU segmenter (equivalent to the powerful Intl.Segmenter in JavaScript); I believe there are ICU bindings available in Python.
Tutorial: https://www.bedrick.org/notes/python-icu-bindings/
The other package is unicodedata, which gives you more information about individual codepoints.
https://docs.python.org/3/library/unicodedata.html
You can also write regexes using character properties.
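A minimal sketch of working with codepoint properties from the standard library. The `has_category` helper is a hypothetical name for illustration; note that the stdlib `re` module does not support property classes like `\p{Han}`, but the third-party `regex` package does.

```python
import unicodedata

def has_category(text: str, prefix: str) -> bool:
    """Check whether any codepoint's Unicode general category starts with
    the given prefix, e.g. 'L' for letters or 'P' for punctuation."""
    return any(unicodedata.category(ch).startswith(prefix) for ch in text)

# unicodedata also exposes names and categories per codepoint:
print(unicodedata.name("中"))      # CJK UNIFIED IDEOGRAPH-4E2D
print(unicodedata.category("中"))  # Lo (Letter, other)
```

With the third-party `regex` package the same check can be written as a property regex, e.g. `regex.search(r"\p{Han}", text)`.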
It's also important to run Unicode normalization on the text (so that codepoints are all represented consistently), which I don't think the OPUS tools have applied consistently.
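To illustrate why normalization matters before filtering, here is a small sketch using the stdlib: the same visible string can be encoded as different codepoint sequences, and NFC makes them consistent.

```python
import unicodedata

# The same visible text can use different codepoint sequences:
composed = "caf\u00e9"        # precomposed U+00E9
decomposed = "cafe\u0301"     # 'e' + combining acute accent
# The strings compare unequal despite looking identical:
print(composed == decomposed)  # False

# NFC normalization collapses them to one consistent representation:
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```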
I think we should lean into our OpusCleaner fork for fixes here, and once we are done, see if upstream wants to accept any of our fixes.
We should investigate the ICU segmenter separately from this issue. I filed #860.
Nikolay: Support for Chinese script should be added. In general we can use Unicode ranges to do so, but they are somewhat complicated: https://stackoverflow.com/questions/43418812/check-whether-a-string-contains-japanese-chinese-characters In the past I have used something like u'[\u4e00-\u9fff]', but this could be improved.
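A sketch of the range-based check mentioned above, using the U+4E00-U+9FFF block from the comment. As the linked Stack Overflow answer notes, this range is not exhaustive (extension blocks such as U+3400-U+4DBF exist, and Japanese kanji fall in the same range), so treat it as a heuristic.

```python
import re

# CJK Unified Ideographs block; covers most modern Chinese text, but
# note that Japanese kanji also fall in this range.
CJK_RE = re.compile(r"[\u4e00-\u9fff]")

def contains_chinese(text: str) -> bool:
    """Heuristic: does the string contain at least one CJK ideograph?"""
    return CJK_RE.search(text) is not None
```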
Length filtering. Since Chinese sentences normally come as one continuous string of characters, traditional length filtering doesn't work. Furthermore, since one word can be made of 1-4 Chinese characters, we can't have a hard-and-fast conversion rule. What people normally do is use a Chinese tokenizer (like jieba https://github.com/fxsjy/jieba#jieba-1 ) to split the Chinese text into words. We can then safely apply the filtering here:
https://github.com/mozilla/firefox-translations-training/blob/3b3f33bf2581238d325f05015123fc0a026c394e/pipeline/clean/tools/clean_parallel.py#L93
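A tokenizer-agnostic sketch of the word-count filter this would feed into. The function name and the 150-word threshold are hypothetical, not taken from clean_parallel.py; the point is that for Chinese you would pass jieba's list-returning tokenizer (`jieba.lcut`) instead of whitespace splitting.

```python
from typing import Callable, List

def too_long(line: str, tokenize: Callable[[str], List[str]],
             max_words: int = 150) -> bool:
    """Word-count length filter; `tokenize` maps a line to a word list.
    For English, str.split suffices; for Chinese, pass jieba.lcut
    (assuming jieba is installed)."""
    return len(tokenize(line)) > max_words

# English: whitespace tokenization
print(too_long("a short sentence", str.split))  # False
# Chinese would look like:
#   import jieba
#   too_long(zh_line, jieba.lcut)
```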
Most papers recommend discarding lines where the ratio of English to Chinese (or Chinese to English) words is more than 1.3.
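The ratio check could be sketched like this, once both sides are tokenized into word lists (function name is hypothetical):

```python
from typing import List

def ratio_ok(src_words: List[str], trg_words: List[str],
             max_ratio: float = 1.3) -> bool:
    """Keep a sentence pair only if the word-count ratio in either
    direction is at most max_ratio; empty sides are always dropped."""
    a, b = len(src_words), len(trg_words)
    if a == 0 or b == 0:
        return False
    return max(a, b) / min(a, b) <= max_ratio
```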
Afterwards, the text should be de-segmented again and prepared for training.
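Since Chinese is written without spaces, de-segmentation is just concatenating the tokens back without a separator; a one-line sketch (the function name is illustrative):

```python
from typing import List

def desegment_zh(tokens: List[str]) -> str:
    """Undo tokenization for Chinese: rejoin words with no separator,
    restoring the original unspaced text."""
    return "".join(tokens)

# e.g. ["我们", "在", "这里"] -> "我们在这里"
```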
A Japanese tokenizer should be used in place of jieba for Japanese.