mozilla / translations

The code, training pipeline, and models that power Firefox Translations
https://mozilla.github.io/translations/
Mozilla Public License 2.0

Investigate word-based filtering for CJK #899

Open eu9ene opened 1 month ago

eu9ene commented 1 month ago

Nikolay: Length filtering. As Chinese sentences normally come as one continuous string of characters, traditional length filtering doesn't work. Furthermore, since one word can consist of 1-4 Chinese characters, there is no hard-and-fast rule for converting character counts to word counts. The usual approach is to use a Chinese tokenizer (such as jieba, https://github.com/fxsjy/jieba#jieba-1) to split the Chinese text into words. We can then safely apply the filtering here:

firefox-translations-training/pipeline/clean/tools/clean_parallel.py, line 93 in 3b3f33b:

    ratio_len = src_len / float(trg_len)

Most papers recommend discarding lines where the ratio of English to Chinese words, or of Chinese to English words, is more than 1.3.

Afterwards, the text should be de-segmented again and prepared for training.

For Japanese, a Japanese tokenizer should be used in place of jieba.
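
A minimal sketch of the word-based ratio check described above, not the pipeline's actual code. The function name `passes_length_ratio`, the `Tokenizer` alias, and the `max_ratio` parameter are hypothetical; only the 1.3 threshold and the idea of counting words instead of characters come from the comment. The tokenizer is passed in as a callable, so jieba, a Japanese tokenizer, or another segmenter can be swapped in per language.

```python
from typing import Callable, List

# Hypothetical alias: any function that turns a sentence into a list of words.
Tokenizer = Callable[[str], List[str]]


def whitespace_tokenize(text: str) -> List[str]:
    """Split on whitespace; adequate for space-delimited languages such as English."""
    return text.split()


def passes_length_ratio(
    src: str,
    trg: str,
    src_tokenize: Tokenizer = whitespace_tokenize,
    trg_tokenize: Tokenizer = whitespace_tokenize,
    max_ratio: float = 1.3,
) -> bool:
    """Return True if the word-count ratio stays within max_ratio in both directions."""
    src_len = len(src_tokenize(src))
    trg_len = len(trg_tokenize(trg))
    # Empty sides cannot be compared, so treat them as failing the filter.
    if src_len == 0 or trg_len == 0:
        return False
    # Longer over shorter covers both src/trg and trg/src with one comparison.
    ratio = max(src_len, trg_len) / float(min(src_len, trg_len))
    return ratio <= max_ratio


# With jieba installed, the Chinese side could use jieba.lcut, which returns a list of words:
#   import jieba
#   passes_length_ratio("I like tea", "我喜欢茶", trg_tokenize=jieba.lcut)
```

Because the tokenizer is used here only to count words, the original line is never modified, so this variant would not need a separate de-segmentation step. Whether punctuation and whitespace tokens returned by a given tokenizer should count toward the length is a choice left open in this sketch.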

gregtatum commented 3 weeks ago

This is another case where the ICU segmenter could be useful, see #860

Screenshot of the ICU segmenter splitting Chinese text at word granularity using the Intl.Segmenter API.