Nikolay:
Length filtering. Chinese sentences normally come as one continuous string of characters, so traditional length filtering doesn't work. Furthermore, since one word can consist of 1-4 Chinese characters, there is no hard-and-fast conversion rule. What people normally do is use a Chinese tokenizer (like jieba, https://github.com/fxsjy/jieba#jieba-1) to split the Chinese text into words. We can then safely apply the filtering here:
firefox-translations-training/pipeline/clean/tools/clean_parallel.py, line 93 at commit 3b3f33b:

```python
ratio_len = src_len / float(trg_len)
```
Most papers recommend discarding lines where the ratio of English to Chinese (or Chinese to English) words is more than 1.3.
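A minimal sketch of how that check could look once the Chinese side is segmented with jieba; `length_ratio_ok` and its parameters are hypothetical names for illustration, not part of clean_parallel.py:

```python
import jieba  # pip install jieba

def length_ratio_ok(en_line: str, zh_line: str, max_ratio: float = 1.3) -> bool:
    """Hypothetical helper: keep a sentence pair only if the
    English/Chinese word-count ratio is within max_ratio."""
    src_len = len(en_line.split())      # English: whitespace-separated words
    trg_len = len(jieba.lcut(zh_line))  # Chinese: jieba word segments
    if src_len == 0 or trg_len == 0:
        return False
    ratio_len = src_len / float(trg_len)
    return 1.0 / max_ratio <= ratio_len <= max_ratio

# Example: a 6-word English line against a 5-segment Chinese line
# gives a ratio of 1.2, which passes the 1.3 threshold.
```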
Afterwards, the text should be de-segmented again and prepared for training.
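Since jieba's segments concatenate back to the original string exactly, de-segmenting (assuming, as a sketch, that segmentation was only used for the filtering step) is just a matter of removing the spaces:

```python
import jieba

tokens = jieba.lcut("我们可以安全地应用过滤")  # e.g. ['我们', '可以', '安全', '地', '应用', '过滤']
restored = "".join(tokens)                      # joins back to the original string
```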
A Japanese tokenizer should be used in place of jieba for Japanese.
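For example (one possible choice, not prescribed by the pipeline), fugashi, a Python wrapper around MeCab, produces the same kind of word list:

```python
from fugashi import Tagger  # pip install 'fugashi[unidic-lite]'

tagger = Tagger("-Owakati")
words = [w.surface for w in tagger("日本語の文を単語に分割します")]
# len(words) can feed the same length-ratio filter used for Chinese.
```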