mozilla / firefox-translations-training

Training pipelines for Firefox Translations neural machine translation models

https://mozilla.github.io/firefox-translations-training/

Mozilla Public License 2.0

Monolingual data has a word splitter that won't work for CJK #424

Open gregtatum opened 7 months ago

gregtatum commented 7 months ago

Right now it splits on word boundaries and limits the monolingual data to fewer than 100 "words". That doesn't work for CJK languages, which don't put spaces between words, so this needs to be changed to support a different segmentation strategy for them, or maybe just a byte limit.
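
A rough sketch of the problem, assuming the "word" count comes from splitting on whitespace (the actual importer code may differ). The function names and the `MAX_BYTES` value are hypothetical and only there for illustration; only the 100-word limit comes from the description above.

```python
MAX_WORDS = 100   # the word limit described above
MAX_BYTES = 1000  # hypothetical byte limit, not a value from the pipeline


def within_word_limit(line: str, max_words: int = MAX_WORDS) -> bool:
    """Keep a line only if it has fewer than max_words whitespace-separated tokens."""
    return len(line.split()) < max_words


def within_byte_limit(line: str, max_bytes: int = MAX_BYTES) -> bool:
    """Keep a line only if its UTF-8 encoding is shorter than max_bytes."""
    return len(line.encode("utf-8")) < max_bytes


# 150 space-separated English words: the word limit rejects this line.
english = " ".join(["word"] * 150)

# A Chinese sentence repeated 150 times with no spaces: str.split() returns
# a single token, so the word limit lets this very long line through.
chinese = "这是一个没有空格的句子。" * 150

print(within_word_limit(english))  # False
print(within_word_limit(chinese))  # True
print(within_byte_limit(chinese))  # False
```

A byte limit is language-agnostic, but it is not equivalent across scripts: most CJK characters take 3 bytes in UTF-8, so the same limit allows fewer characters than it does for ASCII text.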

eu9ene commented 1 week ago

This will be solved by removing this filtering from the importer. The data is filtered in the monolingual cleaning step anyway. Ideally, we should use OpusCleaner there, but it doesn't support monolingual corpora yet.