mozilla / translations

The code, training pipeline, and models that power Firefox Translations
https://mozilla.github.io/translations/
Mozilla Public License 2.0

Adjust data cleaning for CJK #900

Open eu9ene opened 1 week ago

eu9ene commented 1 week ago

Use custom OpusCleaner configs with disabled word-based filters.

The filters are copied from https://github.com/hplt-project/HPLT-MT-Models/blob/main/v1.0/data/en-zh_hant/raw/v2/HPLT-v1.1.en-zh_hant.filters.json.
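
For illustration only, here is a minimal sketch of what "disabling word-based filters" could look like when applied to a filter config programmatically. It assumes the config is a JSON object with a top-level `filters` list whose entries carry a `filter` name. The helper `drop_word_based_filters` and the contents of `WORD_BASED_FILTERS` are hypothetical and not taken from this PR; the actual configs here were copied from HPLT rather than generated.

```python
import json

# Assumption: illustrative names of filters that depend on whitespace-delimited
# words. The real set used in this PR may differ.
WORD_BASED_FILTERS = {"src_trg_ratio", "max_word_length"}


def drop_word_based_filters(config: dict, word_based=frozenset(WORD_BASED_FILTERS)) -> dict:
    """Return a copy of the config without the word-based filter steps.

    Hypothetical helper: assumes each entry in config["filters"] is a dict
    with a "filter" key holding the filter name.
    """
    kept = [
        step
        for step in config.get("filters", [])
        if step.get("filter") not in word_based
    ]
    # Build a new dict so the input config is left untouched.
    return {**config, "filters": kept}


# Toy config in the assumed layout; parameters are left empty on purpose.
example = {
    "version": 1,
    "files": [],
    "filters": [
        {"filter": "remove_empty_lines", "parameters": {}, "language": None},
        {"filter": "src_trg_ratio", "parameters": {}, "language": None},
        {"filter": "max_word_length", "parameters": {}, "language": None},
    ],
}

cleaned = drop_word_based_filters(example)
print(json.dumps([step["filter"] for step in cleaned["filters"]]))
```

Filtering by name keeps the remaining steps in their original order, which matters if the cleaning steps are applied sequentially.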

I don't think it's feasible to apply the src-trg-ratio filter right now, because it requires tokenization. We would have to move tokenization to a separate step and somehow adjust the cleaning step to work with tokenized text instead of the original text. I filed https://github.com/mozilla/firefox-translations-training/issues/899
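
To make the tokenization problem concrete, here is a rough sketch of a whitespace-based length ratio. This is not the OpusCleaner implementation; `whitespace_word_count` and `whitespace_length_ratio` are hypothetical helpers written only to show why counting whitespace-separated words misbehaves on unsegmented Chinese text.

```python
def whitespace_word_count(text: str) -> int:
    """Count whitespace-separated tokens (hypothetical helper)."""
    return len(text.split())


def whitespace_length_ratio(src: str, trg: str) -> float:
    """Ratio of the longer side to the shorter side, in whitespace tokens.

    Illustrative only: a real filter may define the ratio differently.
    """
    src_len = whitespace_word_count(src)
    trg_len = whitespace_word_count(trg)
    # Guard against division by zero for empty lines.
    return max(src_len, trg_len) / max(1, min(src_len, trg_len))


src = "The weather is nice today"
trg = "今天天气很好"  # no spaces, so it counts as a single "word"

print(whitespace_word_count(src))         # 5
print(whitespace_word_count(trg))         # 1
print(whitespace_length_ratio(src, trg))  # 5.0
```

A reasonable translation pair ends up with a ratio of 5.0 because the Chinese side collapses into one token, so a word-based threshold would discard good data unless the text is segmented first.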

closes #742

gregtatum commented 1 week ago

Oh, and I don't have opinions on the rules themselves. Copying from another source seems reasonable, but I haven't thought through the rules and how they apply to CJK.