mozilla / translations

The code, training pipeline, and models that power Firefox Translations
https://mozilla.github.io/translations/
Mozilla Public License 2.0

OpusCleaner supports only a limited set of languages #649

Open eu9ene opened 5 months ago

eu9ene commented 5 months ago

I ran into an issue with Turkish: https://firefox-ci-tc.services.mozilla.com/tasks/Ip5AUlOmRU2hu2yP8RfS0w/runs/0/logs/public/logs/live.log

Specifically, the alpha_ratio filter: https://github.com/hplt-project/OpusCleaner/blob/main/opuscleaner/filters/clean_common.py

eu9ene commented 5 months ago

From our list of configs, I also don't see: tr, bs, id, vi, sr

gregtatum commented 5 months ago

I wonder if, as a mitigation, we can just skip the feature when a language isn't supported and add a note in the training config generation. Maybe the config generator could emit an explicit suppression of this feature so we know it was skipped.
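
A rough sketch of what that mitigation could look like in a config generator. Everything here is hypothetical: the `SUPPORTED_ALPHA_RATIO_LANGS` set, the `plan_alpha_ratio_filter` function, and the returned dictionary shape are illustrative only and do not reflect the actual training pipeline or OpusCleaner APIs.

```python
# Hypothetical list of languages with an alphabet definition.
# This is a placeholder, NOT the real set supported by OpusCleaner.
SUPPORTED_ALPHA_RATIO_LANGS = frozenset({"en", "de", "fr"})


def plan_alpha_ratio_filter(src, trg, supported=SUPPORTED_ALPHA_RATIO_LANGS):
    """Decide whether to enable the alphabet ratio filter for a language pair.

    Returns a dict with an `enabled` flag and a human-readable `note`
    that a config generator could write next to the suppressed filter.
    """
    unsupported = sorted(lang for lang in (src, trg) if lang not in supported)
    if unsupported:
        return {
            "enabled": False,
            "note": "alpha_ratio skipped: no alphabet for " + ", ".join(unsupported),
        }
    return {"enabled": True, "note": ""}
```

With this shape, a pair such as `("en", "tr")` would come back disabled with a note naming `tr`, so the suppression stays visible in the generated config instead of failing at training time.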

eu9ene commented 4 months ago

Based on my experiment, alphabet ratio filtering can be very effective, so we should just add support for those languages to OpusCleaner.
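
For context, a minimal sketch of the general idea behind alphabet ratio filtering: keep a line only if enough of its letters belong to the expected alphabet. This is an illustration under assumptions, not OpusCleaner's implementation. The `TURKISH_ALPHABET` constant, the function names, and the `min_ratio` default of 0.75 are all hypothetical.

```python
# Hypothetical alphabet definition for Turkish (29 letters, both cases).
# Illustrative only; NOT taken from OpusCleaner.
TURKISH_ALPHABET = frozenset(
    "abcçdefgğhıijklmnoöprsştuüvyz"
    "ABCÇDEFGĞHIİJKLMNOÖPRSŞTUÜVYZ"
)


def alphabet_ratio(text, alphabet):
    """Return the fraction of letters in `text` that belong to `alphabet`.

    Digits, punctuation, and whitespace are ignored. A line with no
    letters at all gets a ratio of 0.0.
    """
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(ch in alphabet for ch in letters) / len(letters)


def keep_line(text, alphabet, min_ratio=0.75):
    """Keep the line if its alphabet ratio reaches the (assumed) threshold."""
    return alphabet_ratio(text, alphabet) >= min_ratio
```

Under this sketch, a Turkish sentence passes while a line in another script is dropped, which is why a missing alphabet definition blocks the filter for a language entirely.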

eu9ene commented 4 months ago

@gregtatum here's my attempt to extend the alphabets, but I'm not confident about Vietnamese: https://github.com/eu9ene/OpusCleaner/commit/7aefb5b774106442b438f82132f2af14347945ff