Open eu9ene opened 5 months ago
From our list of configs, I also don't see: tr, bs, id, vi, sr
As a mitigation, I wonder if we can just skip the feature when a language isn't supported and add a note during training config generation. Maybe the config generator could emit an explicit suppression of this feature so we know it was skipped.
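A rough sketch of what that mitigation could look like in a config generator. Everything here is hypothetical: `plan_alpha_ratio`, the `supported` set, and the shape of the returned dict are made up for illustration and are not taken from the actual training config generator or from OpusCleaner.

```python
def plan_alpha_ratio(src: str, trg: str, supported: set[str]) -> dict:
    """Decide whether to enable the alphabet-ratio filter for a language pair.

    `supported` is an assumed set of language codes that the filter has
    alphabets for; the real list would have to come from OpusCleaner.
    """
    # Collect the languages of this pair that the filter cannot handle.
    missing = sorted(lang for lang in (src, trg) if lang not in supported)
    if missing:
        # Suppress the filter explicitly and record why, so the skip is visible
        # in the generated config rather than failing later in the pipeline.
        return {
            "alpha_ratio": False,
            "note": "alpha_ratio skipped, unsupported: " + ", ".join(missing),
        }
    return {"alpha_ratio": True, "note": None}


# Example with a placeholder supported set.
print(plan_alpha_ratio("en", "tr", {"en", "de", "fr"}))
```

The point of returning a note instead of silently dropping the filter is that the suppression shows up in the generated config and can be reviewed.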
Based on my experiment, alphabet ratio filtering can be very efficient, so we should just add support for those languages to OpusCleaner.
@gregtatum here's my attempt to extend alphabets but I'm not confident in Vietnamese: https://github.com/eu9ene/OpusCleaner/commit/7aefb5b774106442b438f82132f2af14347945ff
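For anyone following along, here is a minimal sketch of the general idea behind an alphabet ratio check and why each language needs its own character set. This is not OpusCleaner's implementation: the `ALPHABETS` table, the function names, and the 0.5 threshold are assumptions for illustration, and the Turkish character set below is only an approximation.

```python
import re

# Hypothetical character classes for illustration only; these are not the
# alphabets defined in OpusCleaner.
ALPHABETS = {
    "en": "a-zA-Z",
    "tr": "a-zA-ZçÇğĞıİöÖşŞüÜ",
}


def alpha_ratio(text: str, lang: str) -> float:
    """Fraction of non-whitespace characters that are in the language's alphabet."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    pattern = re.compile(f"[{ALPHABETS[lang]}]")
    in_alphabet = sum(1 for c in chars if pattern.fullmatch(c))
    return in_alphabet / len(chars)


def keep_pair(src: str, trg: str, src_lang: str, trg_lang: str,
              min_ratio: float = 0.5) -> bool:
    """Keep a sentence pair only if both sides are mostly in-alphabet text."""
    return (alpha_ratio(src, src_lang) >= min_ratio
            and alpha_ratio(trg, trg_lang) >= min_ratio)


print(keep_pair("Good morning", "Günaydın", "en", "tr"))
```

With a lookup like this, a language missing from the table raises a `KeyError`, which is the kind of failure the skip-or-extend discussion above is about.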
I ran into an issue with Turkish: https://firefox-ci-tc.services.mozilla.com/tasks/Ip5AUlOmRU2hu2yP8RfS0w/runs/0/logs/public/logs/live.log
Specifically, the alpha_ratio filter: https://github.com/hplt-project/OpusCleaner/blob/main/opuscleaner/filters/clean_common.py