hplt-project / OpusCleaner

OpusCleaner is a web interface that helps you select, clean and schedule your data for training machine translation models.
https://pypi.org/project/opuscleaner/

Support a hierarchy of filters #101

Open eu9ene opened 1 year ago

eu9ene commented 1 year ago

Hey, thank you for the great work as always!

We're looking into integrating OpusCleaner into the Firefox Translations training pipeline! Our workflow is quite automated, and most likely we'll keep our own data downloading and merging procedures and use only the opuscleaner-clean tool as the cleaning step of the pipeline (plus the UI to develop the filters).

I see that the current OpusCleaner workflow assumes manually setting all filtering rules for each language pair and each dataset. If we think about training at scale, this approach seems quite impractical, especially when the user doesn't know both languages.

I propose supporting a cascading system of independent filters:

  1. Default for all: general rules that can be applied to any language and dataset
  2. Language-pair-specific filters applicable to all datasets (we use only en-xx and xx-en pairs)
  3. Language-specific filters for monolingual data
  4. Dataset-specific filters applicable to all languages
  5. Finally, the current precise dataset-language-pair rules, or dataset-language rules for monolingual fixes (we have something similar here)

To apply these we could just run them all in sequence, but that is not ideal: a more specific filter cannot undo removals already made by a more general one, and it also wastes resources. Ideally, the merging should happen inside OpusCleaner. Alternatively, we could create a separate tool that merges the configurations and produces the final filter. The more specific filters would override the less specific ones.

A possible directory with the filters would look like this:

default.filters.json
ru.filters.json
en.filters.json
en-ru.filters.json
ELRC-3075-wikipedia_health-v1.filters.json
ada83-v1.filters.json
ada83-v1.en.filters.json
ELRC-3075-wikipedia_health-v1.en-ru.filters.json
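
To sketch how such filename-based scoping could be resolved, here is a minimal example in Python. The naming scheme, helper names and precedence order are assumptions for illustration, not existing OpusCleaner behaviour:

```python
import json
from pathlib import Path

def candidate_names(dataset: str, src: str, trg: str) -> list[str]:
    """Filter files that could apply, least specific first (one possible precedence)."""
    return [
        "default.filters.json",                 # all languages, all datasets
        f"{src}.filters.json",                  # per language
        f"{trg}.filters.json",
        f"{src}-{trg}.filters.json",            # per language pair
        f"{dataset}.filters.json",              # per dataset
        f"{dataset}.{src}.filters.json",        # dataset + language
        f"{dataset}.{src}-{trg}.filters.json",  # dataset + language pair (most specific)
    ]

def merge_filters(filter_dir: Path, dataset: str, src: str, trg: str) -> dict:
    """Merge whichever filter files exist; the most specific one is applied last and wins."""
    merged: dict = {}
    for name in candidate_names(dataset, src, trg):
        path = filter_dir / name
        if path.exists():
            merged.update(json.loads(path.read_text()))  # naive shallow override
    return merged

# e.g. merge_filters(Path("/data/filters"), "ELRC-3075-wikipedia_health-v1", "en", "ru")
```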

Then we would be able to run the cleaning tool on already downloaded data, using whatever filters are present, independently of the language pair:

opuscleaner-clean --input /data/raw/ELRC-3075-wikipedia_health-v1.en-ru.gz  --output /data/clean/ELRC-3075-wikipedia_health-v1.en-ru.gz --filters /data/filters/*.filters.json en ru 

To support this we would also need to make the filter JSON files universal, without mentioning specific languages or datasets. The naming of the filters would be sufficient to determine the target.
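
As an illustration of what "universal" could mean, a filter entry might refer to a side role (source/target) instead of a concrete language, with the role resolved at run time from the languages passed on the command line. The keys and values below are hypothetical, not OpusCleaner's actual schema:

```python
# Hypothetical, language-agnostic filter entry: "side" is a role, not a language code.
universal_entry = {
    "filter": "max_length",            # assumed filter name, for illustration only
    "side": "source",                  # placeholder role instead of e.g. "en"
    "parameters": {"max-length": 150},
}

def resolve_side(entry: dict, src: str, trg: str) -> dict:
    """Replace the role placeholder with the concrete language from the CLI arguments."""
    resolved = dict(entry)
    resolved["side"] = {"source": src, "target": trg}.get(entry["side"], entry["side"])
    return resolved

# resolve_side(universal_entry, "en", "ru")["side"] == "en"
```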

Related to https://github.com/hplt-project/OpusCleaner/issues/37

kpu commented 1 year ago

The fun part about automatic language identification is that the noise is different in each language. Noise just gets routed to whatever language happened to have the most similar noise.

For example, Wikipedia complained that https://github.com/laurieburchell/open-lid-dataset is classifying Cherokee (an unsupported language) as Japanese. So you'll only see Cherokee noise, including endless teenagers using the Cherokee alphabet to stylize their status messages, in the Japanese text. In this case a simple script filter is all that's needed.
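
As a concrete illustration of such a script filter (a sketch only, not one of OpusCleaner's built-in filters): drop any line that contains characters from the Cherokee syllabary block, which should never appear in Japanese text:

```python
import sys

# The Cherokee syllabary occupies the Unicode block U+13A0..U+13FF.
CHEROKEE = range(0x13A0, 0x1400)

def contains_cherokee(line: str) -> bool:
    return any(ord(ch) in CHEROKEE for ch in line)

# Read sentences from stdin and keep only those without Cherokee characters.
for line in sys.stdin:
    if not contains_cherokee(line):
        sys.stdout.write(line)
```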

The NLLB paper touches on the same problem: "To mitigate the learning of spurious correlations due to noisy training samples while modeling hundreds of languages, we worked in collaboration with linguists to develop several filters".

Paracrawl, for example, is not a uniform dataset: the Russian bonus data came from a different pipeline and different filtering than the other languages.

Overall I think the concept of training at scale without looking at the data is a bad one.

But I do agree there should be a system for scoping how broadly a filter applies, covering all languages of a dataset, for example.