The fun part about automatic language identification is that the noise is different in each language. Noise just gets routed to whatever language happens to have the most similar noise.
For example, Wikipedia complained that https://github.com/laurieburchell/open-lid-dataset classifies Cherokee (an unsupported language) as Japanese. So Cherokee noise, including endless teenagers using the Cherokee alphabet to stylize their status messages, shows up only in the Japanese text. In this case a simple script filter is all that's needed.
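Such a script filter can be tiny. A minimal sketch, assuming line-based filtering of the Japanese side (the function name is mine; the ranges are the Cherokee syllabary and its supplement):

```python
import re

# Cherokee syllabary (U+13A0-U+13FF) plus the Cherokee Supplement (U+AB70-U+ABBF).
CHEROKEE = re.compile(r"[\u13A0-\u13FF\uAB70-\uABBF]")

def keep(line: str) -> bool:
    """Keep a line of the 'Japanese' data only if it contains no Cherokee characters."""
    return not CHEROKEE.search(line)
```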
The NLLB paper describes exactly this problem: "To mitigate the learning of spurious correlations due to noisy training samples while modeling hundreds of languages, we worked in collaboration with linguists to develop several filters".
ParaCrawl, for example, is not a uniform dataset: the Russian bonus data came from a different pipeline and different filtering than the other languages.
Overall I think the concept of training at scale without looking at the data is a bad one.
But I do agree there should be a system for scoping how broadly a filter applies, e.g. covering all languages of a dataset.
Hey, thank you for the great work as always!
We're looking into integrating OpusCleaner into the Firefox Translations training pipeline! Our workflow is quite automated, and most likely we'll keep our own data downloading and merging procedures and use only the `opuscleaner-clean` tool as the cleaning step of the pipeline (plus the UI to develop the filters).

I see that the current OpusCleaner workflow assumes manually setting all filtering rules for each language pair and each dataset. If we think about training at scale, this approach seems quite impractical, especially when a user doesn't know both languages.
I propose to support a cascade of independent filters of increasing specificity: default filters for all data, filters scoped to a whole language or a whole dataset, and filters scoped to a specific dataset and language pair.
To apply those we could just run them all in sequence, but that is not ideal: it wastes resources, and it breaks down when an overriding filter is meant to undo some removals. Ideally the merging should happen inside OpusCleaner, or we could create a separate tool that merges the configs and produces the final filter. The more specific filters would override the less specific ones.
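To make the override semantics concrete, here is a rough sketch of what such a merging tool could do, assuming OpusCleaner-style configs with a top-level `filters` list; the directory layout and file names are hypothetical, and later (more specific) files replace same-named steps from earlier ones:

```python
import json
from pathlib import Path

def merge_filters(filters_dir: Path, dataset: str, lang: str) -> dict:
    """Merge cascading filter configs, least specific first."""
    # Hypothetical layout; later (more specific) paths override earlier ones.
    candidates = [
        filters_dir / "default.filters.json",            # all datasets, all languages
        filters_dir / f"{lang}.filters.json",            # one language, all datasets
        filters_dir / dataset / "default.filters.json",  # one dataset, all languages
        filters_dir / dataset / f"{lang}.filters.json",  # one dataset, one language
    ]
    merged = {}
    for path in candidates:
        if path.exists():
            config = json.loads(path.read_text(encoding="utf-8"))
            for step in config.get("filters", []):
                # Override by filter name: a more specific config replaces
                # the same-named step from a less specific one.
                merged[step["filter"]] = step
    return {"filters": list(merged.values())}
```

Merging once up front also avoids the wasted work of running a filter whose effect a more specific config overrides anyway.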
A possible directory with the filters could look like this (the names below are just an illustration):
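```
filters/
├── default.filters.json          # applies to all datasets and languages
├── ru.filters.json               # applies to all Russian data
└── OPUS-ParaCrawl/
    ├── default.filters.json      # this dataset, all languages
    └── ru.filters.json           # this dataset, Russian only
```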
Then we would be able to run the cleaning tool on the already-downloaded data with whatever filters are present, independently of the language pair:
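Roughly like this, where `merge-filters` is the hypothetical merging step sketched above and the `opuscleaner-clean` call is only an assumption about how the merged file would be consumed:

```sh
# Hypothetical: resolve the cascade for one dataset and pair, then clean.
merge-filters filters/ --dataset OPUS-ParaCrawl --langs en ru > merged.filters.json
opuscleaner-clean merged.filters.json en ru > OPUS-ParaCrawl.en-ru.tsv
```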
To support this, we would also need to make the filter JSON files universal, without mentioning specific languages or datasets inside them. The naming of the filter files would be sufficient to determine the target.
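For example, a universal config would contain only language-agnostic steps; the filter names and parameters below are invented purely for illustration:

```json
{
  "filters": [
    {"filter": "remove-empty-lines", "parameters": {}},
    {"filter": "max-length", "parameters": {"max-words": 150}},
    {"filter": "deduplicate", "parameters": {}}
  ]
}
```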
Related to https://github.com/hplt-project/OpusCleaner/issues/37