mozilla / translations

The code, training pipeline, and models that power Firefox Translations
https://mozilla.github.io/translations/
Mozilla Public License 2.0

Consider rebalancing datasets with clustering #844

Open eu9ene opened 1 month ago

eu9ene commented 1 month ago

See paper: Automatic Data Curation for Self-Supervised Learning: A Clustering-Based Approach.

This could be helpful, for example, for monolingual data, where we have a lot of it (all en-xx language pairs).
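To make the idea concrete, here is a minimal sketch of cluster-based rebalancing: group sentences by embedding, then cap how many are drawn from each cluster so that over-represented regions are downsampled. It is only an illustration under assumptions. As far as I understand it, the paper applies k-means hierarchically to learned embeddings, whereas this sketch uses a single level of plain k-means on made-up 2-D vectors. The names `kmeans`, `rebalance`, and `per_cluster` are hypothetical and not part of the training pipeline.

```python
import random
from collections import defaultdict


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns a cluster index for each point."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    assignment = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest centroid (squared Euclidean distance).
        assignment = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Recompute each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster went empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return assignment


def rebalance(items, embeddings, k, per_cluster, seed=0):
    """Sample at most `per_cluster` items from each cluster."""
    rng = random.Random(seed)
    clusters = defaultdict(list)
    for item, label in zip(items, kmeans(embeddings, k, seed=seed)):
        clusters[label].append(item)
    balanced = []
    for label in sorted(clusters):
        members = clusters[label]
        balanced.extend(rng.sample(members, min(per_cluster, len(members))))
    return balanced


# Toy data (assumption): 90 "common" sentences and 10 "rare" ones,
# with made-up 2-D embeddings standing in for real sentence embeddings.
items = [f"common-{i}" for i in range(90)] + [f"rare-{i}" for i in range(10)]
embeddings = [(i * 0.01, 0.0) for i in range(90)] + [(10 + i * 0.01, 10.0) for i in range(10)]
balanced = rebalance(items, embeddings, k=2, per_cluster=10)
```

In a real setting the embeddings would come from a sentence encoder, and both `k` and the per-cluster cap would need tuning against translation quality.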

Related to #231