mozilla / translations

The code, training pipeline, and models that power Firefox Translations
https://mozilla.github.io/translations/
Mozilla Public License 2.0

Rewrite merge mono and add support for an OPUS monolingual importer #787

Closed gregtatum closed 2 months ago

gregtatum commented 3 months ago

While rewriting my NLLB mono build script, I realized I could do the deduplication in the pipeline itself rather than in a separate build script. Building the dataset manually was pretty fiddly, especially with the English (`en`) data, which is over 50 GB.
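To illustrate the idea of deduplicating while merging, here is a minimal sketch. It is not the code from this PR: the function name `merge_deduplicated` and the `max_lines` parameter are hypothetical, and it assumes each dataset can be read as an iterable of text lines. It stores a short hash of each line instead of the line itself, so memory stays manageable for very large inputs.

```python
import hashlib
from typing import Iterable, Iterator, Optional


def merge_deduplicated(
    sources: Iterable[Iterable[str]],
    max_lines: Optional[int] = None,
) -> Iterator[str]:
    """Merge several line sources into one stream, skipping repeated lines.

    Hypothetical helper for illustration only. Lines are compared after
    stripping surrounding whitespace, and empty lines are dropped.
    """
    seen = set()  # 8-byte digests of the lines already emitted
    emitted = 0
    for source in sources:
        for line in source:
            text = line.strip()
            if not text:
                continue
            # A short digest keeps memory low; the trade-off is a very small
            # chance that two distinct lines collide and one is dropped.
            digest = hashlib.blake2b(text.encode("utf-8"), digest_size=8).digest()
            if digest in seen:
                continue
            seen.add(digest)
            yield text
            emitted += 1
            if max_lines is not None and emitted >= max_lines:
                return
```

Because the function is a generator, it can be fed open file handles and written out line by line without holding the merged dataset in memory.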

Resolves #390
Resolves #286