gordicaleksa / Open-NLLB

Effort to open-source NLLB checkpoints.
MIT License

Spanish and Guarani filtering #5

Closed by vienneraphael 1 year ago

vienneraphael commented 1 year ago

Figure out why we're filtering out so many sentences for Spanish (~38%) and Guarani (~50%).

gordicaleksa commented 1 year ago

Small tip: always be precise about the data source. In this example the numbers relate only to the public bi-text, and the answer was already provided (I know that our Discord channel is a bit overwhelming because of me, hah): duplication filtering occurred here.
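
For readers wondering how deduplication alone can remove a large share of a bi-text, here is a minimal sketch. It is not the Open-NLLB filtering code: the `normalize` and `dedup_bitext` helpers, the normalization rules, and the toy sentence pairs are all assumptions made up for illustration.

```python
import unicodedata


def normalize(text: str) -> str:
    """Hypothetical normalization: NFKC, lowercase, collapse whitespace."""
    text = unicodedata.normalize("NFKC", text).lower()
    return " ".join(text.split())


def dedup_bitext(pairs):
    """Keep the first occurrence of each normalized (source, target) pair.

    Returns the kept pairs and the fraction of pairs that were removed.
    """
    seen = set()
    kept = []
    for src, tgt in pairs:
        key = (normalize(src), normalize(tgt))
        if key in seen:
            continue  # near-identical pair already seen, drop it
        seen.add(key)
        kept.append((src, tgt))
    removed_ratio = 1 - len(kept) / len(pairs) if pairs else 0.0
    return kept, removed_ratio


# Toy data (illustrative only): two of the four pairs are duplicates
# of the first one once case and whitespace are normalized.
pairs = [
    ("Hola mundo", "Hello world"),
    ("hola  mundo", "hello world"),
    ("Buenos días", "Good morning"),
    ("Hola mundo", "Hello world"),
]
kept, removed_ratio = dedup_bitext(pairs)
print(len(kept), removed_ratio)
```

In a crawled public corpus, repeated boilerplate sentences can make this kind of filter account for a large removal rate on its own, which is why naming the exact data source matters when reporting the percentages.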