mozilla / translations

The code, training pipeline, and models that power Firefox Translations
https://mozilla.github.io/translations/
Mozilla Public License 2.0

Figure out accuracy of mtdata_Neulab-tedtalks datasets #635

gregtatum closed 5 months ago

gregtatum commented 5 months ago
  - mtdata_Neulab-tedtalks_test-1-eng-bos #             ~3,117,009 sentences (352.2 MB)

This English-to-Bosnian test set is far too big. Right now the test/dev/train sets are each moved to the appropriate part of the config, but a test set shouldn't contain roughly 3.1 million sentences. This needs investigation.

gregtatum commented 5 months ago

The issue is that the train, test, and dev sets are all in one big archive, so you would have to download the whole archive to generate sentence estimates for each set.
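
For illustration, here is a minimal sketch of how per-set sentence counts could be produced once the archive has been fully downloaded. It assumes the archive is a tar file with one sentence per line in each member; the function name `count_lines_per_member` is hypothetical, and the real archive layout for this dataset has not been checked here.

```python
import tarfile
from typing import Dict


def count_lines_per_member(archive_path: str) -> Dict[str, int]:
    """Return {member name: line count} for every regular file in a tar archive.

    Assumption: one sentence per line, so the line count approximates the
    sentence count for that member.
    """
    counts: Dict[str, int] = {}
    # "r:*" lets tarfile detect the compression (gzip, bz2, xz, or none).
    with tarfile.open(archive_path, mode="r:*") as archive:
        for member in archive:
            if not member.isfile():
                continue  # skip directories, links, and other special entries
            stream = archive.extractfile(member)
            if stream is None:
                continue
            with stream:
                # Iterating a binary stream yields one item per line.
                counts[member.name] = sum(1 for _ in stream)
    return counts
```

Grouping the resulting member names into train, test, and dev would then show how much of the ~3.1 million sentences actually belongs to the test set. The archive still has to be downloaded in full first, which is the cost noted above.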