Some mtdata datasets fail because of long name

mozilla / firefox-translations-training

Training pipelines for Firefox Translations neural machine translation models

https://mozilla.github.io/firefox-translations-training/

Mozilla Public License 2.0


Some mtdata datasets fail because of long name #654

eu9ene closed 2 months ago

eu9ene commented 4 months ago

It's related to the Taskcluster limit on label size (256 characters).

Example:

- mtdata_ELRC-convention_against_torture_other_cruel_inhuman_or_degrading_treatment_or_punishment_united_nations-1-ell-eng

bhearsum commented 3 months ago

We should fix this by shortening the name somehow. We could run these names through some function that guarantees:

- a short enough name
- uniqueness

bhearsum commented 3 months ago

@eu9ene, @gregtatum - how would you feel about taking the first N characters of the dataset name, truncating the rest, and appending an md5 (or similar) hash of the full dataset name? This would keep the names reasonably readable and ensure uniqueness without needing to maintain some sort of lookup table.
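
A minimal sketch of the truncate-and-hash idea, for illustration only. The function name `shorten_dataset_name` and the `max_prefix` and `hash_len` defaults are hypothetical and are not taken from the repository or from the linked PR. Because the hash is truncated, collisions are very unlikely rather than strictly impossible.

```python
import hashlib


def shorten_dataset_name(name: str, max_prefix: int = 20, hash_len: int = 8) -> str:
    """Return a bounded-length name: a readable prefix plus a hash of the full name.

    Hypothetical helper; the prefix and hash lengths are assumptions.
    """
    # Names that already fit within the bound are left untouched.
    if len(name) <= max_prefix + hash_len + 1:
        return name
    # Hash the FULL name so two names sharing a prefix get different suffixes.
    digest = hashlib.md5(name.encode("utf-8")).hexdigest()[:hash_len]
    return f"{name[:max_prefix]}_{digest}"


if __name__ == "__main__":
    long_name = (
        "mtdata_ELRC-convention_against_torture_other_cruel_inhuman_or_"
        "degrading_treatment_or_punishment_united_nations-1-ell-eng"
    )
    print(shorten_dataset_name(long_name))
```

The result is deterministic, so the same dataset always maps to the same label, and the readable prefix still hints at which dataset a task belongs to.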

eu9ene commented 3 months ago

Sounds good. I don't think the label is particularly important for training datasets. For evals, we use short dataset names.

bhearsum commented 3 months ago

Oh, I see we're already doing something similar after https://github.com/mozilla/firefox-translations-training/pull/611 - I'll extend that work.