Closed eu9ene closed 2 months ago
We should fix this by shortening the name somehow. We could run these names through some function that guarantees:
@eu9ene, @gregtatum - how would you feel about taking the first N characters of datasets, truncating the rest, and adding an md5 (or similar) hash of the full dataset to the end? This would keep the datasets reasonable readable and ensure uniqueness without needing to maintain some sort of lookup table.
Sounds good. I don't think the label is particularly important for training datasets. For evals, we use short dataset names.
Oh, I see we're already doing something similar after https://github.com/mozilla/firefox-translations-training/pull/611 - I'll extend that work.
It's related to the Taskcluster limitation of the label size (256 characters)
Example: