@La0 discovered that this was introduced by https://github.com/mozilla/firefox-translations-training/pull/500, which I believe added the (new) longest-named task that also contains the dataset name. Some adjustments in https://github.com/mozilla/firefox-translations-training/blob/main/taskcluster/translations_taskgraph/util/dataset_helpers.py ought to fix this. cc @gregtatum
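For illustration, here's a rough sketch of one possible adjustment (not the actual code in `dataset_helpers.py`; the helper name and length cap are made up) that truncates overly long sanitized dataset names and appends a short hash so labels stay well under the route-key limit:

```python
import hashlib

# Hypothetical cap on the dataset portion of a task label; the real number
# would need to leave room for the rest of the cached-task route key
# (249 characters total).
MAX_DATASET_NAME_LEN = 50


def shorten_dataset_name(sanitized_name: str) -> str:
    """Truncate an overly long sanitized dataset name, appending a short
    hash so distinct datasets still produce distinct task labels."""
    if len(sanitized_name) <= MAX_DATASET_NAME_LEN:
        return sanitized_name
    digest = hashlib.sha256(sanitized_name.encode("utf-8")).hexdigest()[:8]
    return f"{sanitized_name[:MAX_DATASET_NAME_LEN]}-{digest}"
```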
Hmm... I guess it would be nice to figure out a way to express this as a custom name for the datasets, like `dataset-custom_catalan_mono_corpus`, rather than deriving it from a URL, which could have duplicate filenames.
The initial idea of this importer was to specify any URL or file path and get the dataset from there. We no longer think much about compatibility with Snakemake, and it's unlikely that we'll use any unsupported external sources beyond our GCP bucket. This label is also used a lot more than it was in the old Snakemake pipeline. So I propose simplifying this importer and making it similar to the others, like OPUS. Basically, we would have an expected structure in the GCP bucket and would specify only the dataset name. For example:
`gs://releng-translations-dev/data/ru-en/pytest-dataset.zst`

would correspond to

    datasets:
      train:
        - gcp_pytest-dataset

The Taskcluster label would then be `dataset-gcp_pytest-dataset-ru-en`.
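As a rough sketch, such a `gcp` importer would only need a mapping like the following (the bucket path layout and label format are taken from the example above; none of this is existing code):

```python
# Assumed bucket layout: gs://releng-translations-dev/data/<src>-<trg>/<name>.zst
GCS_BASE = "gs://releng-translations-dev/data"


def gcp_dataset_location(dataset: str, src: str, trg: str) -> str:
    """Map a config entry like "gcp_pytest-dataset" to its bucket path."""
    name = dataset.removeprefix("gcp_")
    return f"{GCS_BASE}/{src}-{trg}/{name}.zst"


def gcp_task_label(dataset: str, src: str, trg: str) -> str:
    """Build the Taskcluster label, e.g. "dataset-gcp_pytest-dataset-ru-en"."""
    return f"dataset-{dataset}-{src}-{trg}"
```

With the example above, `gcp_dataset_location("gcp_pytest-dataset", "ru", "en")` returns `gs://releng-translations-dev/data/ru-en/pytest-dataset.zst`, and the label stays short no matter how deep the bucket path is.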
We could also implement it as a separate importer, since it does a slightly different thing.
If we want to preserve the URL importer, we could specify the base URL separately as a config option and use only the file name in the datasets and in the label. Then we could still use other URLs, but I don't know how useful that is now. Maybe if/when external people who don't have access to the GCP bucket want to train on their own datasets. A sketch of that option follows below.
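If we went that route, the importer would only need something like this (the `archive_base_url` config key is invented for illustration; this is a sketch, not an existing option):

```python
def url_dataset_location(config: dict, filename: str) -> str:
    """Join a base URL from the training config with a file name from the
    datasets section, so only the file name ever appears in the label."""
    base = config["archive_base_url"].rstrip("/")
    return f"{base}/{filename}"


def url_task_label(filename: str, src: str, trg: str) -> str:
    # e.g. "dataset-url_pytest-dataset-ru-en", well under the route limit.
    return f"dataset-url_{filename}-{src}-{trg}"
```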
Another instance of this issue: https://firefox-ci-tc.services.mozilla.com/tasks/QOikHCOqSdKc3kJQYqnnBA/runs/0/logs/public/logs/live.log. We should fix it.
Right now, these tasks have labels like:
`dataset-url-https_storage_googleapis_com_releng-translations-dev_data_en-ru_pytest-dataset__LANG__zst-ru-en`

These contain a lot of unnecessary information, IMO. More pressingly, because task labels are included in cached task routes, we bump up against the 249-character limit for route keys if the URL (or even a branch name) is too long. I suggest we shrink this down to the filename only, or, if strictly necessary, the filename plus the leading directory (which is the locale pair).
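A minimal sketch of that suggestion (assuming we only need the last one or two path components; this is not what `dataset_helpers.py` currently does):

```python
import re
from urllib.parse import urlparse


def shortened_url_dataset_name(url: str, keep_parent_dir: bool = False) -> str:
    """Reduce a dataset URL to its file name, optionally keeping the leading
    locale-pair directory, and sanitize it for use in a task label."""
    path_parts = urlparse(url).path.strip("/").split("/")
    parts = path_parts[-2:] if keep_parent_dir else path_parts[-1:]
    # Route keys only tolerate a limited character set, so replace anything
    # unexpected with underscores.
    return re.sub(r"[^A-Za-z0-9_-]", "_", "_".join(parts))
```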
@eugene - any thoughts?