mozilla / translations

The code, training pipeline, and models that power Firefox Translations
https://mozilla.github.io/translations/
Mozilla Public License 2.0

W&B runs published from GCP experiments should be suffixed with the Task Group ID when possible #734

Closed: vrigal closed this issue 1 month ago

vrigal commented 3 months ago

Depends on #727

Runs published from GCP experiments should be suffixed with the first 5 characters of the task group ID (when the "group" folder contains the task group ID, see example below). Please use the generic method translations_parser.utils.suffix_from_group.

Refs. https://github.com/mozilla/firefox-translations-training/pull/727#discussion_r1674807729:

For offline uploading of Taskcluster tasks from GCP, we'll want to apply the same strategy as for other Taskcluster tasks. They should be in folders like baseline_enhu_PuI6mYZPTUqAfyZMTgeUng. In this case, we can parse the folder name and again use the group ID as a suffix (teacher-1_PuI6m).
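
To make the naming rule concrete, here is a minimal sketch of how a run name could be suffixed once the group ID is known. The helpers `group_suffix` and `suffixed_run_name` are hypothetical names used for illustration only; this is not the actual translations_parser.utils.suffix_from_group implementation, whose signature is not shown in this thread.

```python
def group_suffix(group_id: str, length: int = 5) -> str:
    """Return the run-name suffix: an underscore plus the first characters of the group ID.

    Hypothetical helper for illustration; the length of 5 comes from this issue.
    """
    if len(group_id) < length:
        raise ValueError(f"Group ID {group_id!r} is shorter than {length} characters")
    return f"_{group_id[:length]}"


def suffixed_run_name(run_name: str, group_id: str) -> str:
    """Append the group suffix to a W&B run name (hypothetical helper)."""
    return run_name + group_suffix(group_id)


if __name__ == "__main__":
    # Group ID taken from the example folder name quoted above.
    print(suffixed_run_name("teacher-1", "PuI6mYZPTUqAfyZMTgeUng"))
```

With the group ID from the quoted example, this yields the run name mentioned there, teacher-1_PuI6m.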

vrigal commented 3 months ago

@eu9ene are you aware of a method to parse a task group ID from folders corresponding to the group name? Maybe from the taskcluster package or using a regex?

eu9ene commented 2 months ago

Here's an example of the folders we have on the production GCP bucket:

gsutil ls gs://moz-fx-translations-data--303e-prod-translations-data/models/en-hu
gs://moz-fx-translations-data--303e-prod-translations-data/models/en-hu/baseline_enhu_PuI6mYZPTUqAfyZMTgeUng/
gs://moz-fx-translations-data--303e-prod-translations-data/models/en-hu/baseline_enhu_S5E71GihQM6Te_KdrUmATw/
gs://moz-fx-translations-data--303e-prod-translations-data/models/en-hu/baseline_enhu_aY25-4fXTcuJNuMcWXUYtQ/
gsutil ls gs://moz-fx-translations-data--303e-prod-translations-data/models/en-hu/baseline_enhu_PuI6mYZPTUqAfyZMTgeUng/evaluation
gs://moz-fx-translations-data--303e-prod-translations-data/models/en-hu/baseline_enhu_PuI6mYZPTUqAfyZMTgeUng/evaluation/teacher-ensemble/
gs://moz-fx-translations-data--303e-prod-translations-data/models/en-hu/baseline_enhu_PuI6mYZPTUqAfyZMTgeUng/evaluation/teacher-finetuned0/
gs://moz-fx-translations-data--303e-prod-translations-data/models/en-hu/baseline_enhu_PuI6mYZPTUqAfyZMTgeUng/evaluation/teacher-finetuned1/
gs://moz-fx-translations-data--303e-prod-translations-data/models/en-hu/baseline_enhu_PuI6mYZPTUqAfyZMTgeUng/evaluation/teacher0/
gs://moz-fx-translations-data--303e-prod-translations-data/models/en-hu/baseline_enhu_PuI6mYZPTUqAfyZMTgeUng/evaluation/teacher1/

The name of the folder is <config_experiment_name>_<taskcluster_group_id>. The structure should be similar to what we had on the dev GCP bucket, but we should double-check the naming and that everything is being parsed correctly.
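
One possible regex-based approach, sketched under assumptions: the group ID is taken to be a 22-character URL-safe base64 slug (which matches the three folders listed above), and `parse_group_folder` is a hypothetical helper, not an existing function in the repository or the taskcluster package. Because both the experiment name and the group ID can contain underscores (for example S5E71GihQM6Te_KdrUmATw), splitting on the last underscore would be unreliable, so the pattern anchors on the fixed-length ID at the end of the folder name instead.

```python
import re
from typing import NamedTuple, Optional

# Assumption: a Taskcluster group ID is a 22-character URL-safe base64 slug.
GROUP_FOLDER_RE = re.compile(
    r"^(?P<experiment>.+)_(?P<group_id>[A-Za-z0-9_-]{22})$"
)


class GroupFolder(NamedTuple):
    experiment: str
    group_id: str


def parse_group_folder(path: str) -> Optional[GroupFolder]:
    """Find the first path component shaped like <experiment>_<group_id>.

    Returns None when no component matches, so callers can fall back to
    publishing the run without a suffix.
    """
    for component in path.split("/"):
        match = GROUP_FOLDER_RE.match(component)
        if match is not None:
            return GroupFolder(match["experiment"], match["group_id"])
    return None


if __name__ == "__main__":
    # Placeholder bucket name; the folder names come from the listing above.
    example = (
        "gs://example-bucket/models/en-hu/"
        "baseline_enhu_S5E71GihQM6Te_KdrUmATw/evaluation/teacher0/"
    )
    print(parse_group_folder(example))
```

A pattern this loose could also match an unrelated folder whose name happens to end in an underscore followed by 22 such characters, so the naming on the bucket would still need to be checked as noted above.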