mozilla / firefox-translations-training

Training pipelines for Firefox Translations neural machine translation models
https://mozilla.github.io/firefox-translations-training/
Mozilla Public License 2.0

Suffix W&B runs with task group ID for offline Taskcluster publication from GCP #799

Closed vrigal closed 2 weeks ago

vrigal commented 1 month ago

Closes #734

vrigal commented 1 month ago

@eu9ene I ran W&B publication from the GCP folder using this command:

gsutil -m rsync -r -x '.*\.npz$|.*\.gz$' gs://moz-fx-translations-data--303e-prod-translations-data/ en-hu
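gsutil documents the -x value as a Python regular expression, so the exclusion pattern can be checked locally before syncing. This is only a sketch: the file names are illustrative and `is_excluded` is a hypothetical helper, assuming the pattern is matched against each object's relative path.

```python
import re

# Same pattern as the one passed to `gsutil rsync -x` above.
EXCLUDE = re.compile(r".*\.npz$|.*\.gz$")


def is_excluded(relative_path: str) -> bool:
    """Return True if the path would be skipped by the exclusion pattern."""
    return EXCLUDE.match(relative_path) is not None


# Illustrative paths, not taken from the bucket.
for path in ["models/teacher/model.npz", "data/corpus.en.gz", "logs/train.log"]:
    print(path, "excluded" if is_excluded(path) else "kept")
```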

Here are the corresponding projects in W&B:

I made the wrong assumption that experiments from GCP that ran in Taskcluster are prefixed with baseline_, but I see many other group names (retrain1, retrain2, fix_opt_delay, ...). Can you give me a way to detect the task group ID from the folder name? Otherwise I can add an argument to the GCP experiments entrypoint that systematically treats the last 22 characters as the group ID.

eu9ene commented 1 month ago

> @eu9ene I ran W&B publication from the GCP folder using this command:
>
> gsutil -m rsync -r -x '.*\.npz$|.*\.gz$' gs://moz-fx-translations-data--303e-prod-translations-data/ en-hu
>
> Here are the corresponding projects in W&B:
>
> * https://wandb.ai/teklia/799-lt-en
> * https://wandb.ai/teklia/799-en-ru
> * https://wandb.ai/teklia/799-en-hu
>
> I made the wrong assumption that experiments from GCP that ran in Taskcluster are prefixed with baseline_, but I see many other group names (retrain1, retrain2, fix_opt_delay, ...). Can you give me a way to detect the task group ID from the folder name? Otherwise I can add an argument to the GCP experiments entrypoint that systematically treats the last 22 characters as the group ID.

The last 22 characters should work.
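A minimal sketch of that rule, assuming folder names have the form `<experiment>_<group ID>` as in the examples in this thread. `split_experiment_folder` is a hypothetical helper, not the function used in the PR.

```python
TASK_GROUP_ID_LENGTH = 22  # length of a Taskcluster task group ID, as discussed above


def split_experiment_folder(folder_name: str) -> tuple[str, str]:
    """Split `<experiment>_<group ID>` into (experiment, group ID).

    Hypothetical helper: the `_` separator check is an assumption
    based on the folder names quoted in this thread.
    """
    if len(folder_name) <= TASK_GROUP_ID_LENGTH + 1:
        raise ValueError(f"Folder name too short to hold a group ID: {folder_name!r}")
    prefix = folder_name[:-TASK_GROUP_ID_LENGTH]
    group_id = folder_name[-TASK_GROUP_ID_LENGTH:]
    if not prefix.endswith("_"):
        raise ValueError(f"Expected '_' before the group ID in {folder_name!r}")
    return prefix[:-1], group_id


print(split_experiment_folder("baseline_enhu_PuI6mYZPTUqAfyZMTgeUng"))
```

Slicing by length rather than splitting on the last `_` matters here, because Taskcluster IDs are URL-safe base64 slugs and may themselves contain `_` or `-`.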

eu9ene commented 1 month ago

As we saw today, there's an issue in the naming: backwards_typos_noise_ZaKoAaeqR1GLEnMWRb2tDQ. It should be just backward_ZaKoA (we take the first 5 characters of the group ID now, right?)
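For illustration, the expected naming can be sketched as below. `build_run_name` and its `suffix_length` parameter are hypothetical names; only the "first 5 characters" rule comes from the comment above.

```python
def build_run_name(model: str, group_id: str, suffix_length: int = 5) -> str:
    """Build a W&B run name as `<model>_<first characters of the group ID>`."""
    return f"{model}_{group_id[:suffix_length]}"


print(build_run_name("backward", "ZaKoAaeqR1GLEnMWRb2tDQ"))  # backward_ZaKoA
```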

vrigal commented 1 month ago

@eu9ene I added a required --mode argument to force parsing of the task group ID for GCP experiments run in Taskcluster. I noticed that the train.log files were located in a different folder than for Snakemake experiments (in the main logs folder instead of models).

I also had to ignore some old snakemake metrics that cannot be parsed with the generic parser (e.g. eval_student-finetuned_tc_Tatoeba-Challenge-v2021-08-07).

Finally, the naming of the .metrics files was also different. I found no way to retrieve the "importer" part from the file names, which was usually the only part required to publish metrics (as <importer>_<?augmentation>_<?dataset>). All these examples are taken from this GCP folder:

In old GCP experiments (Snakemake), .metrics files were in a different form:

In the end I decided to keep the existing code for Snakemake experiments (which publishes the importer and dataset correctly), and to simply use the file name as the importer (which is a required field) for Taskcluster GCP experiments. It will not be possible to compare those metrics with tasks published from the CI, though. This topic is complex and, in my opinion, a new issue should be dedicated to it.
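To make the two behaviours concrete, here is a rough sketch, not the code from this PR. It assumes the augmentation part, when present, starts with `aug-`; that prefix, the helper names and the sample file names are all hypothetical.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass
class MetricName:
    importer: str
    augmentation: Optional[str] = None
    dataset: Optional[str] = None


def _stem(path: str) -> str:
    """File name without directory and without the `.metrics` extension."""
    name = Path(path).name
    return name[: -len(".metrics")] if name.endswith(".metrics") else name


def parse_metric_name(path: str) -> MetricName:
    """Parse `<importer>_<?augmentation>_<?dataset>` (assumes an `aug-` prefix)."""
    importer, _, rest = _stem(path).partition("_")
    if not importer:
        raise ValueError(f"Cannot find an importer in {path!r}")
    augmentation = None
    if rest.startswith("aug-"):
        augmentation, _, rest = rest.partition("_")
    return MetricName(importer, augmentation, rest or None)


def fallback_metric_name(path: str) -> MetricName:
    """Fallback described above: use the whole file name as the importer."""
    return MetricName(importer=_stem(path))


print(parse_metric_name("flores_aug-mix_devtest.metrics"))
print(fallback_metric_name("some_unparsable_name.metrics"))
```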

I updated the script so it finds metrics now, and published everything again:

eu9ene commented 1 month ago

Ok, I filed: #809 and https://github.com/mozilla/firefox-translations-training/issues/808

vrigal commented 2 weeks ago

@eu9ene I added a parser in the utils module for GCP tasks (the augmentation parameter was not supported with the old Snakemake implementation). Everything should be fine now; I published again to our W&B namespace:

eu9ene commented 2 weeks ago

> @eu9ene I added a parser in the utils module for GCP tasks (the augmentation parameter was not supported with the old Snakemake implementation). Everything should be fine now; I published again to our W&B namespace:
>
> * https://wandb.ai/teklia/799-lt-en-TC
> * https://wandb.ai/teklia/799-en-hu-TC
> * https://wandb.ai/teklia/799-pt-en-SNAKEMAKE
> * https://wandb.ai/teklia/799-en-sv-SNAKEMAKE

@vrigal looking at https://wandb.ai/teklia/799-lt-en-TC/runs/group_logs_K1iHn/workspace?nw=nwuserepavlov, evals seem to be displayed correctly but the group_logs table is missing. It is present in Snakemake runs.

vrigal commented 2 weeks ago

Oh I see, thank you for the catch! This is the same issue as I mentioned in https://github.com/mozilla/firefox-translations-training/pull/799#issuecomment-2298932937:

> train.log files were located in a different folder than for Snakemake experiments (in the main logs folder instead of models).

I patched the group_logs hook and had to support patching model names as well (e.g. backward → backwards, teacher0 → teacher-0). Taskcluster experiments have been republished to W&B: https://wandb.ai/teklia/799-lt-en-TC and https://wandb.ai/teklia/799-en-hu-TC.
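A sketch of what such a model name patch could look like. The direction of the mapping (backward to backwards, teacher0 to teacher-0) is an assumption read from the two examples, and `MODEL_NAME_ALIASES` and `normalize_model_name` are hypothetical names, not the ones used in the PR.

```python
import re

# Hypothetical alias table; only the "backward" example comes from the thread.
MODEL_NAME_ALIASES = {"backward": "backwards"}


def normalize_model_name(name: str) -> str:
    """Map a model name to the assumed target form."""
    name = MODEL_NAME_ALIASES.get(name, name)
    # Insert a hyphen between "teacher" and its index: teacher0 -> teacher-0.
    return re.sub(r"^(teacher)(\d+)$", r"\1-\2", name)


for raw in ["backward", "teacher0", "student"]:
    print(raw, "->", normalize_model_name(raw))
```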

eu9ene commented 2 weeks ago

Tests are failing in CI:

FAILED tests/test_tracking_cli.py::test_experiments_marian_1_10 - assert {(20...
[task 2024-09-10T12:38:09.993Z] FAILED tests/test_tracking_cli.py::test_experiments_marian_1_12 - assert {(20...
[task 2024-09-10T12:38:09.993Z] ============ 2 failed, 140 passed, 3 warnings in 427.77s (0:07:07) =============

vrigal commented 2 weeks ago

Tests should be fine now. I made a few changes:

eu9ene commented 2 weeks ago

@vrigal I noticed that some of the models don't have config parameters (marian, arguments), for example https://wandb.ai/teklia/799-lt-en-TC/runs/teacher-0_H396I/overview

vrigal commented 2 weeks ago

Well, I also noticed a missing run: group_logs_ZaKoA. This was due to a bug in the code that browses the GCP folder structure. I had to rewrite this part and re-upload everything to W&B. Everything seems consistent now. @eu9ene It would be nice to get this merged soon, so that I keep a little time to do the upload to our workspace before we do #574 (I should have 1 working day left to spend on it). By the way, can you provide a list of Taskcluster group IDs and GCP experiments, so I can erase everything on https://wandb.ai/teklia/projects and re-upload those examples?

eu9ene commented 2 weeks ago

Ok, I looked briefly and noticed several teacher-finetune steps for https://wandb.ai/teklia/799-en-hu-TC/groups/baseline_enhu_PuI6mYZPTUqAfyZMTgeUng/workspace?nw=nwuserepavlov but I think we can skip this since it's an old experiment and we don't have this step now.