Suffix W&B runs with task group ID for offline Taskcluster publication from GCP

vrigal commented 2 months ago

Closes #734

vrigal commented 2 months ago

@eu9ene I ran W&B publication from the GCP folder using this command:

gsutil -m rsync -r -x '.*\.npz$|.*\.gz$' gs://moz-fx-translations-data--303e-prod-translations-data/ en-hu

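Here are the corresponding projects in W&B:

* https://wandb.ai/teklia/799-lt-en
* https://wandb.ai/teklia/799-en-ru
* https://wandb.ai/teklia/799-en-hu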

I wrongly assumed that GCP experiments that ran in Taskcluster are prefixed with baseline_, but I see many other group names (retrain1, retrain2, fix_opt_delay, ...). Can you give me a way to detect the task group ID from the folder name? Otherwise I can add an argument to the GCP experiments entrypoint that systematically treats the last 22 characters as the group ID.

eu9ene commented 2 months ago

> @eu9ene I ran W&B publication from the GCP folder using this command:
>
> gsutil -m rsync -r -x '.*\.npz$|.*\.gz$' gs://moz-fx-translations-data--303e-prod-translations-data/ en-hu
>
> Here are the corresponding projects in W&B:
>
> * https://wandb.ai/teklia/799-lt-en
> * https://wandb.ai/teklia/799-en-ru
> * https://wandb.ai/teklia/799-en-hu
>
> I wrongly assumed that GCP experiments that ran in Taskcluster are prefixed with baseline_, but I see many other group names (retrain1, retrain2, fix_opt_delay, ...). Can you give me a way to detect the task group ID from the folder name? Otherwise I can add an argument to the GCP experiments entrypoint that systematically treats the last 22 characters as the group ID.

The last 22 characters should work.
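A minimal sketch of that rule, assuming folder names have the form `<experiment>_<group ID>` and that a Taskcluster group ID is a 22-character slug of URL-safe base64 characters. The helper name `split_experiment_folder` is hypothetical and is not taken from this PR.

```python
import re

# Assumption: a Taskcluster task group ID is a 22-character URL-safe base64 slug.
GROUP_ID_RE = re.compile(r"^[A-Za-z0-9_-]{22}$")


def split_experiment_folder(folder_name: str) -> tuple[str, str]:
    """Split '<experiment>_<group_id>' into (experiment, group_id).

    Hypothetical helper: the group ID is taken to be the last 22 characters.
    """
    prefix, group_id = folder_name[:-22], folder_name[-22:]
    # The experiment name must be non-empty and separated by an underscore.
    if len(prefix) < 2 or not prefix.endswith("_") or not GROUP_ID_RE.match(group_id):
        raise ValueError(f"No task group ID found in folder name: {folder_name!r}")
    return prefix[:-1], group_id


experiment, group_id = split_experiment_folder("baseline_enhu_PuI6mYZPTUqAfyZMTgeUng")
print(experiment, group_id)
```

A long Snakemake folder name could match this pattern by accident, which is one reason an explicit argument on the entrypoint is safer than guessing from the name alone.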

eu9ene commented 2 months ago

As we saw today, there's an issue in the naming: backwards_typos_noise_ZaKoAaeqR1GLEnMWRb2tDQ. It should be just backward_ZaKoA (we take the first 5 characters of the group ID now, right?).
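For illustration, the expected naming could be sketched as below. The function name `suffix_run_name` and its `length` parameter are assumptions made for this example, not the code in this PR.

```python
def suffix_run_name(model_name: str, group_id: str, length: int = 5) -> str:
    """Append the first `length` characters of the task group ID to a run name."""
    if len(group_id) < length:
        raise ValueError(f"Group ID too short to build a suffix: {group_id!r}")
    return f"{model_name}_{group_id[:length]}"


print(suffix_run_name("backward", "ZaKoAaeqR1GLEnMWRb2tDQ"))
```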

vrigal commented 2 months ago

@eu9ene I added a required --mode argument to force parsing the task group ID for GCP experiments run in Taskcluster. I noticed that the train.log files were located in a different folder than for snakemake experiments (in the main logs folder instead of models).

I also had to ignore some old snakemake metrics that cannot be parsed with the generic parser (e.g. eval_student-finetuned_tc_Tatoeba-Challenge-v2021-08-07).

Finally, the naming of the .metrics files was also different. I found no way to retrieve the "importer" part from the file names, which was usually the only field required to publish metrics (as <importer>_<?augmentation>_<?dataset>). All these examples are taken from this GCP folder:

student/aug-mix_Neulab-tedtalks_test-1-eng-lit.metrics -> {"importer": ?, "dataset": "Neulab-tedtalks_test-1-eng-lit", "aug": "aug-mix"}
student-finetuned/aug-upper-strict_wmt19.metrics -> {"importer": ?, "dataset": "wmt19", "aug": "aug-upper-strict"}
student-finetuned/devtest.metrics -> {"importer": ?, "dataset": "devtest", "aug": X}
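The three examples above can be reproduced with a small sketch. It assumes the optional augmentation always starts with `aug-` and is separated from the dataset by an underscore; the names `TC_METRICS_RE` and `parse_tc_metrics_name` are hypothetical, and the importer is left as `None` because it cannot be recovered from these file names.

```python
import re
from pathlib import Path
from typing import Optional

# Assumption: an optional "aug-..." prefix, then "_", then the dataset name.
TC_METRICS_RE = re.compile(r"^(?:(?P<aug>aug-[^_]+)_)?(?P<dataset>.+)$")


def parse_tc_metrics_name(path: str) -> dict[str, Optional[str]]:
    """Extract dataset and augmentation from a Taskcluster-style .metrics path."""
    file = Path(path)
    if file.suffix != ".metrics":
        raise ValueError(f"Not a .metrics file: {path!r}")
    match = TC_METRICS_RE.match(file.stem)
    if match is None:
        raise ValueError(f"Unexpected metrics file name: {path!r}")
    # The importer is unknown for this layout, so it is reported as None.
    return {"importer": None, "dataset": match["dataset"], "aug": match["aug"]}


for example in (
    "student/aug-mix_Neulab-tedtalks_test-1-eng-lit.metrics",
    "student-finetuned/aug-upper-strict_wmt19.metrics",
    "student-finetuned/devtest.metrics",
):
    print(example, "->", parse_tc_metrics_name(example))
```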

In old GCP experiments (snakemake), the .metrics files had a different form:

models/pt-en/test/evaluation1/backward/test.metrics -> {"importer": "test", "dataset": X, "aug": X}
models/pt-en/test/evaluation1/backward/flores_devtest.metrics -> {"importer": "flores", "dataset": "devtest", "aug": X}
models/en-sv/opusmt-multimodel-test/evaluation/student/tc_Tatoeba-Challenge-v2021-08-07.metrics -> {"importer": "tc", "dataset": "Tatoeba-Challenge-v2021-08-07", "aug": X}
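A rough sketch of how the older layout maps to fields, assuming the importer is everything before the first underscore and the remainder, if any, is the dataset. `parse_snakemake_metrics_name` is a hypothetical name, not the existing parser.

```python
from pathlib import Path
from typing import Optional


def parse_snakemake_metrics_name(path: str) -> dict[str, Optional[str]]:
    """Extract importer and dataset from a Snakemake-style .metrics path."""
    file = Path(path)
    if file.suffix != ".metrics":
        raise ValueError(f"Not a .metrics file: {path!r}")
    # Split on the first underscore only: datasets may contain underscores.
    importer, _, dataset = file.stem.partition("_")
    # Augmentation did not exist in this layout, so it is always None.
    return {"importer": importer, "dataset": dataset or None, "aug": None}


for example in (
    "models/pt-en/test/evaluation1/backward/test.metrics",
    "models/pt-en/test/evaluation1/backward/flores_devtest.metrics",
    "models/en-sv/opusmt-multimodel-test/evaluation/student/tc_Tatoeba-Challenge-v2021-08-07.metrics",
):
    print(example, "->", parse_snakemake_metrics_name(example))
```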

In the end I decided to keep the existing code for Snakemake experiments (which correctly publishes importer and dataset), and simply use the file name as importer (a required field) for Taskcluster GCP experiments. It will not be possible to compare those metrics with tasks published from the CI, though. This topic is complex and, in my opinion, deserves a dedicated issue.

I updated the script so it finds metrics now, and published everything again:

With GCP experiments from Taskcluster (parse_experiment_dir -d gcp_taskcluster -m taskcluster):
With GCP experiments from snakemake (parse_experiment_dir -d gcp_snakemake -m taskcluster):
- https://wandb.ai/teklia/799-old-en-sv
- https://wandb.ai/teklia/799-old-pt-en

eu9ene commented 2 months ago

Ok, I filed: #809 and https://github.com/mozilla/firefox-translations-training/issues/808

vrigal commented 1 month ago

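@eu9ene I added a parser in the utils module for GCP tasks (the augmentation parameter was not supported with the old Snakemake implementation). Everything should be fine now; I published again to our W&B namespace:

* https://wandb.ai/teklia/799-lt-en-TC
* https://wandb.ai/teklia/799-en-hu-TC
* https://wandb.ai/teklia/799-pt-en-SNAKEMAKE
* https://wandb.ai/teklia/799-en-sv-SNAKEMAKE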

eu9ene commented 1 month ago

> @eu9ene I added a parser in the utils module for GCP tasks (the augmentation parameter was not supported with the old Snakemake implementation). Everything should be fine now; I published again to our W&B namespace:
>
> * https://wandb.ai/teklia/799-lt-en-TC
> * https://wandb.ai/teklia/799-en-hu-TC
> * https://wandb.ai/teklia/799-pt-en-SNAKEMAKE
> * https://wandb.ai/teklia/799-en-sv-SNAKEMAKE

@vrigal looking at https://wandb.ai/teklia/799-lt-en-TC/runs/group_logs_K1iHn/workspace?nw=nwuserepavlov, evals seem to be displayed correctly but the group_logs table is missing. It is present in Snakemake runs.

vrigal commented 1 month ago

> > @eu9ene I added a parser in the utils module for GCP tasks (the augmentation parameter was not supported with the old Snakemake implementation). Everything should be fine now; I published again to our W&B namespace:
> >
> > * https://wandb.ai/teklia/799-lt-en-TC
> > * https://wandb.ai/teklia/799-en-hu-TC
> > * https://wandb.ai/teklia/799-pt-en-SNAKEMAKE
> > * https://wandb.ai/teklia/799-en-sv-SNAKEMAKE
>
> @vrigal looking at https://wandb.ai/teklia/799-lt-en-TC/runs/group_logs_K1iHn/workspace?nw=nwuserepavlov, evals seem to be displayed correctly but the group_logs table is missing. It is present in Snakemake runs.

Oh I see, thank you for the catch! This is the same issue I mentioned in https://github.com/mozilla/firefox-translations-training/pull/799#issuecomment-2298932937:

> train.log files were located in a different folder than for snakemake experiments (in the main logs folder instead of models).

I patched the group_logs hook and had to support patching model names as well (e.g. backward → backwards, teacher0 → teacher-0). Taskcluster experiments have been republished to W&B: https://wandb.ai/teklia/799-lt-en-TC and https://wandb.ai/teklia/799-en-hu-TC.
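A small sketch of such a name patch, following the direction of the arrows in the two examples above. The function name `patch_model_name` is hypothetical, and only the two rules quoted in this comment are covered.

```python
import re


def patch_model_name(name: str) -> str:
    """Map a model name to the spelling used in the other layout.

    Hypothetical helper covering only the two examples from the comment:
    backward -> backwards and teacher<N> -> teacher-<N>.
    """
    if name == "backward":
        return "backwards"
    # Insert a dash between "teacher" and its index; other names are unchanged.
    return re.sub(r"^(teacher)(\d+)$", r"\1-\2", name)


for model in ("backward", "teacher0", "student"):
    print(model, "->", patch_model_name(model))
```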

eu9ene commented 1 month ago

Tests are failing in CI:

FAILED tests/test_tracking_cli.py::test_experiments_marian_1_10 - assert {(20...
[task 2024-09-10T12:38:09.993Z] FAILED tests/test_tracking_cli.py::test_experiments_marian_1_12 - assert {(20...
[task 2024-09-10T12:38:09.993Z] ============ 2 failed, 140 passed, 3 warnings in 427.77s (0:07:07) =============

vrigal commented 1 month ago

Tests should be fine now. I made a few changes:

* Add a value to metrics parser ValueError exceptions.
* Avoid stopping publication on metric parsing failure.
* A last patch to prevent Snakemake experiments from looking at the models directory for metrics.
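The first two changes can be illustrated with a generic sketch: the parser raises a ValueError that carries the offending value, and the caller logs the failure and moves on instead of aborting. The names `parse_metric_value` and `parse_all` are hypothetical and do not reflect the actual tracking code.

```python
import logging
from typing import Iterable

logger = logging.getLogger(__name__)


def parse_metric_value(raw: str) -> float:
    """Parse one metric value, reporting the raw input on failure."""
    try:
        return float(raw)
    except ValueError:
        # Include the value itself so the failure can be diagnosed from logs.
        raise ValueError(f"Metric value could not be parsed: {raw!r}") from None


def parse_all(raw_values: Iterable[str]) -> list[float]:
    """Parse every value, skipping (and logging) the ones that fail."""
    parsed = []
    for raw in raw_values:
        try:
            parsed.append(parse_metric_value(raw))
        except ValueError as error:
            logger.warning("Skipping metric: %s", error)
    return parsed


print(parse_all(["27.4", "n/a", "0.55"]))
```

With this shape, one unparsable file costs a warning rather than the whole publication run.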

eu9ene commented 1 month ago

@vrigal I noticed that some of the models don't have config parameters (marian, arguments), for example https://wandb.ai/teklia/799-lt-en-TC/runs/teacher-0_H396I/overview

vrigal commented 1 month ago

Well, I also noticed a missing run: group_logs_ZaKoA. This was due to a bug in the GCP structure browsing. I had to rewrite this part and reupload everything to W&B. Everything seems consistent now. @eu9ene It would be nice to get this merged, so that I keep a little time to do the upload to our workspace, and then we do #574 (I should have 1 working day left to spend on it). By the way, can you provide a list of Taskcluster group IDs and GCP experiments, so I can erase everything on https://wandb.ai/teklia/projects and reupload those examples?

eu9ene commented 1 month ago

> Well, I also noticed a missing run: group_logs_ZaKoA. This was due to a bug in the GCP structure browsing. I had to rewrite this part and reupload everything to W&B. Everything seems consistent now. @eu9ene It would be nice to get this merged, so that I keep a little time to do the upload to our workspace, and then we do #574 (I should have 1 working day left to spend on it). By the way, can you provide a list of Taskcluster group IDs and GCP experiments, so I can erase everything on https://wandb.ai/teklia/projects and reupload those examples?

Ok, I looked briefly and noticed several teacher-finetune steps for https://wandb.ai/teklia/799-en-hu-TC/groups/baseline_enhu_PuI6mYZPTUqAfyZMTgeUng/workspace?nw=nwuserepavlov but I think we can skip this since it's an old experiment and we don't have this step now.

mozilla / translations

Suffix W&B runs with task group ID for offline Taskcluster publication from GCP #799