Closed vrigal closed 1 month ago
@eu9ene I ran W&B publication from the GCP folder using this command:

```
gsutil -m rsync -r -x '.*\.npz$|.*\.gz$' gs://moz-fx-translations-data--303e-prod-translations-data/ en-hu
```

Here are the corresponding projects in W&B:

* https://wandb.ai/teklia/799-lt-en
* https://wandb.ai/teklia/799-en-ru
* https://wandb.ai/teklia/799-en-hu

I made the wrong assumption that experiments from GCP that ran in Taskcluster are prefixed with `baseline_`, but I see many other group names (`retrain1`, `retrain2`, `fix_opt_delay`...). Can you provide a way to detect the task group ID from the folder name? Otherwise I can add an argument to the GCP experiments entrypoint that systematically considers the last 22 characters as the group ID.
The last 22 characters should work.

As we saw today, there's an issue in the naming: `backwards_typos_noise_ZaKoAaeqR1GLEnMWRb2tDQ`. It should be just `backward_ZaKoA` (we take the first 5 characters of the group now, right?)
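The "last 22 characters" heuristic and the "first 5 characters" shortening could be sketched as follows. This is a hypothetical illustration, not the repository's actual code; the function names and the regex are assumptions, relying only on the fact that Taskcluster task group IDs are 22-character slugs.

```python
# Sketch of the heuristics discussed above (hypothetical helper names).
# For a folder like "backwards_typos_noise_ZaKoAaeqR1GLEnMWRb2tDQ",
# split the trailing 22-char Taskcluster task group ID from the prefix.
import re

# Task group IDs are 22-char URL-safe base64 slugs at the end of the name.
TASK_GROUP_RE = re.compile(r"^(?P<prefix>.+)_(?P<group_id>[A-Za-z0-9_-]{22})$")

def split_experiment_folder(name: str) -> tuple:
    """Return (experiment_prefix, task_group_id) for a GCP folder name."""
    match = TASK_GROUP_RE.match(name)
    if not match:
        raise ValueError(f"Cannot find a 22-char task group ID in {name!r}")
    return match.group("prefix"), match.group("group_id")

def short_group_name(name: str) -> str:
    """Shorten to <prefix>_<first 5 chars of the group ID>."""
    prefix, group_id = split_experiment_folder(name)
    return f"{prefix}_{group_id[:5]}"
```

Note that for the example above this yields `backwards_typos_noise_ZaKoA`, not `backward_ZaKoA`: the prefix itself is part of the naming issue being discussed.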
@eu9ene I added a required `--mode` argument, to force parsing the task group ID for GCP experiments run in Taskcluster.

I noticed that the `train.log` files were located in a different folder than for Snakemake experiments (in the main `logs` folder instead of `models`).

I also had to ignore some old Snakemake metrics that cannot be parsed with the generic parser (e.g. `eval_student-finetuned_tc_Tatoeba-Challenge-v2021-08-07`).

Finally, the naming of the `.metrics` files was also different. I found no way to retrieve the "importer" part from the file names, which is usually the only part required to publish metrics (as `<importer>_<?augmentation>_<?dataset>`).
All those examples are taken from this GCP folder:

* `student/aug-mix_Neulab-tedtalks_test-1-eng-lit.metrics` -> `{"importer": ?, "dataset": "Neulab-tedtalks_test-1-eng-lit", "aug": "aug-mix"}`
* `student-finetuned/aug-upper-strict_wmt19.metrics` -> `{"importer": ?, "dataset": "wmt19", "aug": "aug-upper-strict"}`
* `student-finetuned/devtest.metrics` -> `{"importer": ?, "dataset": "devtest", "aug": X}`

In old GCP experiments (Snakemake), `.metrics` files were in a different form:

* `models/pt-en/test/evaluation1/backward/test.metrics` -> `{"importer": "test", "dataset": X, "aug": X}`
* `models/pt-en/test/evaluation1/backward/flores_devtest.metrics` -> `{"importer": "flores", "dataset": "devtest", "aug": X}`
* `models/en-sv/opusmt-multimodel-test/evaluation/student/tc_Tatoeba-Challenge-v2021-08-07.metrics` -> `{"importer": "tc", "dataset": "Tatoeba-Challenge-v2021-08-07", "aug": X}`

In the end I decided to keep the existing code for Snakemake experiments (which publishes importer and dataset correctly), and to simply use the file name as importer (which is a required field) for Taskcluster GCP experiments. It will not be possible to compare those metrics with the ones published from the CI, though. This topic is complex and, in my opinion, a new issue should be dedicated to it.
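The ambiguity described above can be illustrated with a small best-effort parser. This is a hypothetical sketch, not the repository's parser; `KNOWN_IMPORTERS` is an assumed list, and the fallback mirrors the workaround described (using the file name itself as the importer for Taskcluster experiments).

```python
# Hypothetical illustration of why Taskcluster GCP metrics file names cannot
# be split unambiguously into <importer>_<?augmentation>_<?dataset>.
KNOWN_IMPORTERS = {"flores", "tc", "sacrebleu", "mtdata"}  # assumed list

def parse_metrics_filename(stem: str) -> dict:
    """Best-effort split of a .metrics file stem (without extension)."""
    parts = stem.split("_", 1)
    if parts[0] in KNOWN_IMPORTERS and len(parts) == 2:
        # Snakemake-style name, e.g. "flores_devtest"
        return {"importer": parts[0], "dataset": parts[1], "aug": None}
    if stem.startswith("aug-"):
        # Taskcluster-style name, e.g. "aug-mix_Neulab-tedtalks_test-1-eng-lit":
        # the augmentation is recoverable but the importer is not.
        aug, _, dataset = stem.partition("_")
        return {"importer": None, "dataset": dataset or None, "aug": aug}
    # No importer recoverable; fall back to the whole stem (e.g. "devtest"),
    # as described in the comment above.
    return {"importer": stem, "dataset": None, "aug": None}
```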
I updated the script so it finds metrics now, and published everything again:

* `parse_experiment_dir -d gcp_taskcluster -m taskcluster`
* `parse_experiment_dir -d gcp_snakemake -m taskcluster`
Ok, I filed: #809 and https://github.com/mozilla/firefox-translations-training/issues/808
@eu9ene I added a parser in the `utils` module for GCP tasks (the augmentation parameter was not supported with the old Snakemake implementation). Everything should be fine now, I published again to our W&B namespace:

* https://wandb.ai/teklia/799-lt-en-TC
* https://wandb.ai/teklia/799-en-hu-TC
* https://wandb.ai/teklia/799-pt-en-SNAKEMAKE
* https://wandb.ai/teklia/799-en-sv-SNAKEMAKE

@vrigal looking at https://wandb.ai/teklia/799-lt-en-TC/runs/group_logs_K1iHn/workspace?nw=nwuserepavlov evals seem to be displayed correctly, but the group_logs table is missing. It is present in Snakemake runs.
Oh I see, thank you for the catch! This is the same issue as I mentioned in https://github.com/mozilla/firefox-translations-training/pull/799#issuecomment-2298932937:

> train.log files were located in a different folder than for Snakemake experiments (in the main logs folder instead of models).

I patched the group_logs hook and had to support model name patching as well (e.g. `backward` → `backwards`, `teacher0` → `teacher-0`).

Taskcluster experiments have been republished to W&B: https://wandb.ai/teklia/799-lt-en-TC and https://wandb.ai/teklia/799-en-hu-TC.
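The model-name patching mentioned above could look like the following. This is a hypothetical sketch; only the two mappings quoted in the comment are known for sure, and the helper name is an assumption.

```python
# Map model directory names found on GCP to the names expected by the
# W&B publication code (only these two mappings are quoted in the comment).
MODEL_NAME_PATCHES = {
    "backward": "backwards",
    "teacher0": "teacher-0",
}

def patch_model_name(name: str) -> str:
    """Return the normalized model name, falling back to the input unchanged."""
    return MODEL_NAME_PATCHES.get(name, name)
```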
Tests are failing in CI:

```
FAILED tests/test_tracking_cli.py::test_experiments_marian_1_10 - assert {(20...
[task 2024-09-10T12:38:09.993Z] FAILED tests/test_tracking_cli.py::test_experiments_marian_1_12 - assert {(20...
[task 2024-09-10T12:38:09.993Z] ============ 2 failed, 140 passed, 3 warnings in 427.77s (0:07:07) =============
```
Tests should be fine now. I made a few changes:

* `ValueError` exceptions
* `models` directory for metrics

@vrigal I noticed that some of the models don't have config parameters (marian, arguments), for example https://wandb.ai/teklia/799-lt-en-TC/runs/teacher-0_H396I/overview
Well, I also noticed a missing run: `group_logs_ZaKoA`. This was due to a bug in the GCP structure browsing. I had to rewrite this part and reupload everything to W&B.

Everything seems coherent now. @eu9ene It would be nice to manage merging this, so I keep a little time to make the upload to our workspace and then we do #574 (I should have 1 working day left to spend on it).

By the way, can you provide a list of Taskcluster group IDs and GCP experiments, so I can erase everything on https://wandb.ai/teklia/projects and reupload those examples?
Ok, I looked briefly and noticed several `teacher-finetune` steps for https://wandb.ai/teklia/799-en-hu-TC/groups/baseline_enhu_PuI6mYZPTUqAfyZMTgeUng/workspace?nw=nwuserepavlov, but I think we can skip this since it's an old experiment and we don't have this step now.
Closes #734