mozilla / firefox-translations-training

Training pipelines for Firefox Translations neural machine translation models
https://mozilla.github.io/firefox-translations-training/
Mozilla Public License 2.0

Models are missing in group logs #874

Open eu9ene opened 3 weeks ago

eu9ene commented 3 weeks ago

Some models are missing from the final evals table in the group logs. This seems to happen with both online and offline uploading. Here's an example for zh-en. It was tracked online with slightly outdated code, so I'm not 100% sure.

https://wandb.ai/moz-translations/zh-en/runs/group_logs_dziji?nw=nwuserepavlov

The same group was uploaded recently by @vrigal with the latest offline uploader:

https://wandb.ai/teklia/zh-en-taskcluster/runs/group_logs_dziji?nw=nwuserepavlov

Both are missing some models. If we look at the group in Taskcluster, we can see it includes evals for: teacher1, teacher2, teacher-ensemble, student, and finetuned-student. Only some of those models are present in the tables.

https://firefox-ci-tc.services.mozilla.com/tasks/groups/dzijiL-PQ4ScKBB3oIjGQg#eval

I wonder if it can be due to W&B display issues. See #716

vrigal commented 3 weeks ago

The example you mention ran on 2024-09-25T17:32:07.715Z, so I think the code is up to date except for the COMET score, which has since been patched (x100). The charts published with the runs seem correct, except that I cannot see the quantized metrics in the reupload. I'm not sure why this happens with the Taskcluster group publication.

Concerning the missing lines in the group table from the online run, it looks like an issue with how the table is incremented. Successive versions of the table artifact seem to override the previous row. This is particularly visible with versions 17→19.

Concerning the missing runs, the table is published once but contains only 7 entries (out of 40):

https://api.wandb.ai/files/teklia/zh-en-taskcluster/group_logs_dziji/media/table/metrics_0_2f49ec3ae32495b07610.table.json

This will require more investigation. It may be due to tasks that the parse_tc_group entrypoint does not support.
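To make the gap concrete, here is a minimal sketch of how a downloaded table file could be checked against the models listed in the Taskcluster group. It assumes the JSON has top-level `columns` and `data` keys and that one column holds the model name. The column name `Model`, the helper `missing_models`, and the sample payload are hypothetical and not taken from the tracking code, and the model names in the real table may be spelled differently.

```python
# Models that have evals in the Taskcluster group, as listed in this issue.
EXPECTED_MODELS = [
    "teacher1",
    "teacher2",
    "teacher-ensemble",
    "student",
    "finetuned-student",
]


def missing_models(table: dict, model_column: str = "Model") -> list[str]:
    """Return the expected models that have no row in the table payload.

    `table` is assumed to look like {"columns": [...], "data": [[...], ...]}.
    `model_column` is a hypothetical column name; adjust it to the real one.
    """
    idx = table["columns"].index(model_column)
    present = {row[idx] for row in table["data"]}
    return [m for m in EXPECTED_MODELS if m not in present]


# Made-up payload for illustration only; the metric values are placeholders.
sample = {
    "columns": ["Model", "Dataset", "chrF"],
    "data": [
        ["teacher1", "example-dataset", 0.0],
        ["student", "example-dataset", 0.0],
    ],
}

print(missing_models(sample))
```

With the real file, the payload could be loaded with `json.load` and passed in place of `sample`. An empty list would mean every expected model has at least one row.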

I noticed another important issue concerning the reuploading workflow: https://github.com/mozilla/firefox-translations-training/issues/875