mozilla / translations

The code, training pipeline, and models that power Firefox Translations
https://mozilla.github.io/translations/
Mozilla Public License 2.0

Duplicate runs in W&B #729

Closed: eu9ene closed this issue 3 months ago

eu9ene commented 3 months ago

https://wandb.ai/moz-translations/tr-en/workspace?nw=nwuserepavlov

https://firefox-ci-tc.services.mozilla.com/tasks/groups/SDD81N6sRu61LOL4xZJc-Q

[Screenshot: 2024-07-09 at 11:54:51 AM]

eu9ene commented 3 months ago

Same here:

https://wandb.ai/moz-translations/vi-en/workspace?nw=nwuserepavlov

https://firefox-ci-tc.services.mozilla.com/tasks/groups/Nc0SHbrgQaiFt4_FmKBXOA

I guess this started recently. It seems the eval tasks are causing it.

vrigal commented 3 months ago

This is a serious bug, as all the following evaluation tasks fail with this message:

Multiple W&B runs already exist with name 'teacher-1': [<Run moz-translations/tr-en/ged9nebc (finished)>, <Run moz-translations/tr-en/shovl7mn (finished)>, …]. No data will be published.

This may be caused by the current implementation, which lists existing runs before publishing (the ID is required to resume a run in W&B). A race condition is possible between two or more tasks (those tasks ran within ~3 minutes of each other): each one lists the runs, finds none with the expected name, and creates its own, leaving multiple runs with the same display name and triggering this bug.
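To illustrate the suspected race, here is a minimal sketch of the list-then-create pattern. It uses a hypothetical in-memory `FakeTracker` as a stand-in for the run store (it is not the W&B API and not the project's actual publishing code), and it interleaves the two tasks by hand so that both list before either creates:

```python
class FakeTracker:
    """Hypothetical in-memory stand-in for a run store (not the W&B API)."""

    def __init__(self):
        self.runs = []  # list of (run_id, display_name) tuples
        self._counter = 0

    def list_runs(self, name):
        # Return every stored run whose display name matches.
        return [run for run in self.runs if run[1] == name]

    def create_run(self, name):
        # The store assigns a fresh ID; display names are not unique.
        self._counter += 1
        run_id = f"run{self._counter}"
        self.runs.append((run_id, name))
        return run_id


def simulate_race(name="teacher-1"):
    tracker = FakeTracker()
    # Both tasks list existing runs before either one has created anything.
    seen_by_task_a = tracker.list_runs(name)
    seen_by_task_b = tracker.list_runs(name)
    # Each task saw an empty list, so each creates a new run.
    if not seen_by_task_a:
        tracker.create_run(name)
    if not seen_by_task_b:
        tracker.create_run(name)
    return tracker.list_runs(name)


duplicates = simulate_race()
```

After this interleaving, `duplicates` holds two runs with the display name `teacher-1`, which is the state the error message above complains about.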

This more or less confirms that our approach of using unique run IDs (and names) is the way to go.
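As a rough sketch of what a unique, reproducible run ID could look like: the helper below derives the ID from the task group ID and the run name, so concurrent tasks compute the same ID without listing existing runs first. `make_run_id` and the ID layout are assumptions made for illustration, not the scheme actually adopted in this repository.

```python
import hashlib


def make_run_id(group_id: str, run_name: str) -> str:
    """Hypothetical helper: derive a stable run ID from group ID and run name."""
    # Hash the pair so the same inputs always produce the same ID.
    digest = hashlib.sha256(f"{group_id}/{run_name}".encode("utf-8")).hexdigest()
    # Keep the readable name as a prefix and add a short hash suffix.
    return f"{run_name}-{digest[:8]}"


run_id = make_run_id("SDD81N6sRu61LOL4xZJc-Q", "teacher-1")
```

An ID like this could then be passed as the `id` argument of `wandb.init` together with `resume="allow"`, so every task that publishes to the same logical run targets it by ID rather than by looking up the display name.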