mozilla / firefox-translations-training

Training pipelines for Firefox Translations neural machine translation models
https://mozilla.github.io/firefox-translations-training/
Mozilla Public License 2.0

Tracking does not support overriding a run: wandb [409] run was previously created and deleted #875

Open vrigal opened 5 days ago

vrigal commented 5 days ago

Publishing from a Taskcluster group with the --overide-runs argument successfully deletes the group's existing runs, but fails to create the new ones:

wandb: ERROR Error while calling W&B API: run teacher-1_dziji was previously created and deleted; try a new run name (<Response [409]>)

Note: it is the ID that conflicts here, not the name as the message above suggests.

Furthermore, the client hangs for 90 s:

wandb.errors.CommError: Run initialization has timed out after 90.0 sec.

This is annoying because we cannot both identify runs by a unique ID (<name>_<group_id>) and allow overriding a run in an existing project. Unfortunately, deleting all artifacts from the project does not seem to fix it. A possible quick fix would be to catch this exception and retry with a suffix (name and ID would then be teacher-1_dziji_1, teacher-1_dziji_2…), which should work, although the display would not be ideal and may be confusing; we should at least consider documenting it.
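The retry-with-suffix idea could look roughly like the sketch below. It is only an illustration: `init_with_suffix`, `retry_on` and `max_attempts` are hypothetical names, not part of the tracking code or of the wandb client. The initializer is passed in as a callable so the logic can be exercised without a W&B account; in the tracking code one would presumably pass `wandb.init` and `wandb.errors.CommError`.

```python
from typing import Any, Callable, Tuple, Type


def init_with_suffix(
    init_fn: Callable[..., Any],
    base_id: str,
    retry_on: Tuple[Type[BaseException], ...],
    max_attempts: int = 5,
    **init_kwargs: Any,
) -> Any:
    """Try base_id first, then base_id_1, base_id_2, ... as both name and ID.

    Hypothetical helper: init_fn is assumed to accept `id` and `name`
    keyword arguments, as wandb.init does.
    """
    last_error = None
    for attempt in range(max_attempts):
        # The first attempt keeps the plain ID; later ones append _1, _2, ...
        run_id = base_id if attempt == 0 else f"{base_id}_{attempt}"
        try:
            return init_fn(id=run_id, name=run_id, **init_kwargs)
        except retry_on as error:
            # Assumption: any error of these types means the ID is unusable.
            last_error = error
    raise RuntimeError(
        f"No usable run ID for {base_id!r} after {max_attempts} attempts"
    ) from last_error
```

Called as `init_with_suffix(wandb.init, "teacher-1_dziji", (wandb.errors.CommError,), project=..., group=...)`, each failed attempt would still block for the 90 s timeout mentioned above, so the number of attempts should stay small.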

I think W&B disallows overriding a run because it keeps the data so that deleted runs can be restored for 7 days (see this issue: https://github.com/wandb/wandb/issues/6395). In the worst case, we could clean everything now (with --overide-runs), then hope that reuploading works in a week. It would be worth contacting the W&B team about this.

I suppose we had not run into this until we started using similar names and IDs to identify runs in the bar charts.

eu9ene commented 5 days ago

@vrigal does it block us from reuploading? Can I just delete it manually?

vrigal commented 2 days ago

Unfortunately, the behavior seems to be the same from the web interface (as suggested in their issue). I suggest discussing this with the W&B team, to at least make sure we can override a run after 7 days. They may also be able to delete the run from the DB directly so we can publish again (or offer some short-term alternative, e.g. erasing the content of a run and then using the client's resume option).

Example to reproduce:

>>> import wandb
>>> run = wandb.init(project="test", group="test_group", name="test", id="test")
>>> run.finish()
>>> for run in wandb.Api().runs("test", filters={"group": "test_group"}):
...     run.delete()
...
>>> run = wandb.init(project="test", group="test_group", name="test", id="test")
[…]
CommError: Run initialization has timed out after 90.0 sec.