Open optimass opened 3 years ago
Leslie commented: Hi optimass,
We currently don't allow for unfinished/crashed runs to be uploaded.
Warmly, Leslie
Leslie commented: Since we have not heard back from you we are going to close this request. If you would like to re-open the conversation, please let us know!
Hi Leslie!
Not sure what is left to be said. I guess it would be nice if there was some support for this :)
Leslie commented: I can create a ticket for this issue :)
Hi, is there any way to circumvent this issue? Thanks!
I faced the same issue.
I run the code on the Slurm platform with time limits. Once the time limit is reached, the whole run is killed immediately. After that, I run wandb sync ./wandb/offline-run-xxx and the metric figures are shown in the GUI normally; however, the table contains neither the config parameters nor the metrics.
The strange thing is that when I also run wandb sync --view --verbose wandb/offline-run-xxx.wandb, I find that the config parameters are actually already contained in the xxx.wandb file. But the files/config.yaml file only contains system parameters, such as software version information, without the experiment parameters I need.
So I think this issue might be fixed by allowing the config parameters saved in the xxx.wandb file to be uploaded to the GUI table? It is very inconvenient to have to check the xxx.wandb file to obtain the config parameters after an offline run has crashed.
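Until incomplete offline runs sync their config, one way to avoid digging through the xxx.wandb file is to write the parameters to a small sidecar file at the start of the run. The sketch below is only an illustration and is not a wandb feature: the helper names save_config_snapshot and load_config_snapshot, and the file name config_snapshot.json, are hypothetical and chosen here for the example.

```python
import json
from pathlib import Path


def save_config_snapshot(config, run_dir, filename="config_snapshot.json"):
    """Write config (a dict or an argparse.Namespace) to run_dir as JSON.

    Hypothetical helper: the file name and location are assumptions,
    not something wandb reads or writes.
    """
    data = dict(config) if isinstance(config, dict) else dict(vars(config))
    path = Path(run_dir) / filename
    path.parent.mkdir(parents=True, exist_ok=True)
    # default=str keeps values that are not JSON-serializable readable
    # instead of raising an error.
    path.write_text(json.dumps(data, indent=2, sort_keys=True, default=str))
    return path


def load_config_snapshot(run_dir, filename="config_snapshot.json"):
    """Read back the parameters saved by save_config_snapshot."""
    return json.loads((Path(run_dir) / filename).read_text())
```

Calling save_config_snapshot(args, some_directory) right after argument parsing means the parameters survive even if the job is killed before the run finishes.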
Hi Leslie!
Not sure what is left to be said. I guess it would be nice if there was some support for this :)
I am also at Mila and I faced this issue when I ran my code on Compute Canada. Were you in the same situation as me? Do you have any solution or idea for how to figure it out?
I am facing the same challenge as @dqgdqg. Do we have a workaround for this issue?
@samrudhdhirangrej can you share exactly what problem you're facing? If your script passes a config value to wandb.init, that should be getting persisted in the sync file and config.yaml. Are you able to share your script and a *.wandb file so we can reproduce?
I am also a Compute Canada user and run code on the Slurm platform. Slurm terminates a job once we exhaust the time limit. Oftentimes, this happens before training is complete or wandb.finish() is called.
Since the compute node doesn't have an internet connection, we use offline mode for wandb and use wandb sync to sync an incomplete offline run. In this particular case, we cannot see the config in the wandb Overview and Table tabs. However, the config is stored in the run-xxxxxxxx.wandb file.
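Because the problem shows up when the job is killed before wandb.finish() is called, one possible mitigation is to catch SIGTERM and finish the run before exiting. The sketch below is a minimal illustration under two assumptions that this thread does not confirm: that the scheduler delivers SIGTERM to the Python process with some grace period before a hard kill, and that a run finished this way then syncs its config normally. The helper name install_sigterm_finalizer is hypothetical; finish_fn is meant to be something like wandb.finish.

```python
import signal
import sys


def install_sigterm_finalizer(finish_fn, exit_fn=sys.exit, exit_code=143):
    """Run finish_fn once when SIGTERM arrives, then call exit_fn(exit_code).

    Hypothetical helper. exit_fn is injectable only so the handler can be
    exercised without ending the process.
    """
    state = {"finished": False}

    def handler(signum, frame):
        if not state["finished"]:
            state["finished"] = True
            finish_fn()  # e.g. wandb.finish, so the run is closed cleanly
        exit_fn(exit_code)  # 143 = 128 + 15 (SIGTERM), the conventional status

    # signal.signal must be called from the main thread; it returns the
    # previously installed handler.
    return signal.signal(signal.SIGTERM, handler)
```

In a training script this would be called once after wandb.init(...), for example install_sigterm_finalizer(wandb.finish). Whether the finish call completes within the scheduler's grace period depends on the cluster configuration.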
Hi @vanpelt. I have created a minimal example to explain the issue. I used signal.SIGTERM to mimic Slurm's behavior when the time limit is reached. I ran the following code twice, the second time with os.environ["WANDB_MODE"] = "offline" commented out.
import os
import signal
import argparse
import wandb
import time

# Comment this line out to run in online mode.
os.environ["WANDB_MODE"] = "offline"

parser = argparse.ArgumentParser()
parser.add_argument('--seed', type=int, default=0)
parser.add_argument('--epochs', default=100, type=int)
args = parser.parse_args()

wandbid = wandb.util.generate_id()
print(wandbid)
wandb.init(project="test_project", id=wandbid, resume="allow", dir='./')
wandb.config.update(args, allow_val_change=True)
wandb.define_metric('Loss', summary="min")

for ep in range(args.epochs):
    wandb.log({'Loss': 0.0}, step=ep)
    time.sleep(1)
    print(ep)
    # Send SIGTERM partway through to mimic Slurm killing the job.
    if ep == 30:
        os.kill(os.getpgid(0), signal.SIGTERM)
After syncing an offline run the Table page is as follows.
The output of the tree command is as follows.
wandb
├── debug-cli.log
├── debug-internal.log -> offline-run-20220328_093814-1q6a1n1p/logs/debug-internal.log
├── debug.log -> offline-run-20220328_093814-1q6a1n1p/logs/debug.log
├── latest-run -> offline-run-20220328_093814-1q6a1n1p
├── offline-run-20220328_093814-1q6a1n1p
│   ├── files
│   │   ├── config.yaml
│   │   ├── output.log
│   │   ├── requirements.txt
│   │   └── wandb-metadata.json
│   ├── logs
│   │   ├── debug-internal.log
│   │   └── debug.log
│   ├── run-1q6a1n1p.wandb
│   ├── tmp
│   │   └── code
│   └── wandb
└── run-20220328_093725-20tgsy55
    ├── files
    │   ├── code
    │   │   └── debug_wandb
    │   │       └── debug_wandb.py
    │   ├── config.yaml
    │   ├── diff.patch
    │   ├── output.log
    │   ├── requirements.txt
    │   ├── wandb-metadata.json
    │   └── wandb-summary.json
    ├── logs
    │   ├── debug-internal.log
    │   └── debug.log
    ├── run-20tgsy55.wandb
    └── tmp
        └── code
Note that the arguments seed and epochs appear correctly in both wandb/run-20220328_093725-20tgsy55/run-20tgsy55.wandb and wandb/offline-run-20220328_093814-1q6a1n1p/run-1q6a1n1p.wandb.
Finally, I am using wandb==0.12.9.
Thank you.
Leslie commented: We can't sync an incomplete run. Meanwhile we have already filed a feature request. As a workaround you can resume the run on the same file system as the link here states. Once the run finishes, you can sync it to the UI.
Leslie commented: Hi, we wanted to follow up with you regarding your support request as we have not heard back from you. Please let us know if we can be of further assistance or if your issue has been resolved.
Hi Leslie,
Thank you for your response. The suggested temporary solution works. However, I look forward to seeing this feature added to wandb.
I am facing the same challenge as @dqgdqg. Do we have a workaround for this issue?
Unfortunately, I have no idea so far. Maybe I am going to try Comet.ml or other products like that on the Slurm platform temporarily.
Leslie commented: Can you tell me what Comet.ml or other products do to implement something like this?
Any update on this issue? I'm facing the same situation. I guess it would be useful if wandb sync . also synced the configs for incomplete runs.
Hi @yihong-chen, currently the ticket is in our queue, but I don't have a timeline for it yet.
Hello!
As a workaround you can resume the run on the same file system as the link here states.
Which link was this referring to? When attempting to use the example code provided by @samrudhdhirangrej above, I produce a killed offline run which contains no config.yaml file. If I resume the run with wandb.init(project=project_name, id=run_id, resume='allow'), this does not sync the config values. What is the current recommended workaround?
Hi,
I've been using wandb sync to upload offline runs lately, using os.environ['WANDB_MODE'] = 'dryrun'. Somehow, if the run isn't completed, all of my arguments (uploaded with wandb.config.update(args)) won't appear until the end of the run. A similar behaviour happens when I upload metrics: they appear in the figures, but the data only appears in the table at the end of the run.
Let me know if you need more information about my setup and/or code.
Thanks!