wandb / wandb

The AI developer platform. Use Weights & Biases to train and fine-tune models, and manage models from experimentation to production.
https://wandb.ai
MIT License

[App] wandb sync for offline runs doesn't upload all data until run is finished. #2919

Open optimass opened 3 years ago

optimass commented 3 years ago

Hi,

I've been using wandb sync to upload offline runs lately, with os.environ['WANDB_MODE'] = 'dryrun'. If the run isn't completed, none of my arguments (uploaded with wandb.config.update(args)) appear until the end of the run. Something similar happens with metrics: they appear in the figures, but the data only appears in the table at the end of the run.

Let me know if you need more information about my setup and/or code.

Thanks!

exalate-issue-sync[bot] commented 3 years ago

Leslie commented: Hi optimass,

We currently don't allow for unfinished/crashed runs to be uploaded.

Warmly, Leslie

exalate-issue-sync[bot] commented 3 years ago

Leslie commented: Since we have not heard back from you we are going to close this request. If you would like to re-open the conversation, please let us know!

optimass commented 3 years ago

Hi Leslie!

Not sure what is left to be said. I guess it would be nice if there was some support for this :)

exalate-issue-sync[bot] commented 2 years ago

Leslie commented: I can create a ticket for this issue :)

theobdt commented 2 years ago

Hi, is there any way to circumvent this issue? Thanks!

dqgdqg commented 2 years ago

I faced the same issue.

I run my code on the Slurm platform with time limits. Once the time limit is reached, the whole run is killed immediately. After that, I run wandb sync ./wandb/offline-run-xxx and the metric figures are shown in the GUI normally; however, the table contains neither the config parameters nor the metrics.

The strange thing is that when I ran wandb sync --view --verbose wandb/offline-run-xxx.wandb, I found that the config parameters are actually already contained in the xxx.wandb file. But the files/config.yaml file only contains system parameters such as software version information, without the experiment's parameters.

So I think this issue might be fixed by allowing the config parameters saved in the xxx.wandb file to be uploaded to the GUI table? It is very inconvenient to dig through the xxx.wandb file to recover the config parameters after an offline run crashes.
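A quick way to check for the symptom described above is to list which keys in files/config.yaml are not wandb's own. The following is only a minimal sketch: it assumes config.yaml is simple block-style YAML with one top-level key per parameter, and that `wandb_version` and `_wandb` are the internal keys. Both are assumptions, not something confirmed in this thread, and the helper names are hypothetical.

```python
from pathlib import Path

# Keys assumed to be written by wandb itself (an assumption, not an exhaustive list).
INTERNAL_KEYS = {"wandb_version", "_wandb"}


def top_level_keys(yaml_text: str) -> list[str]:
    """Return the top-level mapping keys of simple block-style YAML text."""
    keys = []
    for line in yaml_text.splitlines():
        # Skip blank lines, comments, list items, and indented (nested) lines.
        if not line or line[0] in " \t#-":
            continue
        key, sep, _ = line.partition(":")
        if sep:
            keys.append(key.strip().strip("'\""))
    return keys


def user_config_keys(config_path: Path) -> list[str]:
    """List the keys in config.yaml that are not wandb-internal."""
    text = config_path.read_text(encoding="utf-8")
    return [k for k in top_level_keys(text) if k not in INTERNAL_KEYS]
```

Calling `user_config_keys(Path("wandb/offline-run-xxx/files/config.yaml"))` and getting an empty list would match the behaviour reported here, where only system information ends up in config.yaml.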

dqgdqg commented 2 years ago

> Hi Leslie!
>
> Not sure what is left to be said. I guess it would be nice if there was some support for this :)

I am also at Mila and I faced this issue when I ran my code on Compute Canada. Were you in the same situation as me? Do you have any solution or idea for how to work around it?

samrudhdhirangrej commented 2 years ago

I am facing the same challenge as @dqgdqg. Do we have a workaround for this issue?

vanpelt commented 2 years ago

@samrudhdhirangrej can you share exactly what problem you're facing? If your script passes a config value to wandb.init, that should be getting persisted in the sync file and config.yaml. Are you able to share your script and a *.wandb file so we can reproduce?

samrudhdhirangrej commented 2 years ago

I am also a Compute Canada user and run my code on the Slurm platform. Slurm terminates a job once the time limit is exhausted. This often happens before training is complete or wandb.finish() is called. Since the compute nodes don't have an internet connection, we use wandb's offline mode and then use wandb sync to sync the incomplete offline run. In this case, we cannot see the config in the wandb Overview and Table tabs. However, the config is stored in the run-xxxxxxxx.wandb file.
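One possible mitigation for the situation described above is to catch the termination signal and leave the training loop cleanly, so that the finishing call still runs. The sketch below is not a confirmed fix: it assumes the scheduler delivers SIGTERM to the Python process with enough time left before the hard kill, and `StopFlag`, `run_epochs`, `log_fn`, and `finish_fn` are hypothetical names used for illustration only. It is intended for Unix-like systems.

```python
import signal


class StopFlag:
    """Records that a termination signal arrived so the loop can exit cleanly."""

    def __init__(self, signals=(signal.SIGTERM,)):
        self.requested = False
        for sig in signals:
            signal.signal(sig, self._handle)

    def _handle(self, signum, frame):
        # Only set a flag here; the actual cleanup happens in the main loop.
        self.requested = True


def run_epochs(epochs, stop, log_fn, finish_fn):
    """Run up to `epochs` steps, stopping early once a signal was received."""
    completed = 0
    try:
        for ep in range(epochs):
            if stop.requested:
                break
            log_fn({"Loss": 0.0}, ep)
            completed += 1
    finally:
        # Runs both on an early stop and on normal completion.
        finish_fn()
    return completed
```

In a real script, `log_fn` could be `lambda metrics, step: wandb.log(metrics, step=step)` and `finish_fn` could be `wandb.finish`. Whether a run finished this way then shows its config after wandb sync is not established in this thread.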

samrudhdhirangrej commented 2 years ago

Hi @vanpelt. I have created a minimal example to explain the issue. I used signal.SIGTERM to mimic Slurm's behavior when the time limit is reached. I ran the following code twice, the second time with os.environ["WANDB_MODE"] = "offline" commented out.

import os
import signal
import argparse
import wandb
import time

# Force offline mode; comment this line out for the online comparison run.
os.environ["WANDB_MODE"] = "offline"

parser = argparse.ArgumentParser()
parser.add_argument('--seed', type=int, default=0)
parser.add_argument('--epochs', default=100, type=int)
args = parser.parse_args()

# Create the run and store the parsed arguments in its config.
wandbid = wandb.util.generate_id()
print(wandbid)
wandb.init(project="test_project", id=wandbid, resume="allow", dir='./')
wandb.config.update(args, allow_val_change=True)

wandb.define_metric('Loss', summary="min")
for ep in range(args.epochs):
    wandb.log({'Loss': 0.0}, step=ep)
    time.sleep(1)
    print(ep)
    # Mimic Slurm hitting the time limit: SIGTERM arrives before wandb.finish() is called.
    if ep == 30:
        os.kill(os.getpgid(0), signal.SIGTERM)

After syncing the offline run, the Table page is as follows. [image]

The output of the tree command is as follows.

wandb
├── debug-cli.log
├── debug-internal.log -> offline-run-20220328_093814-1q6a1n1p/logs/debug-internal.log
├── debug.log -> offline-run-20220328_093814-1q6a1n1p/logs/debug.log
├── latest-run -> offline-run-20220328_093814-1q6a1n1p
├── offline-run-20220328_093814-1q6a1n1p
│   ├── files
│   │   ├── config.yaml
│   │   ├── output.log
│   │   ├── requirements.txt
│   │   └── wandb-metadata.json
│   ├── logs
│   │   ├── debug-internal.log
│   │   └── debug.log
│   ├── run-1q6a1n1p.wandb
│   ├── tmp
│   │   └── code
│   └── wandb
└── run-20220328_093725-20tgsy55
    ├── files
    │   ├── code
    │   │   └── debug_wandb
    │   │       └── debug_wandb.py
    │   ├── config.yaml
    │   ├── diff.patch
    │   ├── output.log
    │   ├── requirements.txt
    │   ├── wandb-metadata.json
    │   └── wandb-summary.json
    ├── logs
    │   ├── debug-internal.log
    │   └── debug.log
    ├── run-20tgsy55.wandb
    └── tmp
        └── code

Note that the arguments seed and epochs appear correctly in both wandb/run-20220328_093725-20tgsy55/run-20tgsy55.wandb and wandb/offline-run-20220328_093814-1q6a1n1p/run-1q6a1n1p.wandb.

Finally, I am using wandb==0.12.9.

Thank you.

exalate-issue-sync[bot] commented 2 years ago

Leslie commented: We can't sync an incomplete run. Meanwhile we have already filed a feature request. As a workaround you can resume the run on the same file system as the link here states. Once the run finishes, you can sync it to the UI.
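Resuming a run requires its run id. In the directory listing earlier in this thread, the id appears to be the suffix of the offline run directory name (for example `1q6a1n1p` in `offline-run-20220328_093814-1q6a1n1p`). The sketch below collects those ids. The naming pattern is inferred from that listing rather than from documentation, and `find_offline_runs` is a hypothetical helper.

```python
import re
from pathlib import Path

# Directory naming assumed from the listing in this thread:
# offline-run-<YYYYMMDD>_<HHMMSS>-<run id>
OFFLINE_RUN_RE = re.compile(r"^offline-run-(\d{8}_\d{6})-(\w+)$")


def find_offline_runs(wandb_dir):
    """Map run id -> offline run directory for every match under `wandb_dir`."""
    runs = {}
    for child in sorted(Path(wandb_dir).iterdir()):
        match = OFFLINE_RUN_RE.match(child.name)
        if child.is_dir() and match:
            runs[match.group(2)] = child
    return runs
```

Each id could then be passed to `wandb.init(project=..., id=run_id, resume="allow")`, as in the example script earlier in this thread, and the directory given to wandb sync once the run has finished. This thread does not establish that resuming restores the missing config values.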

exalate-issue-sync[bot] commented 2 years ago

Leslie commented: Hi, we wanted to follow up with you regarding your support request as we have not heard back from you. Please let us know if we can be of further assistance or if your issue has been resolved.

samrudhdhirangrej commented 2 years ago

Hi Leslie,

Thank you for your response. The suggested temporary solution works. However, I look forward to seeing this feature added to wandb.

dqgdqg commented 2 years ago

> I am facing the same challenge as @dqgdqg. Do we have a workaround for this issue?

Unfortunately, I have no solution so far. I may temporarily try Comet.ml or a similar product on the Slurm platform.

exalate-issue-sync[bot] commented 2 years ago

Leslie commented: Can you tell me what Comet.ml or other products do to implement something like this?

yihong-chen commented 2 years ago

Any update on this issue? I'm facing the same situation. I guess it would be useful if wandb sync . also synced the configs for incomplete runs.

lesliewandb commented 2 years ago

Hi @yihong-chen, the ticket is currently in our queue, but I don't have a timeline for it yet.

golmschenk commented 11 months ago

Hello!

> As a workaround you can resume the run on the same file system as the link here states.

Which link was this referring to? When attempting to use the example code provided by @samrudhdhirangrej above, I produce an offline killed run which contains no config.yaml file. If I resume the run with wandb.init(project=project_name, id=run_id, resume='allow'), this does not sync the config values. What is the current recommended workaround?