mllam / neural-lam

Neural Weather Prediction for Limited Area Modeling
MIT License

wandb not initialized in train_model.py #5

Closed mpvginde closed 7 months ago

mpvginde commented 7 months ago

Hi,

I'm currently doing some first tests with the train_model.py script. I'm quite new to wandb, so I might have missed something, but it seems that wandb.init('neural-lam') is never called, which leads to the following error:
wandb.errors.Error: You must call wandb.init() before wandb.define_metric(), which traces back to neural_lam/utils.py, line 203, in init_wandb_metrics.

Adding wandb.init(project="neural-lam") here:

    if trainer.global_rank == 0:
        wandb.init(project="neural-lam")
        utils.init_wandb_metrics() # Do after wandb.init

seems to work, but I guess the name of the project should be read from constants.py.
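A minimal sketch of that idea, reading the project name from constants.py instead of hard-coding it. The attribute name WANDB_PROJECT is an assumption for illustration; neural-lam's constants.py may use a different name.

```python
# Sketch only: look up the wandb project name on a constants module,
# falling back to a default. WANDB_PROJECT is a hypothetical attribute
# name; check neural_lam/constants.py for the real one.
def get_wandb_project(constants_module, default="neural-lam"):
    return getattr(constants_module, "WANDB_PROJECT", default)

# In train_model.py this could then be used as:
#   if trainer.global_rank == 0:
#       wandb.init(project=get_wandb_project(constants))
#       utils.init_wandb_metrics()  # Do after wandb.init
```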

Kind regards, Michiel

joeloskarsson commented 7 months ago

Hi, wandb.init should be called under the hood by the WandbLogger (see https://github.com/joeloskarsson/neural-lam/blob/6377d447d41e9828d4c9ee4c1fd13964f1c22d20/train_model.py#L127-L128), so there should be no need to do this manually (lightning docs: https://lightning.ai/docs/pytorch/stable/api/lightning.pytorch.loggers.wandb.html#lightning.pytorch.loggers.wandb.WandbLogger).

I remember having this issue earlier when I was working out the details around multi-GPU training, but sorted it out then (that is why the if trainer.global_rank == 0: is there). I can't seem to get this error message even if I log out of wandb. Are you on the latest commit? What kind of hardware are you running on (cpu/single-gpu/multi-gpu)?

mpvginde commented 7 months ago

Hi Joel, thanks for your reply. I'm running on a single GPU (an interactive PBS job on a GPU cluster where I request only 1 GPU), and I'm using commit 6377d44.

joeloskarsson commented 7 months ago

Some detective work later, I think I have figured out the issue. The WandbLogger used to call wandb.init when created (wandb.init is called the first time the experiment property of the logger object is accessed). But this was changed in https://github.com/Lightning-AI/lightning/commit/71559b6768653212750dd0c653dc64f259e1bbd1 (a small change for lightning, but it does break things here). I think this change was included in lightning 2.1.1, but my environment was still on 2.0.9.
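For illustration, the behavioural difference can be sketched with a minimal lazy property. This is not lightning's actual code, just a toy model of the change:

```python
class LazyLoggerSketch:
    """Toy model of a logger whose run is only created on first access."""

    def __init__(self, project):
        self.project = project
        self._experiment = None  # no wandb.init at construction time

    @property
    def experiment(self):
        # Stand-in for wandb.init: only runs when the property is accessed.
        if self._experiment is None:
            self._experiment = {"project": self.project}
        return self._experiment
```

Before the linked commit the run was effectively created up front; after it, anything that needs the run (like wandb.define_metric) only works once something has touched the experiment property.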

So this does indeed need fixing. Thanks for raising this issue, as it will likely affect anyone making a fresh install. I think that letting the logger do the wandb.init is still a better idea than an explicit wandb.init call. A nice way to do this would be to set up the metrics through a logger.experiment call (as that will then make sure `wandb.init` is called). Will take a look at this in a bit!
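A hedged sketch of that approach: pass the logger into the metric setup and go through its experiment property, which lazily triggers wandb.init on first access. The metric name and summary here are hypothetical placeholders; the real definitions live in neural_lam/utils.py.

```python
def init_wandb_metrics(wandb_logger):
    """Set up wandb metrics through the logger so wandb.init is triggered.

    Accessing `wandb_logger.experiment` makes the WandbLogger call
    wandb.init if it has not run yet, so define_metric is safe afterwards.
    """
    experiment = wandb_logger.experiment  # lazily calls wandb.init
    experiment.define_metric("val_mean_loss", summary="min")  # hypothetical metric name
```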

mpvginde commented 7 months ago

Indeed, I'm using Lightning 2.1.0. You might specify pytorch-lightning>=2.0.3,<=2.0.9 in the README as a temporary fix. I will check whether downgrading fixes the error. Thanks for the detective work.
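As a requirements-style pin (assuming the dependency is installed via pip; the exact package name used in neural-lam's install instructions should be checked), the temporary constraint would look like:

```
pytorch-lightning>=2.0.3,<=2.0.9
```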

joeloskarsson commented 7 months ago

Should be fixed in https://github.com/joeloskarsson/neural-lam/commit/9912ece7f54a14b3cdfbad1735e460d2bd392dfc now, so a git pull should be enough :smile: Let me know if not.