wandb / wandb

The AI developer platform. Use Weights & Biases to train and fine-tune models, and manage models from experimentation to production.
https://wandb.ai
MIT License

WandbCallback fails on TPU[CLI] #2672

Closed: harveenchadha closed this issue 1 year ago

harveenchadha commented 3 years ago

Description: Adding the Keras WandbCallback fails and throws an error.

[Screenshot from 2021-09-15 21-32-25 showing the error]

vanpelt commented 3 years ago

@harveenchadha can you confirm what version of wandb you're using by running wandb --version?

harveenchadha commented 3 years ago

I am on the latest version:

0.12.2

vanpelt commented 3 years ago

@harveenchadha if you're able to share a colab that can reproduce this, that would really help us get to the bottom of it. We think it's because the summary is too large for us to sync within the time we currently allow. The next release of the library should increase this timeout, but if we can reproduce it ourselves before then, we can likely get you a workaround.

vanpelt commented 3 years ago

My email is vanpelt@wandb.com if you want to share a private colab.

harveenchadha commented 3 years ago

I can give you a public Kaggle kernel.

Here you will find the error.

If something is wrong with my code structure, do let me know!

vanpelt commented 3 years ago

Hey @harveenchadha, we'll need to look into a fix for this ahead of the next release. Until then, I think the simplest solution would be to use the Keras TensorBoard callback and just add sync_tensorboard=True to wandb.init.

github-actions[bot] commented 2 years ago

This issue is stale because it has been open 60 days with no activity.

MiniXC commented 1 year ago

Is this still an issue? I'm running into a similar problem on TPU (when using wandb with PyTorch Lightning):

Exception in device=TPU:0: problem                                                                                                                                                            
Traceback (most recent call last):                                                                                                                                                            
  File "/usr/local/lib/python3.8/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 331, in _mp_start_fn                                                                       
    _start_fn(index, pf_cfg, fn, args)                                                                                                                                                        
  File "/usr/local/lib/python3.8/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 325, in _start_fn                                                                          
    fn(gindex, *args)                                                                                                                                                                         
  File "/home/christoph.minixhofer/lightning/src/pytorch_lightning/strategies/launchers/xla.py", line 100, in _wrapping_function                                                              
    results = function(*args, **kwargs)                                                                                                                                                       
  File "/home/christoph.minixhofer/lightning/src/pytorch_lightning/trainer/trainer.py", line 644, in _fit_impl                                                                                
    self._run(model, ckpt_path=self.ckpt_path)                                                                                                                                                
  File "/home/christoph.minixhofer/lightning/src/pytorch_lightning/trainer/trainer.py", line 1085, in _run                                                                                    
    self._log_hyperparams()                                                                                                                                                                   
  File "/home/christoph.minixhofer/lightning/src/pytorch_lightning/trainer/trainer.py", line 1153, in _log_hyperparams                                                                        
    logger.log_hyperparams(hparams_initial)                                                                                                                                                   
  File "/home/christoph.minixhofer/.local/lib/python3.8/site-packages/lightning_utilities/core/rank_zero.py", line 24, in wrapped_fn                                                          
    return fn(*args, **kwargs)                                                                                                                                                                
  File "/home/christoph.minixhofer/lightning/src/pytorch_lightning/loggers/wandb.py", line 426, in log_hyperparams                                                                            
    self.experiment.config.update(params, allow_val_change=True)                                                                                                                              
  File "/home/christoph.minixhofer/.local/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 253, in wrapper                                                                           
    raise e                                                                                                                                                                                   
  File "/home/christoph.minixhofer/.local/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 248, in wrapper                                                                           
    wandb._attach(run=self)                                                                                                                                                                   
  File "/home/christoph.minixhofer/.local/lib/python3.8/site-packages/wandb/sdk/wandb_init.py", line 841, in _attach                                                                          
    raise UsageError("problem")                                                                                                                                                               
wandb.errors.UsageError: problem

mx781 commented 1 year ago

@MiniXC and anyone else stumbling upon this (as it seems to be the lone search result for the mysterious "wandb.errors.UsageError: problem") - upgrading pytorch lightning to 1.9.0 and wandb to 0.13.10 solved this issue for me.

anmolmann commented 1 year ago

Hi @MiniXC, I wanted to follow up on this request. Did you find the solution provided by @mx781 helpful? Please let us know if we can be of further assistance or if your issue has been resolved.

anmolmann commented 1 year ago

Hi @MiniXC , since we have not heard back from you we are going to close this request. If you would like to re-open the conversation, please let us know!