jdb78 / pytorch-forecasting

Time series forecasting with PyTorch
https://pytorch-forecasting.readthedocs.io/
MIT License

Error when Using Distributed GPU Processing #103

Open AlexMRuch opened 3 years ago

AlexMRuch commented 3 years ago

When I initialize my TFT trainer to use multiple GPUs

# Configure network and trainer
pl.seed_everything(407)
trainer = pl.Trainer(
    gpus = [0, 1],
    gradient_clip_val = 0.1  # hyperparam to prevent gradient divergence for RNNs
)

tft = TemporalFusionTransformer.from_dataset(
    training,
    # not meaningful for finding the learning rate but otherwise very important
    learning_rate = 0.03,
    hidden_size = 16,  # most important hyperparameter apart from learning rate
    # number of attention heads. Set to up to 4 for large datasets
    attention_head_size = 1,
    dropout = 0.1,  # between 0.1 and 0.3 are good values
    hidden_continuous_size = 8,  # set to <= hidden_size
    output_size = 7,  # 7 quantiles by default
    loss = QuantileLoss(),
    # reduce learning rate if no improvement in validation loss after x epochs
    reduce_on_plateau_patience = 4,
)
print(f"Number of parameters in network: {tft.size()/1e3:.1f}k")

The library recognizes that I'm using both GPUs:

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
CUDA_VISIBLE_DEVICES: [0,1]
Number of parameters in network: 23.4k

However, when I try to find the optimal learning rate

# Find optimal learning rate
res = trainer.lr_find(
    tft,
    train_dataloader = train_dataloader,
    val_dataloaders = val_dataloader,
    max_lr = 10.,
    min_lr = 1e-6,
)

print(f"Suggested learning rate: {res.suggestion()}")
fig = res.plot(show = True, suggest = True)
fig.show()

I get an AttributeError: Can't pickle local object '_apply_to_outputs.<locals>.decorator_fn.<locals>.new_func' error with the following trace:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-29-01060df08a43> in <module>
      1 # Find optimal learning rate
----> 2 res = trainer.lr_find(
      3     tft,
      4     train_dataloader = train_dataloader,
      5     val_dataloaders = val_dataloader,

~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_lightning/trainer/lr_finder.py in lr_find(self, model, train_dataloader, val_dataloaders, min_lr, max_lr, num_training, mode, early_stop_threshold)
    198 
    199         # Fit, lr & loss logged in callback
--> 200         self.fit(model,
    201                  train_dataloader=train_dataloader,
    202                  val_dataloaders=val_dataloaders)

~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_lightning/trainer/states.py in wrapped_fn(self, *args, **kwargs)
     46             if entering is not None:
     47                 self.state = entering
---> 48             result = fn(self, *args, **kwargs)
     49 
     50             # The INTERRUPTED state can be set inside the run function. To indicate that run was interrupted

~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py in fit(self, model, train_dataloader, val_dataloaders, datamodule)
   1050             self.accelerator_backend = DDPSpawnBackend(self)
   1051             self.accelerator_backend.setup()
-> 1052             self.accelerator_backend.train(model, nprocs=self.num_processes)
   1053             results = self.accelerator_backend.teardown(model)
   1054 

~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp_spawn_backend.py in train(self, model, nprocs)
     41 
     42     def train(self, model, nprocs):
---> 43         mp.spawn(self.ddp_train, nprocs=nprocs, args=(self.mp_queue, model,))
     44 
     45     def teardown(self, model):

~/anaconda3/envs/forecasting/lib/python3.8/site-packages/torch/multiprocessing/spawn.py in spawn(fn, args, nprocs, join, daemon)
    160             daemon=daemon,
    161         )
--> 162         process.start()
    163         error_queues.append(error_queue)
    164         processes.append(process)

~/anaconda3/envs/forecasting/lib/python3.8/multiprocessing/process.py in start(self)
    119                'daemonic processes are not allowed to have children'
    120         _cleanup()
--> 121         self._popen = self._Popen(self)
    122         self._sentinel = self._popen.sentinel
    123         # Avoid a refcycle if the target function holds an indirect

~/anaconda3/envs/forecasting/lib/python3.8/multiprocessing/context.py in _Popen(process_obj)
    282         def _Popen(process_obj):
    283             from .popen_spawn_posix import Popen
--> 284             return Popen(process_obj)
    285 
    286     class ForkServerProcess(process.BaseProcess):

~/anaconda3/envs/forecasting/lib/python3.8/multiprocessing/popen_spawn_posix.py in __init__(self, process_obj)
     30     def __init__(self, process_obj):
     31         self._fds = []
---> 32         super().__init__(process_obj)
     33 
     34     def duplicate_for_child(self, fd):

~/anaconda3/envs/forecasting/lib/python3.8/multiprocessing/popen_fork.py in __init__(self, process_obj)
     17         self.returncode = None
     18         self.finalizer = None
---> 19         self._launch(process_obj)
     20 
     21     def duplicate_for_child(self, fd):

~/anaconda3/envs/forecasting/lib/python3.8/multiprocessing/popen_spawn_posix.py in _launch(self, process_obj)
     45         try:
     46             reduction.dump(prep_data, fp)
---> 47             reduction.dump(process_obj, fp)
     48         finally:
     49             set_spawning_popen(None)

~/anaconda3/envs/forecasting/lib/python3.8/multiprocessing/reduction.py in dump(obj, file, protocol)
     58 def dump(obj, file, protocol=None):
     59     '''Replacement for pickle.dump() using ForkingPickler.'''
---> 60     ForkingPickler(file, protocol).dump(obj)
     61 
     62 #

AttributeError: Can't pickle local object '_apply_to_outputs.<locals>.decorator_fn.<locals>.new_func'

Any idea what may be triggering this? My guess is that because I'm not distributing across multiple machines, the pickling is getting messed up. That's fine and just means I misunderstood that setting for distributed_backend; moving on, however, I hit errors with the other distributed_backend settings as well.
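
For reference, the trace above goes through DDPSpawnBackend, which pickles the model to hand it to spawned worker processes. A minimal sketch of setting the backend explicitly instead of relying on that default (reusing the training, tft, and dataloader objects defined above; whether dp or ddp is the right choice is exactly what the rest of this thread works through):

# Configure trainer with an explicit multi-GPU backend
pl.seed_everything(407)
trainer = pl.Trainer(
    gpus = [0, 1],
    distributed_backend = "dp",  # or "ddp"; avoids the default spawn-based backend that pickles the model
    gradient_clip_val = 0.1,
)

res = trainer.lr_find(
    tft,
    train_dataloader = train_dataloader,
    val_dataloaders = val_dataloader,
    max_lr = 10.,
    min_lr = 1e-6,
)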

Following https://pytorch-lightning.readthedocs.io/en/latest/multi_gpu.html#distributed-modes, when I hard-code distributed_backend to ddp2, I get this trace

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp2_backend.py in _resolve_task_idx(self)
     52             try:
---> 53                 self.task_idx = int(os.environ['LOCAL_RANK'])
     54             except Exception as e:

~/anaconda3/envs/forecasting/lib/python3.8/os.py in __getitem__(self, key)
    674             # raise KeyError with the original key value
--> 675             raise KeyError(key) from None
    676         return self.decodevalue(value)

KeyError: 'LOCAL_RANK'

During handling of the above exception, another exception occurred:

MisconfigurationException                 Traceback (most recent call last)
<ipython-input-29-01060df08a43> in <module>
      1 # Find optimal learning rate
----> 2 res = trainer.lr_find(
      3     tft,
      4     train_dataloader = train_dataloader,
      5     val_dataloaders = val_dataloader,

~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_lightning/trainer/lr_finder.py in lr_find(self, model, train_dataloader, val_dataloaders, min_lr, max_lr, num_training, mode, early_stop_threshold)
    198 
    199         # Fit, lr & loss logged in callback
--> 200         self.fit(model,
    201                  train_dataloader=train_dataloader,
    202                  val_dataloaders=val_dataloaders)

~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_lightning/trainer/states.py in wrapped_fn(self, *args, **kwargs)
     46             if entering is not None:
     47                 self.state = entering
---> 48             result = fn(self, *args, **kwargs)
     49 
     50             # The INTERRUPTED state can be set inside the run function. To indicate that run was interrupted

~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py in fit(self, model, train_dataloader, val_dataloaders, datamodule)
   1033         if self.use_ddp2:
   1034             self.accelerator_backend = DDP2Backend(self)
-> 1035             self.accelerator_backend.setup()
   1036             self.accelerator_backend.train(model)
   1037 

~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp2_backend.py in setup(self)
     43 
     44     def setup(self):
---> 45         self._resolve_task_idx()
     46 
     47     def _resolve_task_idx(self):

~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp2_backend.py in _resolve_task_idx(self)
     54             except Exception as e:
     55                 m = 'ddp2 only works in SLURM or via torchelastic with the WORLD_SIZE, LOCAL_RANK, GROUP_RANK flags'
---> 56                 raise MisconfigurationException(m)
     57 
     58     def train(self, model):

MisconfigurationException: ddp2 only works in SLURM or via torchelastic with the WORLD_SIZE, LOCAL_RANK, GROUP_RANK flags

and when I hard-code distributed_backend to dp (which is what I would expect to work most readily), I get

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-29-01060df08a43> in <module>
      1 # Find optimal learning rate
----> 2 res = trainer.lr_find(
      3     tft,
      4     train_dataloader = train_dataloader,
      5     val_dataloaders = val_dataloader,

~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_lightning/trainer/lr_finder.py in lr_find(self, model, train_dataloader, val_dataloaders, min_lr, max_lr, num_training, mode, early_stop_threshold)
    198 
    199         # Fit, lr & loss logged in callback
--> 200         self.fit(model,
    201                  train_dataloader=train_dataloader,
    202                  val_dataloaders=val_dataloaders)

~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_lightning/trainer/states.py in wrapped_fn(self, *args, **kwargs)
     46             if entering is not None:
     47                 self.state = entering
---> 48             result = fn(self, *args, **kwargs)
     49 
     50             # The INTERRUPTED state can be set inside the run function. To indicate that run was interrupted

~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py in fit(self, model, train_dataloader, val_dataloaders, datamodule)
   1062             self.accelerator_backend = DataParallelBackend(self)
   1063             self.accelerator_backend.setup(model)
-> 1064             results = self.accelerator_backend.train()
   1065             self.accelerator_backend.teardown()
   1066 

~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_lightning/accelerators/dp_backend.py in train(self)
     95     def train(self):
     96         model = self.trainer.model
---> 97         results = self.trainer.run_pretrain_routine(model)
     98         return results
     99 

~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py in run_pretrain_routine(self, model)
   1222 
   1223         # run a few val batches before training starts
-> 1224         self._run_sanity_check(ref_model, model)
   1225 
   1226         # clear cache before training

~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py in _run_sanity_check(self, ref_model, model)
   1255             num_loaders = len(self.val_dataloaders)
   1256             max_batches = [self.num_sanity_val_steps] * num_loaders
-> 1257             eval_results = self._evaluate(model, self.val_dataloaders, max_batches, False)
   1258 
   1259             # allow no returns from eval

~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_lightning/trainer/evaluation_loop.py in _evaluate(self, model, dataloaders, max_batches, test_mode)
    394         # ---------------------
    395         using_eval_result = len(outputs) > 0 and len(outputs[0]) > 0 and isinstance(outputs[0][0], EvalResult)
--> 396         eval_results = self.__run_eval_epoch_end(test_mode, outputs, dataloaders, using_eval_result)
    397 
    398         # log callback metrics

~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_lightning/trainer/evaluation_loop.py in __run_eval_epoch_end(self, test_mode, outputs, dataloaders, using_eval_result)
    488                     eval_results = self.__gather_epoch_end_eval_results(outputs)
    489 
--> 490                 eval_results = model.validation_epoch_end(eval_results)
    491                 user_reduced = True
    492 

~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_forecasting/models/base_model.py in validation_epoch_end(self, outputs)
    142 
    143     def validation_epoch_end(self, outputs):
--> 144         log, _ = self.epoch_end(outputs, label="val")
    145         return log
    146 

~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_forecasting/models/temporal_fusion_transformer/__init__.py in epoch_end(self, outputs, label)
    611         run at epoch end for training or validation
    612         """
--> 613         log, out = super().epoch_end(outputs, label=label)
    614         if self.log_interval(label == "train") > 0:
    615             self._log_interpretation(out, label=label)

~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_forecasting/models/base_model.py in epoch_end(self, outputs, label)
    245             outputs = [out["callback_metrics"] for out in outputs]
    246         # log average loss and metrics
--> 247         n_samples = sum([x["n_samples"] for x in outputs])
    248         avg_loss = torch.stack([x[f"{label}_loss"] * x["n_samples"] / n_samples for x in outputs]).sum()
    249         log_keys = outputs[0]["log"].keys()

TypeError: unsupported operand type(s) for +: 'int' and 'list'

When I use ddp (as recommended for PyTorch, given the speedup), the pipeline freezes, and running watch nvidia-smi in a terminal shows the GPUs are idle and not loading any memory for processing.

This error is thrown with the same setup I had in #85, which I got working on a single GPU. Now that I'm modeling multivariate time series across all 50 states, though, I'd really like to use both my GPUs to speed up the runtime.

Thanks!

AlexMRuch commented 3 years ago

Follow-up: the freezing with ddp isn't limited to the learning rate finder; it also happens with the main training call:

# Stop training, when loss metric does not improve on validation set
early_stop_callback = EarlyStopping(
    monitor = "val_loss",
    min_delta = 1e-4,
    patience = 10,
    verbose = False,
    mode = "min"
)
lr_logger = LearningRateLogger()  # log the learning rate
logger = TensorBoardLogger("lightning_logs")  # log to tensorboard

# Update trainer
trainer = pl.Trainer(
    max_epochs = 500,
    gpus = [0,1],
    distributed_backend = 'ddp',  # https://pytorch-lightning.readthedocs.io/en/latest/multi_gpu.html#distributed-modes
    weights_summary = "top",
    gradient_clip_val = 0.1,
    early_stop_callback = early_stop_callback,
    #limit_train_batches = 20,  # comment in for training, running validation every 20 batches
    #fast_dev_run=True,  # comment in to check that network or dataset has no serious bugs
    callbacks = [lr_logger],
    logger = logger
)

# Update model
tft = TemporalFusionTransformer.from_dataset(
    training,
    learning_rate = res.suggestion(), #use res.suggestion() or manual input from lr_finder above (e.g., 0.03)
    hidden_size = 16,  # biggest influence network size
    attention_head_size = 1,
    dropout = 0.1,
    hidden_continuous_size = 8,
    output_size = 7,  # QuantileLoss has 7 quantiles by default
    loss=QuantileLoss(),
    log_interval = 10,  # log example every 10 batches
    reduce_on_plateau_patience = 4,  # reduce learning automatically
)
print(f"Number of parameters in transformer model: {tft.size()/1e3:.1f}k")

# Train model
trainer.fit(
    tft,
    train_dataloader = train_dataloader,
    val_dataloaders = val_dataloader
)

Also, when I use ddp in that context, only 1 of my 2 GPUs is recognized by the system, even when setting gpus = 2 or gpus = [0,1]:

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
CUDA_VISIBLE_DEVICES: [0]
Number of parameters in transformer model: 23.4k

jdb78 commented 3 years ago

Interesting, I have not tried that, yet. Let me look into that pickling error. It might be the main issue.

jdb78 commented 3 years ago

Could you try with the newest master? I hope some of the issues are fixed there directly. Maybe log_interval=-1 will also help.
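
A minimal sketch of trying both suggestions (the install command is an assumption on my part, and the model arguments are the ones from the original post with log_interval=-1 added to disable example logging):

# install the current master (assumed command)
# pip install git+https://github.com/jdb78/pytorch-forecasting.git

tft = TemporalFusionTransformer.from_dataset(
    training,
    learning_rate = 0.03,
    hidden_size = 16,
    attention_head_size = 1,
    dropout = 0.1,
    hidden_continuous_size = 8,
    output_size = 7,
    loss = QuantileLoss(),
    log_interval = -1,  # suggested workaround: disable logging of example predictions
    reduce_on_plateau_patience = 4,
)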

AlexMRuch commented 3 years ago

Thanks for all these updates!

With pytorch-forecasting=0.5.2, torch.cuda.device_count() returns 2 (the expected value); however, now pl.Trainer(gpus = [0,1]) and pl.Trainer(gpus = 2) both return

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

Running pl.Trainer(gpus=2, distributed_backend='ddp') shows the same problem. Really weird...

I think this is because of the recent upgrade to pytorch-lightning > 1.0, as their docs for Trainer have updated: https://pytorch-lightning.readthedocs.io/en/latest/trainer.html.

On the other hand, when I open a plain Python session, load only pytorch-lightning, and run the code above directly, it works:

Python 3.8.5 (default, Sep  4 2020, 07:30:14) 
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pytorch_lightning as pl
>>> pl.Trainer(gpus=2, distributed_backend='ddp')
/home/amruch/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_lightning/utilities/distributed.py:45: DeprecationWarning: distributed_backend has been renamed to accelerator. Deprecated in 1.0.0, will be removed in 1.2.0
  warnings.warn(*args, **kwargs)
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
<pytorch_lightning.trainer.trainer.Trainer object at 0x7f7a2d7bf340>

When I try to initialize the Trainer object right at the very beginning of the notebook, it loads fine but throws a deprecation warning:

/home/amruch/anaconda3/envs/forecasting/lib/python3.8/site-packages/ipykernel/ipkernel.py:287: DeprecationWarning: `should_run_async` will not call `transform_cell` automatically in the future. Please pass the result to `transformed_cell` argument and any exception that happen during thetransform in `preprocessing_exc_tuple` in IPython 7.17 and above.
  and should_run_async(code)
/home/amruch/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_lightning/utilities/distributed.py:45: DeprecationWarning: distributed_backend has been renamed to accelerator. Deprecated in 1.0.0, will be removed in 1.2.0
  warnings.warn(*args, **kwargs)
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]

I'll play around with the notebook and try to track down what's going on and will update you.

AlexMRuch commented 3 years ago

I don't know what kind of wizardry just happened, but after restarting my notebook the issue went away – pl.Trainer worked just fine 🤯

For good measure I deleted the whole environment and recreated it from scratch.

Now, however, when I use multiple GPUs for hyperparameter optimization, the notebook freezes. Also, for whatever reason the trainer reports that both GPUs are visible, but only one of them is actually used for the hyperparameter optimization.

In my terminal window that's running Jupyter Lab, I get

Traceback (most recent call last):
  File "/home/amruch/anaconda3/envs/forecasting/lib/python3.8/site-packages/ipykernel_launcher.py", line 16, in <module>
    app.launch_new_instance()
  File "/home/amruch/anaconda3/envs/forecasting/lib/python3.8/site-packages/traitlets/config/application.py", line 844, in launch_instance
    app.initialize(argv)
  File "/home/amruch/anaconda3/envs/forecasting/lib/python3.8/site-packages/traitlets/config/application.py", line 87, in inner
    return method(app, *args, **kwargs)
  File "/home/amruch/anaconda3/envs/forecasting/lib/python3.8/site-packages/ipykernel/kernelapp.py", line 567, in initialize
    self.init_sockets()
  File "/home/amruch/anaconda3/envs/forecasting/lib/python3.8/site-packages/ipykernel/kernelapp.py", line 271, in init_sockets
    self.shell_port = self._bind_socket(self.shell_socket, self.shell_port)
  File "/home/amruch/anaconda3/envs/forecasting/lib/python3.8/site-packages/ipykernel/kernelapp.py", line 218, in _bind_socket
    return self._try_bind_socket(s, port)
  File "/home/amruch/anaconda3/envs/forecasting/lib/python3.8/site-packages/ipykernel/kernelapp.py", line 194, in _try_bind_socket
    s.bind("tcp://%s:%i" % (self.ip, port))
  File "zmq/backend/cython/socket.pyx", line 550, in zmq.backend.cython.socket.Socket.bind
  File "zmq/backend/cython/checkrc.pxd", line 26, in zmq.backend.cython.checkrc._check_rc
zmq.error.ZMQError: Address already in use

I haven't gotten this before, and it seems to be related to ipykernel rather than directly to pytorch-forecasting, so it's odd that it would break the whole pytorch-forecasting run unless I'm missing something...

Also, when I set gpus=[0], this goes away and the script runs smoothly (but only with 1 GPU, so still no luck with multi-GPU).

I'm still playing around and am trying to figure out what's going on.

AlexMRuch commented 3 years ago

Oh – nope, got the "only recognizing one GPU" issue again.

Because I had issues with using two GPUs, I ran the learning rate finder with just 1 GPU (ran fine) and then tried to use 2 GPUs for training:

# Stop training, when loss metric does not improve on validation set
early_stop_callback = EarlyStopping(
    monitor = "val_loss",
    min_delta = 1e-4,
    patience = 10,
    verbose = False,
    mode = "min"
)
lr_logger = LearningRateMonitor()  # log the learning rate
logger = TensorBoardLogger("lightning_logs")  # log to tensorboard

# Update trainer
trainer = pl.Trainer(
    max_epochs = 500,
    gpus = 2, #use 2 for 2 GPUs
    distributed_backend = 'ddp',  # https://pytorch-lightning.readthedocs.io/en/latest/multi_gpu.html#distributed-modes
    weights_summary = "top",
    gradient_clip_val = 0.1,
    #limit_train_batches = 20,  # comment in for training, running validation every 20 batches
    #fast_dev_run=True,  # comment in to check that network or dataset has no serious bugs
    callbacks = [lr_logger, early_stop_callback],
    logger = logger
)

# Update model
tft = TemporalFusionTransformer.from_dataset(
    training,
    learning_rate = res.suggestion(), #use res.suggestion() or manual input from lr_finder above (e.g., 0.03)
    hidden_size = 16,  # biggest influence network size
    attention_head_size = 1,
    dropout = 0.1,
    hidden_continuous_size = 8,
    output_size = 7,  # QuantileLoss has 7 quantiles by default
    loss=QuantileLoss(),
    log_interval = 10,  # log example every 10 batches
    reduce_on_plateau_patience = 4,  # reduce learning automatically
)
print(f"Number of parameters in transformer model: {tft.size()/1e3:.1f}k")

But this returned

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Number of parameters in transformer model: 23.4k

Also, once I've run that bit, pl.Trainer(gpus = 2) only reports 1 GPU 🤯

Weirder yet is what the screenshot shows: once I run pl.Trainer(gpus = X), it won't let me change the number of GPUs (X) afterwards.

So if I use 1 GPU for the learning rate finder I can't jump up to use 2 GPUs for full training.
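
One quick thing to check before re-running (an assumption on my part: this version of pytorch-lightning appears to set CUDA_VISIBLE_DEVICES for the devices it is given, so a later Trainer in the same kernel may only see whatever the first one exposed):

import os
import torch

# if an earlier Trainer(gpus=[0]) in this kernel set this to "0", later Trainers
# in the same process will only be able to see that single device
print(os.environ.get("CUDA_VISIBLE_DEVICES"))
print(torch.cuda.device_count())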

AlexMRuch commented 3 years ago

Unrelated bug 🐛, but I now also get a Missing logger folder: lightning_logs/default error when I run

trainer.fit(
    tft,
    train_dataloader = train_dataloader,
    val_dataloaders = val_dataloader
)

This folder used to be created automatically, but now I have to create it manually.
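
A minimal workaround sketch, creating the folder by hand before calling trainer.fit (the path is the one from the warning above):

import os

os.makedirs("lightning_logs/default", exist_ok=True)  # pre-create the folder the logger expects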

AlexMRuch commented 3 years ago

Confirmed that the 2-GPU training still fails even when only training is run (the learning rate finder is skipped and a learning rate of 0.03 is used); see the screenshot. Note that when I run it here, my terminal running Jupyter Lab does not throw the same ipykernel error noted above.

jdb78 commented 3 years ago

Hm. Can you run it as a script? Maybe it is an IPython issue after all. Sorry, I do not have 2 GPUs readily at hand to debug the issue. I have to say, your issues are really helpful though!! Much appreciated!
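
For reference, a minimal script skeleton for that test (a sketch only: build_dataloaders() is a hypothetical helper standing in for the dataset/dataloader construction from the notebook, and the if __name__ == "__main__" guard matters because ddp launches additional worker processes that re-import the script):

import pytorch_lightning as pl
from pytorch_forecasting import TemporalFusionTransformer
from pytorch_forecasting.metrics import QuantileLoss


def main():
    # hypothetical helper standing in for the notebook's data preparation
    training, train_dataloader, val_dataloader = build_dataloaders()

    trainer = pl.Trainer(
        max_epochs = 500,
        gpus = 2,
        distributed_backend = "ddp",
        gradient_clip_val = 0.1,
    )
    tft = TemporalFusionTransformer.from_dataset(
        training,
        learning_rate = 0.03,
        hidden_size = 16,
        attention_head_size = 1,
        dropout = 0.1,
        hidden_continuous_size = 8,
        output_size = 7,
        loss = QuantileLoss(),
        reduce_on_plateau_patience = 4,
    )
    trainer.fit(tft, train_dataloader = train_dataloader, val_dataloaders = val_dataloader)


if __name__ == "__main__":  # required for ddp, which spawns worker processes that import this file
    main()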

AlexMRuch commented 3 years ago

Yeah, I was planning to convert it to a script soon anyway, since I've done enough testing to get the notebook working with multiple variables across all states. I'll let you know how that goes.

By the way, for what it's worth, it seems like the issue may actually be due to pyzmq.

Also, after letting the notebook sit on the [*] running indicator for what felt like 5+ minutes with no sign of life, it did eventually print

initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/2

I'm going to let it sit and see if we get to 2/2 and if it actually runs on both GPUs. Really odd that it takes so long for it to do the multiprocessing.

I may also post some of these issues on the pytorch-lightning channel since they're not directly related to pytorch-forecasting (e.g., there's nothing really you can do about the pl.Trainer bit).

I'll let you know if the multi-GPU eventually kicks off!

Glad to hear the posts have been helpful! Thanks for your feedback and support on these!

AlexMRuch commented 3 years ago

Okay, so the issue was having distributed_backend=ddp – it should have been dp (it runs right away on both GPUs with dp) 🤦. I misread the note in https://pytorch-lightning.readthedocs.io/en/latest/multi_gpu.html#distributed-modes (I had taken ddp to be multi-GPU across multiple machines, but the note recommending ddp over dp threw me off, and I should have read it in more detail).

That being said, I now get this tensor shape error (which I confirmed only happens with multi-GPU runs and does not happen when gpus=1):

# Stop training, when loss metric does not improve on validation set
early_stop_callback = EarlyStopping(
    monitor = "val_loss",
    min_delta = 1e-4,
    patience = 10,
    verbose = False,
    mode = "min"
)
lr_logger = LearningRateMonitor()  # log the learning rate
logger = TensorBoardLogger("lightning_logs")  # logging results to a tensorboard

# Update trainer
trainer = pl.Trainer(
    max_epochs = 500,
    gpus = 2,
    distributed_backend = 'dp', # https://pytorch-lightning.readthedocs.io/en/latest/multi_gpu.html#distributed-modes
    weights_summary = "top",
    gradient_clip_val = 0.1,
    #limit_train_batches = 20,  # comment in for training, running validation every 20 batches
    #fast_dev_run=True,  # comment in to check that network or dataset has no serious bugs
    callbacks = [lr_logger, early_stop_callback],
    logger = logger
)

# Update model
tft = TemporalFusionTransformer.from_dataset(
    training,
    #learning_rate = res.suggestion(), #use res.suggestion() or manual input from lr_finder above (e.g., 0.03)
    learning_rate = 0.03, #use res.suggestion() or manual input from lr_finder above (e.g., 0.03)
    hidden_size = 16,  # biggest influence network size
    attention_head_size = 1,
    dropout = 0.1,
    hidden_continuous_size = 8,
    output_size = 7,  # QuantileLoss has 7 quantiles by default
    loss=QuantileLoss(),
    log_interval = 10,  # log example every 10 batches
    reduce_on_plateau_patience = 4,  # reduce learning automatically
)
print(f"Number of parameters in transformer model: {tft.size()/1e3:.1f}k")

# Train model
trainer.fit(
    tft,
    train_dataloader = train_dataloader,
    val_dataloaders = val_dataloader
)

now throws

   | Name                               | Type                            | Params
----------------------------------------------------------------------------------------
0  | loss                               | QuantileLoss                    | 0     
1  | logging_metrics                    | ModuleList                      | 0     
2  | input_embeddings                   | MultiEmbedding                  | 21    
3  | prescalers                         | ModuleDict                      | 192   
4  | static_variable_selection          | VariableSelectionNetwork        | 1 K   
5  | encoder_variable_selection         | VariableSelectionNetwork        | 6 K   
6  | decoder_variable_selection         | VariableSelectionNetwork        | 1 K   
7  | static_context_variable_selection  | GatedResidualNetwork            | 1 K   
8  | static_context_initial_hidden_lstm | GatedResidualNetwork            | 1 K   
9  | static_context_initial_cell_lstm   | GatedResidualNetwork            | 1 K   
10 | static_context_enrichment          | GatedResidualNetwork            | 1 K   
11 | lstm_encoder                       | LSTM                            | 2 K   
12 | lstm_decoder                       | LSTM                            | 2 K   
13 | post_lstm_gate_encoder             | GatedLinearUnit                 | 544   
14 | post_lstm_add_norm_encoder         | AddNorm                         | 32    
15 | static_enrichment                  | GatedResidualNetwork            | 1 K   
16 | multihead_attn                     | InterpretableMultiHeadAttention | 1 K   
17 | post_attn_gate_norm                | GateAddNorm                     | 576   
18 | pos_wise_ff                        | GatedResidualNetwork            | 1 K   
19 | pre_output_gate_norm               | GateAddNorm                     | 576   
20 | output_layer                       | Linear                          | 119   
Epoch 0: 0%

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-29-14fda4f79b4a> in <module>
      1 # Train model
----> 2 trainer.fit(
      3     tft,
      4     train_dataloader = train_dataloader,
      5     val_dataloaders = val_dataloader

~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py in fit(self, model, train_dataloader, val_dataloaders, datamodule)
    438         self.call_hook('on_fit_start')
    439 
--> 440         results = self.accelerator_backend.train()
    441         self.accelerator_backend.teardown()
    442 

~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_lightning/accelerators/dp_accelerator.py in train(self)
     95 
     96         # train or test
---> 97         results = self.train_or_test()
     98 
     99         return results

~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py in train_or_test(self)
     64             results = self.trainer.run_test()
     65         else:
---> 66             results = self.trainer.train()
     67         return results
     68 

~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py in train(self)
    481 
    482                 # run train epoch
--> 483                 self.train_loop.run_training_epoch()
    484 
    485                 if self.max_steps and self.max_steps <= self.global_step:

~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py in run_training_epoch(self)
    539             # TRAINING_STEP + TRAINING_STEP_END
    540             # ------------------------------------
--> 541             batch_output = self.run_training_batch(batch, batch_idx, dataloader_idx)
    542 
    543             # when returning -1 from train_step, we end epoch early

~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py in run_training_batch(self, batch, batch_idx, dataloader_idx)
    671                 # calculate loss (train step + train step end)
    672                 # -------------------
--> 673                 opt_closure_result = self.training_step_and_backward(
    674                     split_batch,
    675                     batch_idx,

~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py in training_step_and_backward(self, split_batch, batch_idx, opt_idx, optimizer, hiddens)
    758         """
    759         # lightning module hook
--> 760         result = self.training_step(split_batch, batch_idx, opt_idx, hiddens)
    761 
    762         if result is None:

~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py in training_step(self, split_batch, batch_idx, opt_idx, hiddens)
    302         with self.trainer.profiler.profile('model_forward'):
    303             args = self.build_train_args(split_batch, batch_idx, opt_idx, hiddens)
--> 304             training_step_output = self.trainer.accelerator_backend.training_step(args)
    305             training_step_output = self.trainer.call_hook('training_step_end', training_step_output)
    306 

~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_lightning/accelerators/dp_accelerator.py in training_step(self, args)
    108                 output = self.trainer.model(*args)
    109         else:
--> 110             output = self.trainer.model(*args)
    111         return output
    112 

~/anaconda3/envs/forecasting/lib/python3.8/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    530             result = self._slow_forward(*input, **kwargs)
    531         else:
--> 532             result = self.forward(*input, **kwargs)
    533         for hook in self._forward_hooks.values():
    534             hook_result = hook(self, input, result)

~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_lightning/overrides/data_parallel.py in forward(self, *inputs, **kwargs)
     85 
     86         replicas = self.replicate(self.module, self.device_ids[:len(inputs)])
---> 87         outputs = self.parallel_apply(replicas, inputs, kwargs)
     88 
     89         if isinstance(outputs[0], Result):

~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_lightning/overrides/data_parallel.py in parallel_apply(self, replicas, inputs, kwargs)
    149 
    150     def parallel_apply(self, replicas, inputs, kwargs):
--> 151         return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
    152 
    153 

~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_lightning/overrides/data_parallel.py in parallel_apply(modules, inputs, kwargs_tup, devices)
    308         output = results[i]
    309         if isinstance(output, Exception):
--> 310             raise output
    311         outputs.append(output)
    312     return outputs

~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_lightning/overrides/data_parallel.py in _worker(i, module, input, kwargs, device)
    261                 # CHANGE
    262                 if module.training:
--> 263                     output = module.training_step(*input, **kwargs)
    264                     fx_called = 'training_step'
    265                 elif module.testing:

~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_forecasting/models/base_model.py in training_step(self, batch, batch_idx)
    128         """
    129         x, y = batch
--> 130         log, _ = self.step(x, y, batch_idx, label="train")
    131         # log loss
    132         self.log("train_loss", log["loss"], on_step=True, on_epoch=True, prog_bar=True)

~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_forecasting/models/temporal_fusion_transformer/__init__.py in step(self, x, y, batch_idx, label)
    568         """
    569         # extract data and run model
--> 570         log, out = super().step(x, y, batch_idx, label=label)
    571         # calculate interpretations etc for latter logging
    572         if self.log_interval(label == "train") > 0:

~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_forecasting/models/base_model.py in step(self, x, y, batch_idx, label)
    204                 )
    205             else:
--> 206                 loss = self.loss(prediction, y)
    207 
    208         # logging losses

~/anaconda3/envs/forecasting/lib/python3.8/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    530             result = self._slow_forward(*input, **kwargs)
    531         else:
--> 532             result = self.forward(*input, **kwargs)
    533         for hook in self._forward_hooks.values():
    534             hook_result = hook(self, input, result)

~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_lightning/metrics/metric.py in forward(self, *args, **kwargs)
    143         """
    144         # add current step
--> 145         self.update(*args, **kwargs)
    146         self._forward_cache = None
    147 

~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_lightning/metrics/metric.py in wrapped_func(*args, **kwargs)
    189         def wrapped_func(*args, **kwargs):
    190             self._computed = None
--> 191             return update(*args, **kwargs)
    192         return wrapped_func
    193 

~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_forecasting/metrics.py in update(self, y_pred, target)
    313             weight = None
    314 
--> 315         losses = self.loss(y_pred, target)
    316         # weight samples
    317         if weight is not None:

~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_forecasting/metrics.py in loss(self, y_pred, target)
    450         losses = []
    451         for i, q in enumerate(self.quantiles):
--> 452             errors = target - y_pred[..., i]
    453             losses.append(torch.max((q - 1) * errors, q * errors).unsqueeze(-1))
    454         losses = torch.cat(losses, dim=2)

RuntimeError: The size of tensor a (30) must match the size of tensor b (43) at non-singleton dimension 1

AlexMRuch commented 3 years ago

Also letting you know that this switch solves the ipykernel issue with pyzmq – probably because ddp was trying to find or send something to a second (non-existent) node on a cluster. So yay for that! Looks like this is back to something that's only related to pytorch-forecasting 😄

AlexMRuch commented 3 years ago

Confirmed that the same error – after the switch to dp – is thrown by the learning rate finder:

# Configure network and trainer
pl.seed_everything(407)
trainer = pl.Trainer(
    gpus = 2, #use 2 for both GPUs
    distributed_backend = 'dp',  # https://pytorch-lightning.readthedocs.io/en/latest/multi_gpu.html#distributed-modes
    gradient_clip_val = 0.1  # hyperparam to prevent gradient divergence for RNNs
)

tft = TemporalFusionTransformer.from_dataset(
    training,
    # not meaningful for finding the learning rate but otherwise very important
    learning_rate = 0.03,
    hidden_size = 16,  # most important hyperparameter apart from learning rate
    # number of attention heads. Set to up to 4 for large datasets
    attention_head_size = 1,
    dropout = 0.1,  # between 0.1 and 0.3 are good values
    hidden_continuous_size = 8,  # set to <= hidden_size
    output_size = 7,  # 7 quantiles by default
    loss = QuantileLoss(),
    # reduce learning rate if no improvement in validation loss after x epochs
    reduce_on_plateau_patience = 4,
)
print(f"Number of parameters in network: {tft.size()/1e3:.1f}k")

# Find optimal learning rate
res = trainer.tuner.lr_find(
    tft,
    train_dataloader = train_dataloader,
    val_dataloaders = val_dataloader,
    max_lr = 10.,
    min_lr = 1e-6,
)

print(f"Suggested learning rate: {res.suggestion()}")
fig = res.plot(show = True, suggest = True)
fig.show()

throws

   | Name                               | Type                            | Params
----------------------------------------------------------------------------------------
0  | loss                               | QuantileLoss                    | 0     
1  | logging_metrics                    | ModuleList                      | 0     
2  | input_embeddings                   | MultiEmbedding                  | 21    
3  | prescalers                         | ModuleDict                      | 192   
4  | static_variable_selection          | VariableSelectionNetwork        | 1 K   
5  | encoder_variable_selection         | VariableSelectionNetwork        | 6 K   
6  | decoder_variable_selection         | VariableSelectionNetwork        | 1 K   
7  | static_context_variable_selection  | GatedResidualNetwork            | 1 K   
8  | static_context_initial_hidden_lstm | GatedResidualNetwork            | 1 K   
9  | static_context_initial_cell_lstm   | GatedResidualNetwork            | 1 K   
10 | static_context_enrichment          | GatedResidualNetwork            | 1 K   
11 | lstm_encoder                       | LSTM                            | 2 K   
12 | lstm_decoder                       | LSTM                            | 2 K   
13 | post_lstm_gate_encoder             | GatedLinearUnit                 | 544   
14 | post_lstm_add_norm_encoder         | AddNorm                         | 32    
15 | static_enrichment                  | GatedResidualNetwork            | 1 K   
16 | multihead_attn                     | InterpretableMultiHeadAttention | 1 K   
17 | post_attn_gate_norm                | GateAddNorm                     | 576   
18 | pos_wise_ff                        | GatedResidualNetwork            | 1 K   
19 | pre_output_gate_norm               | GateAddNorm                     | 576   
20 | output_layer                       | Linear                          | 119   
Finding best initial lr: 0%

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-29-6bbc95624f88> in <module>
      1 # Find optimal learning rate
----> 2 res = trainer.tuner.lr_find(
      3     tft,
      4     train_dataloader = train_dataloader,
      5     val_dataloaders = val_dataloader,

~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_lightning/tuner/tuning.py in lr_find(self, model, train_dataloader, val_dataloaders, min_lr, max_lr, num_training, mode, early_stop_threshold, datamodule)
    115             datamodule: Optional[LightningDataModule] = None
    116     ):
--> 117         return lr_find(
    118             self.trainer,
    119             model,

~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_lightning/tuner/lr_finder.py in lr_find(trainer, model, train_dataloader, val_dataloaders, min_lr, max_lr, num_training, mode, early_stop_threshold, datamodule)
    170 
    171     # Fit, lr & loss logged in callback
--> 172     trainer.fit(model,
    173                 train_dataloader=train_dataloader,
    174                 val_dataloaders=val_dataloaders,

~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py in fit(self, model, train_dataloader, val_dataloaders, datamodule)
    438         self.call_hook('on_fit_start')
    439 
--> 440         results = self.accelerator_backend.train()
    441         self.accelerator_backend.teardown()
    442 

~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_lightning/accelerators/dp_accelerator.py in train(self)
     95 
     96         # train or test
---> 97         results = self.train_or_test()
     98 
     99         return results

~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py in train_or_test(self)
     64             results = self.trainer.run_test()
     65         else:
---> 66             results = self.trainer.train()
     67         return results
     68 

~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py in train(self)
    481 
    482                 # run train epoch
--> 483                 self.train_loop.run_training_epoch()
    484 
    485                 if self.max_steps and self.max_steps <= self.global_step:

~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py in run_training_epoch(self)
    539             # TRAINING_STEP + TRAINING_STEP_END
    540             # ------------------------------------
--> 541             batch_output = self.run_training_batch(batch, batch_idx, dataloader_idx)
    542 
    543             # when returning -1 from train_step, we end epoch early

~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py in run_training_batch(self, batch, batch_idx, dataloader_idx)
    671                 # calculate loss (train step + train step end)
    672                 # -------------------
--> 673                 opt_closure_result = self.training_step_and_backward(
    674                     split_batch,
    675                     batch_idx,

~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py in training_step_and_backward(self, split_batch, batch_idx, opt_idx, optimizer, hiddens)
    758         """
    759         # lightning module hook
--> 760         result = self.training_step(split_batch, batch_idx, opt_idx, hiddens)
    761 
    762         if result is None:

~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py in training_step(self, split_batch, batch_idx, opt_idx, hiddens)
    302         with self.trainer.profiler.profile('model_forward'):
    303             args = self.build_train_args(split_batch, batch_idx, opt_idx, hiddens)
--> 304             training_step_output = self.trainer.accelerator_backend.training_step(args)
    305             training_step_output = self.trainer.call_hook('training_step_end', training_step_output)
    306 

~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_lightning/accelerators/dp_accelerator.py in training_step(self, args)
    108                 output = self.trainer.model(*args)
    109         else:
--> 110             output = self.trainer.model(*args)
    111         return output
    112 

~/anaconda3/envs/forecasting/lib/python3.8/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    530             result = self._slow_forward(*input, **kwargs)
    531         else:
--> 532             result = self.forward(*input, **kwargs)
    533         for hook in self._forward_hooks.values():
    534             hook_result = hook(self, input, result)

~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_lightning/overrides/data_parallel.py in forward(self, *inputs, **kwargs)
     85 
     86         replicas = self.replicate(self.module, self.device_ids[:len(inputs)])
---> 87         outputs = self.parallel_apply(replicas, inputs, kwargs)
     88 
     89         if isinstance(outputs[0], Result):

~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_lightning/overrides/data_parallel.py in parallel_apply(self, replicas, inputs, kwargs)
    149 
    150     def parallel_apply(self, replicas, inputs, kwargs):
--> 151         return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
    152 
    153 

~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_lightning/overrides/data_parallel.py in parallel_apply(modules, inputs, kwargs_tup, devices)
    308         output = results[i]
    309         if isinstance(output, Exception):
--> 310             raise output
    311         outputs.append(output)
    312     return outputs

~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_lightning/overrides/data_parallel.py in _worker(i, module, input, kwargs, device)
    261                 # CHANGE
    262                 if module.training:
--> 263                     output = module.training_step(*input, **kwargs)
    264                     fx_called = 'training_step'
    265                 elif module.testing:

~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_forecasting/models/base_model.py in training_step(self, batch, batch_idx)
    128         """
    129         x, y = batch
--> 130         log, _ = self.step(x, y, batch_idx, label="train")
    131         # log loss
    132         self.log("train_loss", log["loss"], on_step=True, on_epoch=True, prog_bar=True)

~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_forecasting/models/temporal_fusion_transformer/__init__.py in step(self, x, y, batch_idx, label)
    568         """
    569         # extract data and run model
--> 570         log, out = super().step(x, y, batch_idx, label=label)
    571         # calculate interpretations etc for latter logging
    572         if self.log_interval(label == "train") > 0:

~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_forecasting/models/base_model.py in step(self, x, y, batch_idx, label)
    204                 )
    205             else:
--> 206                 loss = self.loss(prediction, y)
    207 
    208         # logging losses

~/anaconda3/envs/forecasting/lib/python3.8/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    530             result = self._slow_forward(*input, **kwargs)
    531         else:
--> 532             result = self.forward(*input, **kwargs)
    533         for hook in self._forward_hooks.values():
    534             hook_result = hook(self, input, result)

~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_lightning/metrics/metric.py in forward(self, *args, **kwargs)
    143         """
    144         # add current step
--> 145         self.update(*args, **kwargs)
    146         self._forward_cache = None
    147 

~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_lightning/metrics/metric.py in wrapped_func(*args, **kwargs)
    189         def wrapped_func(*args, **kwargs):
    190             self._computed = None
--> 191             return update(*args, **kwargs)
    192         return wrapped_func
    193 

~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_forecasting/metrics.py in update(self, y_pred, target)
    313             weight = None
    314 
--> 315         losses = self.loss(y_pred, target)
    316         # weight samples
    317         if weight is not None:

~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_forecasting/metrics.py in loss(self, y_pred, target)
    450         losses = []
    451         for i, q in enumerate(self.quantiles):
--> 452             errors = target - y_pred[..., i]
    453             losses.append(torch.max((q - 1) * errors, q * errors).unsqueeze(-1))
    454         losses = torch.cat(losses, dim=2)

RuntimeError: The size of tensor a (30) must match the size of tensor b (43) at non-singleton dimension 1

Given that consistency, my intuition tells me the 30 is the size of my forecasting window (30 days), but I'm not sure where the 43 comes from, nor why this only happens in a multi-GPU run and not in a single-GPU run.
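
A quick check along those lines (a sketch, assuming training is the TimeSeriesDataSet from the original post and that it exposes the lengths it was constructed with):

# compare the dataset's horizon settings against the 30 and 43 in the error
print(training.max_prediction_length)  # expected to be 30, the forecast window
print(training.max_encoder_length)     # to see whether 43 relates to the encoder length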

AlexMRuch commented 3 years ago

When I use %debug in the notebook, this is what the variables that throw the error look like:

ipdb>  print(i)
0
ipdb>  print(q)
0.02
ipdb>  y_pred.shape
torch.Size([16, 43, 7])
ipdb>  y_pred[0].shape
torch.Size([43, 7])
ipdb>  y_pred[..., i].shape
torch.Size([16, 43])
ipdb>  len(target)
16
ipdb>  target[0].shape
torch.Size([30])

So y_pred is the one that's too long.
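
The mismatch is reproducible with plain tensors of those shapes (a standalone sketch, just to show the broadcasting failure):

import torch

y_pred = torch.randn(16, 43, 7)       # (batch, time, quantiles), as in the debugger output
target = torch.randn(16, 30)          # batch of 16 targets with 30 time steps
try:
    errors = target - y_pred[..., 0]  # (16, 30) - (16, 43): cannot broadcast along dim 1
except RuntimeError as e:
    print(e)  # "The size of tensor a (30) must match the size of tensor b (43) ..."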

(Still debugging)

AlexMRuch commented 3 years ago

Still exploring this issue. I wanted to unpack my notebook and make it more modular, since the pipeline is getting built up first. Everything else seems to be running smoothly now except for the multi-GPU bit. Something has to be happening to y_pred right before it goes from pytorch-forecasting into pytorch-lightning, so it should be easy to track down. I'll keep you updated. If you think of anything I should try (I can pull and install a GitHub version of pytorch-forecasting, try the edit, and do a PR), please let me know 😄 My present plan is just to print the shape of the prediction tensor as it's modified up to the point where it gets handed to pytorch-lightning and where it comes back from pytorch-lightning to pytorch-forecasting.
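
One lightweight way to do that printing without editing the installed package (a hypothetical debugging subclass; the loss(self, y_pred, target) signature is taken from the traceback above):

from pytorch_forecasting.metrics import QuantileLoss


class ShapeLoggingQuantileLoss(QuantileLoss):
    # hypothetical helper: print shapes right before the subtraction that fails
    def loss(self, y_pred, target):
        target_shape = target.shape if hasattr(target, "shape") else [t.shape for t in target]
        print("y_pred:", tuple(y_pred.shape), "target:", target_shape)
        return super().loss(y_pred, target)

# pass loss=ShapeLoggingQuantileLoss() to TemporalFusionTransformer.from_dataset(...)
# to see the shapes on every training step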

jdb78 commented 3 years ago

@AlexMRuch Any success?

AlexMRuch commented 3 years ago

Not yet. I have it on my calendar to debug after Thanksgiving. I have a few deadlines to hit before then, and the code is working satisfactorily on one GPU for now. It shouldn't be too difficult to pin down once I get time to run things, though. I definitely plan on figuring it out, as I'm looking forward to speeding up the hyperparameter optimization :-)

jdb78 commented 3 years ago

Totally understand! No rush :) Just was curious ;)

AlexMRuch commented 3 years ago

Hey hey – just wanted to update you that my projects got shifted around and we're not using time series anymore 😬 I don't think I'm going to have time to investigate this anytime soon, especially as I'm on family leave for our daughter's birth until mid-March.

jdb78 commented 3 years ago

See also #215. At least "ddp" seems to work now.

dempseyryan commented 3 years ago

Was a fix ever found for this? I tried using 2 GPUs with ddp and I have the above-mentioned issue wherein training does not start.