Open dempseyryan opened 3 years ago
`ddp` is the preferred solution but does not work in notebooks (see #215). The bug with `ddp_spawn` probably needs some investigation into the pickling of certain objects; in general, `ddp_spawn` is not recommended for speed reasons. See https://pytorch-lightning.readthedocs.io/en/stable/multi_gpu.html#distributed-data-parallel-spawn.
Is it possible for you to run a script instead? `ddp` seems to be the best solution.
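For example, a minimal sketch (hypothetical model and data, not from this report) of what a standalone `ddp` script looks like with the PL 1.x `gpus`/`accelerator` arguments:

```python
# train.py -- minimal sketch of a standalone ddp run (PL 1.x API); launch with: python train.py
import torch
import torch.nn.functional as F
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset


class TinyModel(pl.LightningModule):
    """Placeholder model; substitute your TemporalFusionTransformer here."""

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(8, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return F.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)


if __name__ == "__main__":  # guard matters: ddp re-launches this script once per GPU
    dataset = TensorDataset(torch.randn(256, 8), torch.randn(256, 1))
    loader = DataLoader(dataset, batch_size=32, num_workers=2)
    trainer = pl.Trainer(gpus=2, accelerator="ddp", max_epochs=1)
    trainer.fit(TinyModel(), loader)
```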
I tried training with a script (identical code, just no longer a notebook) with `accelerator='ddp'`, and I get the following error about encoder lengths:
```
Traceback (most recent call last):
  File "Beginning_pipeline.py", line 407, in <module>
    val_dataloaders=val_dataloader,
  File "/home/ubuntu/crcenv/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 473, in fit
    results = self.accelerator_backend.train()
  File "/home/ubuntu/crcenv/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 152, in train
    results = self.ddp_train(process_idx=self.task_idx, model=model)
  File "/home/ubuntu/crcenv/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 305, in ddp_train
    results = self.train_or_test()
  File "/home/ubuntu/crcenv/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 69, in train_or_test
    results = self.trainer.train()
  File "/home/ubuntu/crcenv/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 524, in train
    self.train_loop.run_training_epoch()
  File "/home/ubuntu/crcenv/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 564, in run_training_epoch
    for batch_idx, (batch, is_last_batch) in train_dataloader:
  File "/home/ubuntu/crcenv/lib/python3.7/site-packages/pytorch_lightning/profiler/profilers.py", line 83, in profile_iterable
    value = next(iterator)
  File "/home/ubuntu/crcenv/lib/python3.7/site-packages/pytorch_lightning/trainer/connectors/data_connector.py", line 46, in _with_is_last
    last = next(it)
  File "/home/ubuntu/crcenv/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 435, in __next__
    data = self._next_data()
  File "/home/ubuntu/crcenv/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1085, in _next_data
    return self._process_data(data)
  File "/home/ubuntu/crcenv/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1111, in _process_data
    data.reraise()
  File "/home/ubuntu/crcenv/lib/python3.7/site-packages/torch/_utils.py", line 428, in reraise
    raise self.exc_type(msg)
KeyError: Caught KeyError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/ubuntu/crcenv/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 198, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/ubuntu/crcenv/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
    return self.collate_fn(data)
  File "/home/ubuntu/crcenv/lib/python3.7/site-packages/pytorch_forecasting/data/timeseries.py", line 1543, in _collate_fn
    encoder_lengths = torch.tensor([batch[0]["encoder_length"] for batch in batches], dtype=torch.long)
  File "/home/ubuntu/crcenv/lib/python3.7/site-packages/pytorch_forecasting/data/timeseries.py", line 1543, in <listcomp>
    encoder_lengths = torch.tensor([batch[0]["encoder_length"] for batch in batches], dtype=torch.long)
KeyError: 0
```
I don't know enough about this stuff to speculate whether this is a) an issue with PL's fit() method or b) an issue with PF's TFT class, but I figured I would post this here in case anyone has ideas.
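One hint from the traceback: `batch[0]["encoder_length"]` raising `KeyError: 0` suggests the items reaching `_collate_fn` are dicts rather than the `(x, y)` tuples it expects. A quick way to check what the dataset and dataloader actually yield (a debugging sketch with placeholder names, assuming `training` is the `TimeSeriesDataSet` and `train_dataloader` its dataloader):

```python
# Debugging sketch (placeholder names): inspect what TimeSeriesDataSet yields and what
# a collated batch looks like, to compare against pytorch_forecasting's _collate_fn,
# which expects a list of (x_dict, y) tuples with an "encoder_length" key in x_dict.
item = training[0]                       # one raw sample straight from the dataset
print(type(item))                        # expected: tuple of (x dict, target)
if isinstance(item, tuple):
    print(type(item[0]), sorted(item[0].keys())[:5])

x, y = next(iter(train_dataloader))      # one collated batch via the single-process path
print(sorted(x.keys()))                  # should include "encoder_lengths" among others
```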
@dempseyryan any news on this? I tried it and had the same problem with DeepAR.
@nicocheh unfortunately I never wound up resolving it. I'm no longer working on the project, but my "workaround" (if you can even call it that) was to train multiple neural networks simultaneously with one GPU each. That way I could still speed up hyperparameter tuning and experimentation. Of course, once everything is tuned and you want to do a final training run, it would be nice to get this working properly...
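In concrete terms, that workaround is just pinning each run to a single device; a rough sketch (hypothetical script and arguments, not the original code):

```python
# tune.py -- sketch of the one-GPU-per-run workaround (hypothetical); launch e.g.
#   python tune.py --gpu 0 --hidden-size 16 &
#   python tune.py --gpu 1 --hidden-size 64 &
import argparse
import pytorch_lightning as pl

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--gpu", type=int, default=0)
    parser.add_argument("--hidden-size", type=int, default=16)
    args = parser.parse_args()

    # pin this process to a single GPU; no ddp/dp needed
    trainer = pl.Trainer(gpus=[args.gpu], max_epochs=10)
    # build the model with args.hidden_size and call trainer.fit(...) as in the single-GPU case
```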
Similarities
I notice this is similar to #103 and #215, which were seemingly resolved. (?)
Also I should mention I'm not sure if this is perhaps a PL issue.
Expected behavior
I would like to train a model across 2 GPUs in order to speed up training. I just set the `gpus` flag of PyTorch Lightning's Trainer constructor to 2: `gpus=2`.
Accelerator
With `accelerator='ddp'`, I get one error, while with `accelerator='dp'` the kernel is perpetually busy and training does not begin. With `accelerator='ddp_spawn'`, I get a different error. I think the one I need for my case is `ddp_spawn`, since DDP is not possible in a Jupyter notebook.
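Concretely, these are the three configurations I tried (a sketch of the PL 1.x Trainer calls, not the full script):

```python
# Sketch of the three accelerator settings tried with two GPUs (PL 1.x Trainer API).
import pytorch_lightning as pl

trainer_ddp = pl.Trainer(gpus=2, accelerator="ddp")
trainer_dp = pl.Trainer(gpus=2, accelerator="dp")
trainer_ddp_spawn = pl.Trainer(gpus=2, accelerator="ddp_spawn")
```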
Actual behavior
With `ddp_spawn`, the following error occurs: `TypeError: can't pickle torch._C.Generator objects`
With `ddp`, the kernel is perpetually busy and training doesn't start (presumably because I'm using a notebook).
With `dp`, the following error occurs: `RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1!`
I suspect that because of the way my GPUs are set up I shouldn't be using `dp`, but I'm really not sure. I figured I'd try all my options.
Is `ddp` the only one that works at this time? Is there any way of using 2 GPUs in a notebook?
Thanks for your help.
My understanding is that the sampler doesn't actually matter because PL overwrites it with `DistributedSampler` when you instantiate the Trainer with `gpus > 1`, but I might be wrong, so I included it below anyhow.
Code to reproduce the problem
This call is where all three issues occur. Please see the training and trainer initialization below:
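The original snippet is not reproduced here; the following is only a rough sketch of this kind of pytorch-forecasting setup (toy data, placeholder names and hyperparameters):

```python
# Hypothetical minimal reproduction sketch (toy data, placeholder hyperparameters),
# following the usual pytorch-forecasting pattern: TimeSeriesDataSet -> dataloaders
# -> TemporalFusionTransformer -> pl.Trainer(gpus=2, ...).fit(...)
import numpy as np
import pandas as pd
import pytorch_lightning as pl
from pytorch_forecasting import TemporalFusionTransformer, TimeSeriesDataSet

if __name__ == "__main__":  # guard for the multi-process launchers
    # toy data: 10 series, 100 time steps each
    data = pd.DataFrame(
        {
            "series": np.repeat(np.arange(10), 100).astype(str),
            "time_idx": np.tile(np.arange(100), 10),
            "value": np.random.randn(1000).astype(np.float32),
        }
    )

    training = TimeSeriesDataSet(
        data,
        time_idx="time_idx",
        target="value",
        group_ids=["series"],
        max_encoder_length=24,
        max_prediction_length=6,
        time_varying_unknown_reals=["value"],
    )
    validation = TimeSeriesDataSet.from_dataset(training, data, predict=True, stop_randomization=True)

    train_dataloader = training.to_dataloader(train=True, batch_size=64, num_workers=2)
    val_dataloader = validation.to_dataloader(train=False, batch_size=64, num_workers=2)

    tft = TemporalFusionTransformer.from_dataset(training, hidden_size=16, learning_rate=0.03)

    trainer = pl.Trainer(max_epochs=5, gpus=2, accelerator="ddp_spawn")
    # this fit() call is where the errors above occur
    trainer.fit(tft, train_dataloader=train_dataloader, val_dataloaders=val_dataloader)
```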
P.S. Loving this package. Getting some good results!
Edit: swapped error cases by accident