Closed · zhrli closed this issue 1 year ago
This has to do with the spectral normalization wrappers, for some reason they don't seem to work in multi-GPU setup. I'd recommend trying to train with a single GPU for now.
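For context on why the wrappers misbehave under DDP: `spectral_norm` keeps its power-iteration state in module buffers, and by default DDP broadcasts all buffers from rank 0 at every forward pass, which can clobber the state each replica updates in place. A minimal single-process sketch just inspecting those buffers (illustrative, not the DGMR code itself):

```python
import torch
import torch.nn as nn

# spectral_norm registers its power-iteration vectors as buffers
# ("weight_u" / "weight_v"). DDP broadcasts buffers from rank 0 on
# each forward by default, overwriting per-replica state.
layer = nn.utils.spectral_norm(nn.Linear(4, 4))

buffer_names = [name for name, _ in layer.named_buffers()]
print(sorted(buffer_names))
```

This is why disabling buffer broadcasting (below) avoids the conflict: each replica then keeps its own power-iteration state.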
I found a solution, but I'm not sure how to add this method in PyTorch Lightning.
The write-up (in Chinese) is here: https://blog.csdn.net/qq_39237205/article/details/125728708
He said he fixed it by setting `broadcast_buffers=False` when wrapping the model:

```python
model = nn.parallel.DistributedDataParallel(model, device_ids=[args.local_rank], broadcast_buffers=False, find_unused_parameters=True)
```
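For Lightning users, the same flags can likely be passed through the DDP strategy rather than wrapping the model by hand. A hedged configuration sketch, assuming a PyTorch Lightning version where `DDPStrategy` forwards extra keyword arguments to `torch.nn.parallel.DistributedDataParallel`:

```python
import pytorch_lightning as pl
from pytorch_lightning.strategies import DDPStrategy

# Sketch under assumptions: DDPStrategy passes unrecognized kwargs
# straight through to DistributedDataParallel, so broadcast_buffers
# and find_unused_parameters end up on the DDP wrapper.
trainer = pl.Trainer(
    accelerator="gpu",
    devices=2,
    strategy=DDPStrategy(
        broadcast_buffers=False,      # stop rank 0 overwriting spectral-norm buffers
        find_unused_parameters=True,  # matches the raw-DDP snippet above
    ),
)
# trainer.fit(model, datamodule)
```

In older Lightning releases the equivalent class is `DDPPlugin` under `pytorch_lightning.plugins`; the keyword arguments are the same.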
Oh cool! Glad there is a way to do it.
@all-contributors please add @zhrli for question
@peterdudfield
I've put up a pull request to add @zhrli! :tada:
Shall I close this for the moment?
Yeah, sounds good, seems like there is a solution for it.
@zhrli Hi, did you get the multiple GPU setup to work?
```
Epoch 0:   0%|          | 0/9057 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "./train/run.py", line 294, in <module>
    trainer.fit(model, datamodule)
  File "/public/home/lizhaorui/conda-envs/torch-1.9/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 772, in fit
    self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
  File "/public/home/lizhaorui/conda-envs/torch-1.9/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 724, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/public/home/lizhaorui/conda-envs/torch-1.9/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 812, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/public/home/lizhaorui/conda-envs/torch-1.9/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1237, in _run
    results = self._run_stage()
  File "/public/home/lizhaorui/conda-envs/torch-1.9/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1324, in _run_stage
    return self._run_train()
  File "/public/home/lizhaorui/conda-envs/torch-1.9/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1354, in _run_train
    self.fit_loop.run()
  File "/public/home/lizhaorui/conda-envs/torch-1.9/lib/python3.7/site-packages/pytorch_lightning/loops/base.py", line 204, in run
    self.advance(*args, **kwargs)
  File "/public/home/lizhaorui/conda-envs/torch-1.9/lib/python3.7/site-packages/pytorch_lightning/loops/fit_loop.py", line 269, in advance
    self._outputs = self.epoch_loop.run(self._data_fetcher)
  File "/public/home/lizhaorui/conda-envs/torch-1.9/lib/python3.7/site-packages/pytorch_lightning/loops/base.py", line 204, in run
    self.advance(*args, **kwargs)
  File "/public/home/lizhaorui/conda-envs/torch-1.9/lib/python3.7/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 208, in advance
    batch_output = self.batch_loop.run(batch, batch_idx)
  File "/public/home/lizhaorui/conda-envs/torch-1.9/lib/python3.7/site-packages/pytorch_lightning/loops/base.py", line 204, in run
    self.advance(*args, **kwargs)
  File "/public/home/lizhaorui/conda-envs/torch-1.9/lib/python3.7/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 90, in advance
    outputs = self.manual_loop.run(split_batch, batch_idx)
  File "/public/home/lizhaorui/conda-envs/torch-1.9/lib/python3.7/site-packages/pytorch_lightning/loops/base.py", line 204, in run
    self.advance(*args, **kwargs)
  File "/public/home/lizhaorui/conda-envs/torch-1.9/lib/python3.7/site-packages/pytorch_lightning/loops/optimization/manual_loop.py", line 115, in advance
    training_step_output = self.trainer._call_strategy_hook("training_step", *step_kwargs.values())
  File "/public/home/lizhaorui/conda-envs/torch-1.9/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1766, in _call_strategy_hook
    output = fn(*args, **kwargs)
  File "/public/home/lizhaorui/conda-envs/torch-1.9/lib/python3.7/site-packages/pytorch_lightning/strategies/strategy.py", line 333, in training_step
    return self.model.training_step(*args, **kwargs)
  File "/public/home/lizhaorui/conda-envs/torch-1.9/lib/python3.7/site-packages/dgmr/dgmr.py", line 140, in training_step
    self.manual_backward(discriminator_loss)
  File "/public/home/lizhaorui/conda-envs/torch-1.9/lib/python3.7/site-packages/pytorch_lightning/core/lightning.py", line 1351, in manual_backward
    self.trainer.strategy.backward(loss, None, None, *args, **kwargs)
  File "/public/home/lizhaorui/conda-envs/torch-1.9/lib/python3.7/site-packages/pytorch_lightning/strategies/strategy.py", line 168, in backward
    self.precision_plugin.backward(self.lightning_module, closure_loss, *args, **kwargs)
  File "/public/home/lizhaorui/conda-envs/torch-1.9/lib/python3.7/site-packages/pytorch_lightning/plugins/precision/precision_plugin.py", line 80, in backward
    model.backward(closure_loss, optimizer, *args, **kwargs)
  File "/public/home/lizhaorui/conda-envs/torch-1.9/lib/python3.7/site-packages/pytorch_lightning/core/lightning.py", line 1370, in backward
    loss.backward(*args, **kwargs)
  File "/public/home/lizhaorui/conda-envs/torch-1.9/lib/python3.7/site-packages/torch/_tensor.py", line 255, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/public/home/lizhaorui/conda-envs/torch-1.9/lib/python3.7/site-packages/torch/autograd/__init__.py", line 149, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [1]] is at version 1; expected version 0 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
```
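As the hint in the error suggests, anomaly detection points at the forward op that wrote in place. A tiny self-contained reproduction (a hypothetical example, not the DGMR code) showing the same error class and how to wrap the region:

```python
import torch

x = torch.ones(2, requires_grad=True)
y = x.exp()   # backward of exp() reuses its *output* y
y.add_(1)     # in-place write bumps y's version counter

# With anomaly detection enabled, the backward error additionally
# prints a traceback locating the forward op that was modified.
with torch.autograd.detect_anomaly():
    try:
        y.sum().backward()
    except RuntimeError as err:
        print("caught:", type(err).__name__)
```

In the DGMR case, the versioned tensor is plausibly a spectral-norm buffer updated in place on one rank and then overwritten by DDP's buffer broadcast, which is consistent with `broadcast_buffers=False` making the error go away.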