Unfortunately, there is a bug in Lightning 1.7.7 with multiple GPUs, where trainer.estimated_stepping_batches
crashes when called from within the setup method. As a workaround, you have to set
--model.fix_estimated_stepping_batches to the correct value, i.e. the value that
trainer.estimated_stepping_batches would otherwise return. For example, for ShapeNet
with an effective batch size of 512 this is 64800.
See also this comment: https://github.com/kabouzeid/point2vec/blob/b772c428e05c290b44ae64d7beb357937e0c71ad/point2vec/models/point2vec.py#L192-L194
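For reference, the workaround boils down to preferring a manually supplied step count over the Trainer's estimate. Here is a minimal sketch of that pattern (the module name, hyperparameter handling, and num_training_steps attribute are illustrative, not the verbatim point2vec code):

```python
# Minimal sketch of the workaround pattern, assuming a LightningCLI-configured module.
# Idea: if --model.fix_estimated_stepping_batches is given, use it directly and never
# touch trainer.estimated_stepping_batches (which crashes under DDP in Lightning 1.7.7).
import pytorch_lightning as pl


class PretrainModule(pl.LightningModule):  # illustrative name
    def __init__(self, fix_estimated_stepping_batches: int | None = None):
        super().__init__()
        self.save_hyperparameters()

    def setup(self, stage: str) -> None:
        # Fall back to the Trainer's estimate only when no override is provided.
        self.num_training_steps = (
            self.hparams.fix_estimated_stepping_batches
            or self.trainer.estimated_stepping_batches
        )
```

The 64800 in the commands below simply fills in this override.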
Also, there is no need to set strategy=ddp
since this is already the default.
So, for point2vec pre-training on ShapeNet with 2 GPUs you would need to append the following to your command:
--data.batch_size 256 --trainer.devices 2 --model.fix_estimated_stepping_batches 64800
For 4 GPUs it would be: --data.batch_size 128 --trainer.devices 4 --model.fix_estimated_stepping_batches 64800
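If you train with a different effective batch size or epoch count, you can estimate the value to pass yourself. The helper below is a rough stand-in for how Lightning arrives at the number (assumptions: the dataset is sharded across devices under DDP, and drop_last/sampler details may shift the count slightly; dataset_len and max_epochs are placeholders for your run):

```python
import math


def stepping_batches(dataset_len: int, batch_size: int, devices: int,
                     max_epochs: int, accumulate_grad_batches: int = 1) -> int:
    """Rough stand-in for trainer.estimated_stepping_batches under DDP."""
    # Each process sees dataset_len / devices samples, batched per GPU.
    batches_per_epoch = math.ceil(dataset_len / (batch_size * devices))
    steps_per_epoch = math.ceil(batches_per_epoch / accumulate_grad_batches)
    return steps_per_epoch * max_epochs
```

Note that halving the per-GPU batch size while doubling the devices keeps batch_size * devices (and hence the step count) unchanged, which is why 64800 is the right value in both commands above.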
Hi,
Thanks for the great work! I was trying to pretrain the model with Distributed Data Parallel, so I added "strategy": "ddp" to trainer_defaults. However, I got a RuntimeError: Tensors must be CUDA and dense. I'm wondering whether you have encountered the same problem, and if so, how you fixed it? Thank you! The full traceback is as follows:
Traceback (most recent call last):
File "/home/.conda/envs/p2v/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/.conda/envs/p2v/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/home/point2vec/point2vec/point2vec/main.py", line 13, in <module>
cli = LightningCLI(
File "/home/.conda/envs/p2v/lib/python3.10/site-packages/pytorch_lightning/cli.py", line 350, in __init__
self._run_subcommand(self.subcommand)
File "/home/.conda/envs/p2v/lib/python3.10/site-packages/pytorch_lightning/cli.py", line 626, in _run_subcommand
fn(**fn_kwargs)
File "/home/.conda/envs/p2v/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 696, in fit
self._call_and_handle_interrupt(
File "/home/.conda/envs/p2v/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 648, in _call_and_handle_interrupt
return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
File "/home/.conda/envs/p2v/lib/python3.10/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 93, in launch
return function(*args, **kwargs)
File "/home/.conda/envs/p2v/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 735, in _fit_impl
results = self._run(model, ckpt_path=self.ckpt_path)
File "/home/.conda/envs/p2v/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1105, in _run
self._call_setup_hook() # allow user to setup lightning_module in accelerator environment
File "/home/.conda/envs/p2v/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1449, in _call_setup_hook
self._call_lightning_module_hook("setup", stage=fn)
File "/home/.conda/envs/p2v/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1550, in _call_lightning_module_hook
output = fn(*args, **kwargs)
File "/home/point2vec/point2vec/point2vec/models/point2vec.py", line 186, in setup
or self.trainer.estimated_stepping_batches
File "/home/.conda/envs/p2v/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 2777, in estimated_stepping_batches
self.reset_train_dataloader()
File "/home/.conda/envs/p2v/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1849, in reset_train_dataloader
if has_len_all_ranks(self.train_dataloader, self.strategy, module)
File "/home/.conda/envs/p2v/lib/python3.10/site-packages/pytorch_lightning/utilities/data.py", line 150, in has_len_all_ranks
total_length = training_type.reduce(torch.tensor(local_length).to(model.device), reduce_op="sum")
File "/home/.conda/envs/p2v/lib/python3.10/site-packages/pytorch_lightning/strategies/ddp.py", line 347, in reduce
tensor = sync_ddp_if_available(tensor, group, reduce_op=reduce_op)
File "/home/.conda/envs/p2v/lib/python3.10/site-packages/pytorch_lightning/utilities/distributed.py", line 127, in sync_ddp_if_available
return sync_ddp(result, group=group, reduce_op=reduce_op)
File "/home/.conda/envs/p2v/lib/python3.10/site-packages/pytorch_lightning/utilities/distributed.py", line 168, in sync_ddp
torch.distributed.all_reduce(result, op=op, group=group, async_op=False)
File "/home/.conda/envs/p2v/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1322, in all_reduce
work = group.allreduce([tensor], opts)
RuntimeError: Tensors must be CUDA and dense