kabouzeid / point2vec

Self-Supervised Representation Learning on Point Clouds (GCPR 2023 | T4V Workshop @ CVPR 2023)
https://point2vec.ka.codes
MIT License

question about ddp #1

Closed CHANG1412 closed 1 year ago

CHANG1412 commented 1 year ago

Hi,

Thanks for the great work! I was trying to pretrain the model with Distributed Data Parallel, so I added "strategy": "ddp" to trainer_defaults. However, I got a RuntimeError: Tensors must be CUDA and dense. I'm just wondering whether you have encountered the same problem, and how you fixed it? Thank you! The full trace is as follows:

```
Traceback (most recent call last):
  File "/home/.conda/envs/p2v/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/.conda/envs/p2v/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/point2vec/point2vec/point2vec/main.py", line 13, in <module>
    cli = LightningCLI(
  File "/home/.conda/envs/p2v/lib/python3.10/site-packages/pytorch_lightning/cli.py", line 350, in __init__
    self._run_subcommand(self.subcommand)
  File "/home/.conda/envs/p2v/lib/python3.10/site-packages/pytorch_lightning/cli.py", line 626, in _run_subcommand
    fn(**fn_kwargs)
  File "/home/.conda/envs/p2v/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 696, in fit
    self._call_and_handle_interrupt(
  File "/home/.conda/envs/p2v/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 648, in _call_and_handle_interrupt
    return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
  File "/home/.conda/envs/p2v/lib/python3.10/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 93, in launch
    return function(*args, **kwargs)
  File "/home/.conda/envs/p2v/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 735, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/home/.conda/envs/p2v/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1105, in _run
    self._call_setup_hook()  # allow user to setup lightning_module in accelerator environment
  File "/home/.conda/envs/p2v/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1449, in _call_setup_hook
    self._call_lightning_module_hook("setup", stage=fn)
  File "/home/.conda/envs/p2v/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1550, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
  File "/home/point2vec/point2vec/point2vec/models/point2vec.py", line 186, in setup
    or self.trainer.estimated_stepping_batches
  File "/home/.conda/envs/p2v/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 2777, in estimated_stepping_batches
    self.reset_train_dataloader()
  File "/home/.conda/envs/p2v/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1849, in reset_train_dataloader
    if has_len_all_ranks(self.train_dataloader, self.strategy, module)
  File "/home/.conda/envs/p2v/lib/python3.10/site-packages/pytorch_lightning/utilities/data.py", line 150, in has_len_all_ranks
    total_length = training_type.reduce(torch.tensor(local_length).to(model.device), reduce_op="sum")
  File "/home/.conda/envs/p2v/lib/python3.10/site-packages/pytorch_lightning/strategies/ddp.py", line 347, in reduce
    tensor = sync_ddp_if_available(tensor, group, reduce_op=reduce_op)
  File "/home/.conda/envs/p2v/lib/python3.10/site-packages/pytorch_lightning/utilities/distributed.py", line 127, in sync_ddp_if_available
    return sync_ddp(result, group=group, reduce_op=reduce_op)
  File "/home/.conda/envs/p2v/lib/python3.10/site-packages/pytorch_lightning/utilities/distributed.py", line 168, in sync_ddp
    torch.distributed.all_reduce(result, op=op, group=group, async_op=False)
  File "/home/.conda/envs/p2v/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1322, in all_reduce
    work = group.allreduce([tensor], opts)
RuntimeError: Tensors must be CUDA and dense
```
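
For context, the trainer_defaults change mentioned above is essentially this (a minimal sketch; the real main.py also passes the model and datamodule configuration to LightningCLI):

```python
# Simplified sketch of the change in main.py (model/datamodule wiring elided)
from pytorch_lightning.cli import LightningCLI

if __name__ == "__main__":
    cli = LightningCLI(
        trainer_defaults={
            "strategy": "ddp",  # the added default that leads to the RuntimeError above
        },
    )
```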

kabouzeid commented 1 year ago

Unfortunately, there is a bug in Lightning 1.7.7 with multiple GPUs, where trainer.estimated_stepping_batches crashes when called from within the setup method. As a workaround, you have to set --model.fix_estimated_stepping_batches to the value that trainer.estimated_stepping_batches would normally return; for example, for ShapeNet with a batch size of 512 this is 64800.

See also this comment: https://github.com/kabouzeid/point2vec/blob/b772c428e05c290b44ae64d7beb357937e0c71ad/point2vec/models/point2vec.py#L192-L194
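
The pattern behind that comment is roughly the following (a simplified sketch, not the exact code at the link; Point2VecSketch and total_steps are illustrative names):

```python
import pytorch_lightning as pl


class Point2VecSketch(pl.LightningModule):
    """Illustrative stand-in, not the real Point2Vec module."""

    def __init__(self, fix_estimated_stepping_batches: int | None = None):
        super().__init__()
        self.save_hyperparameters()

    def setup(self, stage=None):
        # When the workaround flag is set, skip trainer.estimated_stepping_batches,
        # which crashes under DDP in Lightning 1.7.7 when called from setup().
        self.total_steps = (
            self.hparams.fix_estimated_stepping_batches
            or self.trainer.estimated_stepping_batches
        )
```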

Also, there is no need to set strategy=ddp since this is already the default.

The Fix

So, for point2vec pre-training on ShapeNet with 2 GPUs you would need to append the following to your command: --data.batch_size 256 --trainer.devices 2 --model.fix_estimated_stepping_batches 64800. For 4 GPUs it would be: --data.batch_size 128 --trainer.devices 4 --model.fix_estimated_stepping_batches 64800.
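
For reference, here is why fix_estimated_stepping_batches stays at 64800 in all of these setups (a back-of-the-envelope check): the effective global batch size is the per-GPU batch size times the number of devices, so the number of optimizer steps does not change.

```python
# Effective batch size stays at 512 in each configuration, which is the
# setting that 64800 estimated stepping batches corresponds to.
for per_gpu_batch, devices in [(512, 1), (256, 2), (128, 4)]:
    effective_batch = per_gpu_batch * devices
    print(f"{devices} GPU(s) x batch {per_gpu_batch} -> effective batch {effective_batch}")
```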