Unfortunately, there is a bug in Lightning 1.7.7 with multiple GPUs, where trainer.estimated_stepping_batches
crashes when called from within the setup method. As a workaround, you have to set
--model.fix_estimated_stepping_batches to the correct value, i.e. the value that
trainer.estimated_stepping_batches would otherwise return. For example, for ShapeNet
with an effective batch size of 512 this is 64800.
See also this comment: https://github.com/kabouzeid/point2vec/blob/b772c428e05c290b44ae64d7beb357937e0c71ad/point2vec/models/point2vec.py#L192-L194
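For reference, the workaround boils down to preferring a manually supplied step count over the Trainer's estimate. Here is a minimal sketch of that pattern (the module name, hyperparameter handling, and num_training_steps attribute are illustrative, not the verbatim point2vec code):

```python
# Minimal sketch of the workaround pattern, assuming a LightningCLI-configured module.
# Idea: if --model.fix_estimated_stepping_batches is given, use it directly and never
# touch trainer.estimated_stepping_batches (which crashes under DDP in Lightning 1.7.7).
import pytorch_lightning as pl


class PretrainModule(pl.LightningModule):  # illustrative name
    def __init__(self, fix_estimated_stepping_batches: int | None = None):
        super().__init__()
        self.save_hyperparameters()

    def setup(self, stage: str) -> None:
        # Fall back to the Trainer's estimate only when no override is provided.
        self.num_training_steps = (
            self.hparams.fix_estimated_stepping_batches
            or self.trainer.estimated_stepping_batches
        )
```

The 64800 in the commands below simply fills in this override.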
Also, there is no need to set strategy=ddp
since this is already the default.
So, for point2vec pre-training on ShapeNet with 2 GPUs you would need to append the following to your command:
--data.batch_size 256 --trainer.devices 2 --model.fix_estimated_stepping_batches 64800
For 4 GPUs it would be: --data.batch_size 128 --trainer.devices 4 --model.fix_estimated_stepping_batches 64800
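If you train with a different effective batch size or epoch count, you can estimate the value to pass yourself. The helper below is a rough stand-in for how Lightning arrives at the number (assumptions: the dataset is sharded across devices under DDP, and drop_last/sampler details may shift the count slightly; dataset_len and max_epochs are placeholders for your run):

```python
import math


def stepping_batches(dataset_len: int, batch_size: int, devices: int,
                     max_epochs: int, accumulate_grad_batches: int = 1) -> int:
    """Rough stand-in for trainer.estimated_stepping_batches under DDP."""
    # Each process sees dataset_len / devices samples, batched per GPU.
    batches_per_epoch = math.ceil(dataset_len / (batch_size * devices))
    steps_per_epoch = math.ceil(batches_per_epoch / accumulate_grad_batches)
    return steps_per_epoch * max_epochs
```

Note that halving the per-GPU batch size while doubling the devices keeps batch_size * devices (and hence the step count) unchanged, which is why 64800 is the right value in both commands above.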
Hi,
Thanks for the great work! I was trying to pretrain the model with Distributed Data Parallel, so I added "strategy": "ddp" to trainer_defaults. However, I got a RuntimeError: Tensors must be CUDA and dense. I'm wondering whether you have encountered the same problem, and if so, how you fixed it? Thank you! The full traceback is as follows:
Traceback (most recent call last):
File "/home/.conda/envs/p2v/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/.conda/envs/p2v/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/home/point2vec/point2vec/point2vec/main.py", line 13, in <module>
cli = LightningCLI(
File "/home/.conda/envs/p2v/lib/python3.10/site-packages/pytorch_lightning/cli.py", line 350, in __init__
self._run_subcommand(self.subcommand)
File "/home/.conda/envs/p2v/lib/python3.10/site-packages/pytorch_lightning/cli.py", line 626, in _run_subcommand
fn(**fn_kwargs)
File "/home/.conda/envs/p2v/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 696, in fit
self._call_and_handle_interrupt(
File "/home/.conda/envs/p2v/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 648, in _call_and_handle_interrupt
return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
File "/home/.conda/envs/p2v/lib/python3.10/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 93, in launch
return function(*args, **kwargs)
File "/home/.conda/envs/p2v/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 735, in _fit_impl
results = self._run(model, ckpt_path=self.ckpt_path)
File "/home/.conda/envs/p2v/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1105, in _run
self._call_setup_hook() # allow user to setup lightning_module in accelerator environment
File "/home/.conda/envs/p2v/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1449, in _call_setup_hook
self._call_lightning_module_hook("setup", stage=fn)
File "/home/.conda/envs/p2v/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1550, in _call_lightning_module_hook
output = fn(*args, **kwargs)
File "/home/point2vec/point2vec/point2vec/models/point2vec.py", line 186, in setup
or self.trainer.estimated_stepping_batches
File "/home/.conda/envs/p2v/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 2777, in estimated_stepping_batches
self.reset_train_dataloader()
File "/home/.conda/envs/p2v/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1849, in reset_train_dataloader
if has_len_all_ranks(self.train_dataloader, self.strategy, module)
File "/home/.conda/envs/p2v/lib/python3.10/site-packages/pytorch_lightning/utilities/data.py", line 150, in has_len_all_ranks
total_length = training_type.reduce(torch.tensor(local_length).to(model.device), reduce_op="sum")
File "/home/.conda/envs/p2v/lib/python3.10/site-packages/pytorch_lightning/strategies/ddp.py", line 347, in reduce
tensor = sync_ddp_if_available(tensor, group, reduce_op=reduce_op)
File "/home/.conda/envs/p2v/lib/python3.10/site-packages/pytorch_lightning/utilities/distributed.py", line 127, in sync_ddp_if_available
return sync_ddp(result, group=group, reduce_op=reduce_op)
File "/home/.conda/envs/p2v/lib/python3.10/site-packages/pytorch_lightning/utilities/distributed.py", line 168, in sync_ddp
torch.distributed.all_reduce(result, op=op, group=group, async_op=False)
File "/home/.conda/envs/p2v/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1322, in all_reduce
work = group.allreduce([tensor], opts)
RuntimeError: Tensors must be CUDA and dense