I am trying to do multi-GPU training with PyTorch Lightning for a graph classification task.
I started with the official example provided, but I can't get it working without hitting this error:
(pyg) username@machine:~/Lightning$ python gin.py
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/8
Initializing distributed: GLOBAL_RANK: 6, MEMBER: 7/8
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/8
Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/8
Initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/8
Initializing distributed: GLOBAL_RANK: 4, MEMBER: 5/8
Initializing distributed: GLOBAL_RANK: 7, MEMBER: 8/8
Initializing distributed: GLOBAL_RANK: 5, MEMBER: 6/8
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 8 processes
----------------------------------------------------------------------------------------------------
LOCAL_RANK: 5 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 3 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 6 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 2 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 4 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 7 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
| Name | Type | Params
----------------------------------------
0 | gnn | GIN | 41.9 K
1 | classifier | MLP | 4.4 K
2 | train_acc | Accuracy | 0
3 | val_acc | Accuracy | 0
4 | test_acc | Accuracy | 0
----------------------------------------
46.3 K Trainable params
0 Non-trainable params
46.3 K Total params
0.185 Total estimated model params size (MB)
Sanity Checking: 0it [00:00, ?it/s]Traceback (most recent call last):
File "/home/username/Lightning/gin.py", line 93, in <module>
main()
File "/home/username/Lightning/gin.py", line 86, in main
trainer.fit(model, datamodule)
File "/home/username/miniconda3/envs/pyg/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 696, in fit
self._call_and_handle_interrupt(
File "/home/username/miniconda3/envs/pyg/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 648, in _call_and_handle_interrupt
return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
File "/home/username/miniconda3/envs/pyg/lib/python3.10/site-packages/pytorch_lightning/strategies/launchers/multiprocessing.py", line 103, in launch
mp.start_processes(
File "/home/username/miniconda3/envs/pyg/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
while not context.join():
File "/home/username/miniconda3/envs/pyg/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 160, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/home/username/miniconda3/envs/pyg/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
fn(i, *args)
File "/home/username/miniconda3/envs/pyg/lib/python3.10/site-packages/pytorch_lightning/strategies/launchers/multiprocessing.py", line 129, in _wrapping_function
results = function(*args, **kwargs)
File "/home/username/miniconda3/envs/pyg/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 737, in _fit_impl
results = self._run(model, ckpt_path=self.ckpt_path)
File "/home/username/miniconda3/envs/pyg/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1168, in _run
results = self._run_stage()
File "/home/username/miniconda3/envs/pyg/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1254, in _run_stage
return self._run_train()
File "/home/username/miniconda3/envs/pyg/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1276, in _run_train
self._run_sanity_check()
File "/home/username/miniconda3/envs/pyg/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1338, in _run_sanity_check
val_loop._reload_evaluation_dataloaders()
File "/home/username/miniconda3/envs/pyg/lib/python3.10/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 237, in _reload_evaluation_dataloaders
self.trainer.reset_val_dataloader()
File "/home/username/miniconda3/envs/pyg/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1926, in reset_val_dataloader
self.num_val_batches, self.val_dataloaders = self._data_connector._reset_eval_dataloader(
File "/home/username/miniconda3/envs/pyg/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/data_connector.py", line 344, in _reset_eval_dataloader
dataloaders = self._request_dataloader(mode)
File "/home/username/miniconda3/envs/pyg/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/data_connector.py", line 427, in _request_dataloader
with _replace_init_method(DataLoader, "dataset"), _replace_init_method(BatchSampler):
File "/home/username/miniconda3/envs/pyg/lib/python3.10/contextlib.py", line 142, in __exit__
next(self.gen)
File "/home/username/miniconda3/envs/pyg/lib/python3.10/site-packages/pytorch_lightning/utilities/data.py", line 527, in _replace_init_method
del cls._old_init
AttributeError: _old_init
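For context, the final `AttributeError: _old_init` comes out of Lightning's dataloader patching: `_replace_init_method` stores the original `__init__` on the class as `_old_init` and deletes it again on exit. A minimal stdlib sketch of that failure mode (the class and function names here are illustrative, not Lightning's actual code) shows how a subclass that only *inherits* `_old_init` breaks the cleanup:

```python
# Minimal illustration of the crash (illustrative names, not Lightning's code):
# the patcher stores the original __init__ as a class attribute, but later
# tries to delete it from a class that only inherits it.
class BaseLoader:
    def __init__(self):
        pass

def patch(cls):
    cls._old_init = cls.__init__          # remember the original __init__
    cls.__init__ = lambda self: None      # install a stand-in

def unpatch(cls):
    cls.__init__ = cls._old_init          # attribute lookup walks the MRO, so this works
    del cls._old_init                     # but del only touches cls's own dict -> crash

patch(BaseLoader)

# A subclass defined after patching (think: a PyG DataLoader deriving from
# torch's DataLoader) inherits _old_init without owning it.
class GraphLoader(BaseLoader):
    pass

try:
    unpatch(GraphLoader)
except AttributeError as exc:
    print("AttributeError:", exc)         # same error the traceback ends with
```

The restore line succeeds because reading `cls._old_init` resolves through the base class, while `del cls._old_init` only looks at the subclass's own `__dict__`, which is exactly the asymmetry the Lightning fix addressed.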
Environment
PyG version: 2.1.0
PyTorch version: 1.12.1
OS: Linux 4.15.0-189-generic x86_64
Python version: 3.10.4
CUDA/cuDNN version: 11.6
How you installed PyTorch and PyG (conda, pip, source): PyTorch with conda; PyG with pip, following the official documentation.
Any other relevant information (e.g., version of torch-scatter):
Can you try moving to PL 1.6.* or to PyTorch Lightning master? There was a bug in PL that caused this, but as far as I know it has now been fixed.
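For reference, the suggested upgrade can be done with pip inside the same conda environment; which exact patch release first contains the fix is not stated above, so treat the version you end up with as something to verify:

```shell
# Upgrade PyTorch Lightning in the active environment; per the reply,
# any recent 1.6.x release or newer should include the fix.
pip install --upgrade pytorch-lightning

# Confirm which version actually got installed:
python -c "import pytorch_lightning; print(pytorch_lightning.__version__)"
```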