pyg-team / pytorch_geometric

Graph Neural Network Library for PyTorch
https://pyg.org
MIT License

Multi-GPU training error when using Lightning: 'AttributeError: _old_init' #5231

Open wesmail opened 2 years ago

wesmail commented 2 years ago

🐛 Describe the bug

I am trying to do multi-GPU training with PyTorch Lightning for a graph classification task. I started from the official example provided, but I cannot get it to work; every run fails with the error below (a sketch of my script follows the traceback):

(pyg) username@machine:~/Lightning$ python gin.py
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/8
Initializing distributed: GLOBAL_RANK: 6, MEMBER: 7/8
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/8
Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/8
Initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/8
Initializing distributed: GLOBAL_RANK: 4, MEMBER: 5/8
Initializing distributed: GLOBAL_RANK: 7, MEMBER: 8/8
Initializing distributed: GLOBAL_RANK: 5, MEMBER: 6/8
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 8 processes
----------------------------------------------------------------------------------------------------

LOCAL_RANK: 5 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 3 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 6 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 2 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 4 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 7 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]

  | Name       | Type     | Params
----------------------------------------
0 | gnn        | GIN      | 41.9 K
1 | classifier | MLP      | 4.4 K
2 | train_acc  | Accuracy | 0
3 | val_acc    | Accuracy | 0
4 | test_acc   | Accuracy | 0
----------------------------------------
46.3 K    Trainable params
0         Non-trainable params
46.3 K    Total params
0.185     Total estimated model params size (MB)
Sanity Checking: 0it [00:00, ?it/s]
Traceback (most recent call last):
  File "/home/username/Lightning/gin.py", line 93, in <module>
    main()
  File "/home/username/Lightning/gin.py", line 86, in main
    trainer.fit(model, datamodule)
  File "/home/username/miniconda3/envs/pyg/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 696, in fit
    self._call_and_handle_interrupt(
  File "/home/username/miniconda3/envs/pyg/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 648, in _call_and_handle_interrupt
    return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
  File "/home/username/miniconda3/envs/pyg/lib/python3.10/site-packages/pytorch_lightning/strategies/launchers/multiprocessing.py", line 103, in launch
    mp.start_processes(
  File "/home/username/miniconda3/envs/pyg/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/home/username/miniconda3/envs/pyg/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/username/miniconda3/envs/pyg/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/home/username/miniconda3/envs/pyg/lib/python3.10/site-packages/pytorch_lightning/strategies/launchers/multiprocessing.py", line 129, in _wrapping_function
    results = function(*args, **kwargs)
  File "/home/username/miniconda3/envs/pyg/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 737, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/home/username/miniconda3/envs/pyg/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1168, in _run
    results = self._run_stage()
  File "/home/username/miniconda3/envs/pyg/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1254, in _run_stage
    return self._run_train()
  File "/home/username/miniconda3/envs/pyg/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1276, in _run_train
    self._run_sanity_check()
  File "/home/username/miniconda3/envs/pyg/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1338, in _run_sanity_check
    val_loop._reload_evaluation_dataloaders()
  File "/home/username/miniconda3/envs/pyg/lib/python3.10/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 237, in _reload_evaluation_dataloaders    self.trainer.reset_val_dataloader()
  File "/home/username/miniconda3/envs/pyg/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1926, in reset_val_dataloader
    self.num_val_batches, self.val_dataloaders = self._data_connector._reset_eval_dataloader(
  File "/home/username/miniconda3/envs/pyg/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/data_connector.py", line 344, in _reset_eval_dataloader
    dataloaders = self._request_dataloader(mode)
  File "/home/username/miniconda3/envs/pyg/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/data_connector.py", line 427, in _request_dataloader
    with _replace_init_method(DataLoader, "dataset"), _replace_init_method(BatchSampler):
  File "/home/username/miniconda3/envs/pyg/lib/python3.10/contextlib.py", line 142, in __exit__
    next(self.gen)
  File "/home/username/miniconda3/envs/pyg/lib/python3.10/site-packages/pytorch_lightning/utilities/data.py", line 527, in _replace_init_method
    del cls._old_init
AttributeError: _old_init
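
My script follows the official PyTorch Lightning graph classification example closely. A minimal sketch of that setup is below; the dataset (MUTAG), the hyperparameters, and the hard-coded 8-GPU spawn strategy are placeholders standing in for my actual configuration, not the exact example code:

# Rough sketch of the failing setup (placeholder dataset/hyperparameters),
# loosely following the official pytorch_lightning GIN example.
import torch
import torch.nn.functional as F
import pytorch_lightning as pl

from torch_geometric.data import LightningDataset
from torch_geometric.datasets import TUDataset
from torch_geometric.nn import GIN, MLP, global_add_pool


class GraphClassifier(pl.LightningModule):
    def __init__(self, in_channels: int, num_classes: int, hidden: int = 64):
        super().__init__()
        self.gnn = GIN(in_channels, hidden, num_layers=3, dropout=0.5)
        self.classifier = MLP([hidden, hidden, num_classes], norm=None, dropout=0.5)

    def forward(self, x, edge_index, batch):
        x = self.gnn(x, edge_index)          # node embeddings
        x = global_add_pool(x, batch)        # graph-level readout
        return self.classifier(x)

    def training_step(self, data, batch_idx):
        logits = self(data.x, data.edge_index, data.batch)
        loss = F.cross_entropy(logits, data.y)
        self.log("train_loss", loss, batch_size=data.num_graphs)
        return loss

    def validation_step(self, data, batch_idx):
        logits = self(data.x, data.edge_index, data.batch)
        acc = (logits.argmax(dim=-1) == data.y).float().mean()
        self.log("val_acc", acc, batch_size=data.num_graphs)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=0.01)


def main():
    # Placeholder dataset; my real data differs.
    dataset = TUDataset(root="data/TUDataset", name="MUTAG").shuffle()
    train_dataset, val_dataset = dataset[:150], dataset[150:]

    # LightningDataset wraps PyG's DataLoader for use with the Trainer.
    datamodule = LightningDataset(train_dataset, val_dataset,
                                  batch_size=32, num_workers=4)

    model = GraphClassifier(dataset.num_features, dataset.num_classes)

    # 8 GPUs with the spawn-based DDP launcher, matching the log above.
    trainer = pl.Trainer(accelerator="gpu", devices=8,
                         strategy="ddp_spawn", max_epochs=30)
    trainer.fit(model, datamodule)


if __name__ == "__main__":
    main()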

Environment

rusty1s commented 2 years ago

Can you try moving to PL 1.6.* or installing PyTorch Lightning from master? There was a bug in PL that caused this, but as far as I know it has since been fixed.
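
For reference, a quick way to confirm which Lightning version is actually active in the environment before re-running (just a sanity check; the fix itself is simply switching the pytorch-lightning package version):

# Sanity check: print the installed PyTorch Lightning version to confirm
# that the environment really picked up 1.6.* (or a master build).
import pytorch_lightning as pl
print(pl.__version__)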