pyg-team / pytorch_geometric

Graph Neural Network Library for PyTorch
https://pyg.org
MIT License

Issues with DataParallel #1823

Open vymao opened 3 years ago

vymao commented 3 years ago

🐛 Bug

I have been having intermittent issues with the DataParallel module, which I use to parallelize GPU training (I use 2 GPUs here). I am getting the following error:

  File "/path/to/run.py", line 64, in train
    out = model(data).view(-1)
  File "/n/scratch3/users/v/vym1/nn/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/n/scratch3/users/v/vym1/nn/lib/python3.7/site-packages/torch_geometric/nn/data_parallel.py", line 58, in forward
    inputs = self.scatter(data_list, self.device_ids)
  File "/n/scratch3/users/v/vym1/nn/lib/python3.7/site-packages/torch_geometric/nn/data_parallel.py", line 80, in scatter
    for i in range(len(split) - 1)
  File "/n/scratch3/users/v/vym1/nn/lib/python3.7/site-packages/torch_geometric/nn/data_parallel.py", line 80, in <listcomp>
    for i in range(len(split) - 1)
IndexError: list index out of range

This problem occurs at random epochs when I rerun the training (here, it occurred in the 7th epoch), and I am not sure why. Because I can run some number of epochs without error, it seems more likely to be a problem with the module than with the computation.

Do you know what might be causing this error?

rusty1s commented 3 years ago

That's weird and hard to reproduce without more information. It would be great if you could debug this error. In particular, what's the value of split.tolist() at line 80 of data_parallel.py?

vymao commented 3 years ago

Is there a way I can find out? split.tolist() is in the module code.

rusty1s commented 3 years ago

You could use a debugger, or add print statements to /n/scratch3/users/v/vym1/nn/lib/python3.7/site-packages/torch_geometric/nn/data_parallel.py.
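If editing the installed package is inconvenient, a rough alternative is to catch the exception in the training script and log what was passed in. The sketch below is only an illustration: `describe_batch` and `forward_with_dump` are hypothetical helper names, not part of torch_geometric, and it assumes the input is a Python list of graph objects that expose a `num_nodes` attribute. It does not print `split` itself, only the inputs that the scatter step would have received.

```python
import logging

logger = logging.getLogger("dp_debug")


def describe_batch(data_list):
    """Summarize a list of graph objects without touching any GPU state."""
    return {
        "num_graphs": len(data_list),
        # num_nodes is read defensively; None marks objects that lack it.
        "num_nodes": [getattr(d, "num_nodes", None) for d in data_list],
    }


def forward_with_dump(model, data_list):
    """Call model(data_list); on IndexError, log a batch summary and re-raise."""
    try:
        return model(data_list)
    except IndexError:
        # Logged only on failure, so an intermittent error costs nothing
        # on the epochs that run cleanly.
        logger.error("forward failed for batch: %s", describe_batch(data_list))
        raise
```

In the traceback above, this would mean replacing `out = model(data).view(-1)` in `train` with `out = forward_with_dump(model, data).view(-1)`. Because the exception is re-raised, the run still stops at the same place, but the log then shows how many graphs were in the failing batch and how large each one was.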

vymao commented 3 years ago

Ok. This error seems to occur randomly as well, so it seems difficult to track down.