pyg-team / pytorch_geometric

Graph Neural Network Library for PyTorch
https://pyg.org
MIT License

torch_cluster fails in DynamicEdgeConv when parallelising over multiple GPUs #2329

Open dr-stringfellow opened 3 years ago

dr-stringfellow commented 3 years ago

🐛 Bug

torch_cluster fails in DynamicEdgeConv when parallelising over multiple GPUs. Training script with network architecture:

https://github.com/lgray/deepjet-geometric/blob/master/examples/puma_v1_train.py

Error:

500 / 39000
550 / 39000
600 / 39000
Traceback (most recent call last):
  File "puma_v1_train.py", line 134, in <module>
    loss = train()
  File "puma_v1_train.py", line 123, in train
    data.x_glob_batch)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 726, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 161, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 171, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
    output.reraise()
  File "/opt/conda/lib/python3.6/site-packages/torch/_utils.py", line 420, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in replica 1 on device 1.
Original Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 726, in _call_impl
    result = self.forward(*input, **kwargs)
  File "puma_v1_train.py", line 81, in forward
    feats1 = self.conv(x=(x_clus_enc, x_pfc_enc), batch=(batch_clus, batch_pfc))
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 726, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/workspace/pytorch_geometric/torch_geometric/nn/conv/edge_conv.py", line 113, in forward
    num_workers=self.num_workers)
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/torch_cluster/knn.py", line 75, in knn
    torch.cumsum(deg, 0, out=ptr_y[1:])
    return torch.ops.torch_cluster.knn(x, y, ptr_x, ptr_y, k, cosine,
                                       num_workers)
RuntimeError: ptr_x.value().numel() == ptr_y.value().numel() INTERNAL ASSERT FAILED at "/tmp/pip-req-build-g4jm3lqe/csrc/cuda/knn_cuda.cu":106, please report a bug to PyTorch. Input mismatch
HDF5: infinite loop closing library

To Reproduce

Steps to reproduce the behavior:

  1. Download data file http://t3serv001.mit.edu/~bmaier/torchbug/TTBar_736.h5
  2. Use the docker image (CUDA 11.0) docker://benediktmaier/torch-geometric:20.09-py3
  3. Use this training script: https://github.com/lgray/deepjet-geometric/blob/master/examples/puma_v1_train.py
  4. Add parallelisation after line 98:

if torch.cuda.device_count() > 1:
    puma = nn.DataParallel(puma)
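
Expanded for context, a sketch of what that modification looks like in place. The nn.Linear stand-in is hypothetical; the real script builds its own network around line 98 of puma_v1_train.py.

import torch
from torch import nn

# Hypothetical stand-in for the `puma` model constructed in the script;
# any nn.Module works for illustrating the wrapping.
puma = nn.Linear(8, 4)

# Step 4: replicate the model across all visible GPUs with torch.nn.DataParallel.
if torch.cuda.device_count() > 1:
    puma = nn.DataParallel(puma)

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
puma = puma.to(device)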

Expected behavior

Multi-GPU training should work, but it doesn't :) It does work on a single GPU.

rusty1s commented 3 years ago

I don't think this has anything to do with multi-GPU support in torch_cluster.knn. For example, the following code runs without issues for me:

import torch
from torch_geometric.datasets import MNISTSuperpixels
from torch_geometric.data import DataListLoader
from torch_geometric.nn import DataParallel
from torch_geometric.nn import knn_graph

dataset = MNISTSuperpixels('../data/MNIST')[:64]
loader = DataListLoader(dataset, batch_size=32, shuffle=True)

class Net(torch.nn.Module):
    def __init__(self):
        super(Net, self).__init__()

    def forward(self, data):
        _ = knn_graph(data.x, 6, data.batch)
        return torch.tensor([1], device=data.x.device)

model = Net()
print('Let\'s use', torch.cuda.device_count(), 'GPUs!')
model = DataParallel(model)
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
model = model.to(device)

for data_list in loader:
    output = model(data_list)

This is more likely caused by the indices in batch_x and batch_y not matching, e.g., batch_x.max() != batch_y.max(), which is not allowed.
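
To make the constraint concrete, here is a minimal sketch (the tensor sizes are invented for illustration, and it assumes torch_cluster is installed): when knn is given a bipartite pair (x, y), both batch vectors have to describe the same set of examples.

import torch
from torch_cluster import knn

# Toy bipartite point sets.
x = torch.randn(10, 3)                       # source points
y = torch.randn(8, 3)                        # target points
batch_x = torch.tensor([0] * 5 + [1] * 5)    # x covers examples 0 and 1
batch_y = torch.tensor([0] * 4 + [1] * 4)    # y must cover the same examples

# If a replica ends up with batch_y covering fewer examples than batch_x
# (batch_x.max() != batch_y.max()), the CUDA kernel aborts with the
# "Input mismatch" assert shown in the traceback above.
assert int(batch_x.max()) == int(batch_y.max())

edge_index = knn(x, y, k=3, batch_x=batch_x, batch_y=batch_y)
print(edge_index.shape)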

dr-stringfellow commented 3 years ago

Why would it work without the parallelization then? If I only run on one GPU, it runs over the entire dataset without any problems.

rusty1s commented 3 years ago

This seems to be related to data shuffling. You might want to make sure that each example consists of at least one node before calling knn.
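
One possible way to follow that advice, sketched with hypothetical attribute names (x_clus / x_pfc mirror the two point sets in puma_v1_train.py; substitute whatever the real dataset uses):

import torch
from torch_geometric.data import Data

def is_valid(data):
    # Keep only examples in which both point sets are non-empty, so every
    # DataParallel replica hands knn two populated node sets.
    return data.x_clus.size(0) > 0 and data.x_pfc.size(0) > 0

# Toy list of examples; the second one has an empty x_pfc and gets dropped.
examples = [
    Data(x_clus=torch.randn(4, 3), x_pfc=torch.randn(6, 3)),
    Data(x_clus=torch.randn(4, 3), x_pfc=torch.empty(0, 3)),
]
examples = [d for d in examples if is_valid(d)]
print(len(examples))  # 1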