Open dr-stringfellow opened 3 years ago
I don't think that has anything to do with multi-GPU support for `torch_cluster.knn`. For example, the following code runs without issues for me:
```python
import torch
from torch_geometric.datasets import MNISTSuperpixels
from torch_geometric.data import DataListLoader
from torch_geometric.nn import DataParallel
from torch_geometric.nn import knn_graph

dataset = MNISTSuperpixels('../data/MNIST')[:64]
loader = DataListLoader(dataset, batch_size=32, shuffle=True)


class Net(torch.nn.Module):
    def __init__(self):
        super(Net, self).__init__()

    def forward(self, data):
        _ = knn_graph(data.x, 6, data.batch)
        return torch.tensor([1], device=data.x.device)


model = Net()
print('Let\'s use', torch.cuda.device_count(), 'GPUs!')
model = DataParallel(model)
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
model = model.to(device)

for data_list in loader:
    output = model(data_list)
```
This has more to do with the fact that the indices in `batch_x` and `batch_y` do not match, e.g., `batch_x.max() != batch_y.max()`, which is not allowed.
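To illustrate that invariant, here is a minimal sketch using plain lists in place of batch tensors; `batches_consistent` is a hypothetical helper for illustration, not part of `torch_cluster`:

```python
def batches_consistent(batch_x, batch_y):
    # torch_cluster.knn expects batch_x and batch_y to cover the same
    # set of example indices, so in particular their maxima must agree.
    # Under DataParallel, a shuffled split can hand one GPU a chunk in
    # which an example has points in x but none in y, breaking this.
    return max(batch_x) == max(batch_y)

# Consistent: both vectors cover examples {0, 1}.
batches_consistent([0, 0, 1, 1], [0, 1])     # True
# Inconsistent: example 1 is missing from batch_y.
batches_consistent([0, 0, 1, 1], [0, 0, 0])  # False -> knn would fail
```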
Why would it work without the parallelization, then? If I run on only one GPU, it goes through the entire dataset without any problems.
This seems to be related to data shuffling. You might want to make sure that each example consists of at least one node before calling `knn`.
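One way to follow that advice is to filter empty graphs out of the data list before it is split across GPUs. A minimal sketch; `SimpleNamespace` stands in for `torch_geometric.data.Data`, whose real `num_nodes` attribute is what matters here:

```python
from types import SimpleNamespace

def drop_empty(data_list):
    # Keep only graphs with at least one node, so that every chunk
    # handed to a GPU produces a valid batch vector for knn.
    return [d for d in data_list if d.num_nodes > 0]

# Stand-in objects for torch_geometric Data instances.
graphs = [SimpleNamespace(num_nodes=5),
          SimpleNamespace(num_nodes=0),
          SimpleNamespace(num_nodes=3)]
kept = drop_empty(graphs)  # the empty graph is removed
```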
🐛 Bug
`torch_cluster` fails in `DynamicEdgeConv` when parallelising over multiple GPUs. Training script with network architecture:
https://github.com/lgray/deepjet-geometric/blob/master/examples/puma_v1_train.py
Error:
To Reproduce
Steps to reproduce the behavior:
```python
if torch.cuda.device_count() > 1:
    puma = nn.DataParallel(puma)
```
Expected behavior
Should work, but doesn't :) It does work on a single GPU.