rusty1s / pytorch_cluster

PyTorch Extension Library of Optimized Graph Cluster Algorithms
MIT License

random_walk_cuda is causing an illegal memory access #176

Open ProfDoof opened 1 year ago

ProfDoof commented 1 year ago

Hi,

When running the following code, I get an illegal memory access error with the following graph. I am not sure why and do not understand the algorithm or C++ well enough to track it down. I do not get the error when I set device to 'cpu'.

I'm using the nightly build of pyg installed through a locally built conda package, and version 1.6.1 of PyTorch-cluster.

from torch_geometric.data import Data
from torch_geometric.utils import to_networkx
from networkx.drawing.nx_agraph import write_dot
import torch

new_node_ids = list(range(7))
sources = [
    0, 1, 2, 2, 4, 5,
]

targets = [
    1, 2, 3, 4, 5, 2,
]

data = Data(torch.tensor(new_node_ids), torch.tensor([sources, targets]))
data.num_nodes = 7

write_dot(to_networkx(data), 'test_test.dot')

device = 'cuda'
rowptr, col, perm = data.to(device).csr()
rowptr, col = rowptr[None], col[None]

print(rowptr, col)
start_indices = torch.arange(data.num_nodes, dtype=torch.long, device=device)

# Positional arguments after start_indices: walk_length=10, p=2, q=4
print(torch.ops.torch_cluster.random_walk(rowptr, col, start_indices,
                                          10, 2, 4))

EDIT: Here's the error I get:

tensor([0, 1, 2, 4, 4, 5, 6, 6], device='cuda:0') tensor([1, 2, 3, 4, 5, 2], device='cuda:0')
Traceback (most recent call last):
  File "/home/john/Research/EmbeddingGraphs/cfg2vec/gnn/test.py", line 25, in <module>
    print(torch.ops.torch_cluster.random_walk(rowptr, col, start_indices,
  File "/home/john/mambaforge/envs/gnn/lib/python3.9/site-packages/torch/_ops.py", line 503, in __call__
    return self._op(*args, **kwargs or {})
RuntimeError: CUDA error: an illegal memory access was encountered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

rusty1s commented 1 year ago

This currently seems to fail because node 6 is an isolated node, so setting data.num_nodes = 6 should fix it.
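
To make the isolated-node observation concrete, here is a minimal pure-Python sketch that rebuilds the CSR row pointer from the edge list in the report and lists every node with no outgoing edges. The helper names `build_rowptr` and `nodes_without_out_edges` are hypothetical and not part of torch_cluster. It assumes the source list is already sorted, as it is in the report.

```python
def build_rowptr(sources, num_nodes):
    """Build a CSR row pointer from a sorted list of source node ids."""
    counts = [0] * num_nodes
    for s in sources:
        counts[s] += 1
    rowptr = [0]
    for c in counts:
        rowptr.append(rowptr[-1] + c)
    return rowptr


def nodes_without_out_edges(rowptr):
    """Return node ids whose CSR row is empty (no outgoing edges)."""
    return [v for v in range(len(rowptr) - 1) if rowptr[v] == rowptr[v + 1]]


# Edge sources from the report; num_nodes = 7 as set on the Data object.
sources = [0, 1, 2, 2, 4, 5]
rowptr = build_rowptr(sources, 7)
dead_ends = nodes_without_out_edges(rowptr)
```

The resulting `rowptr` matches the tensor printed in the report. Under this check, node 3 also has an empty row, since it only receives an edge, so node 6 is not the only start node with nowhere to go.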

ProfDoof commented 1 year ago

This is a minimal example; the actual graph is more complicated, and I can't remove the isolated nodes. Also, this doesn't fail for any other values of p or q, and it only happens in the CUDA version, not the CPU version. All that being said, I'm not sure what exactly is going on.
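
Because the failure is CUDA-only, one thing that may be worth ruling out first is malformed input, since out-of-range indices often surface on the GPU as an illegal memory access. The following is a sketch of such a pre-flight check on plain Python lists. `check_csr_inputs` is a hypothetical helper, not a torch_cluster function, and passing it does not rule out a bug in the kernel itself.

```python
def check_csr_inputs(rowptr, col, start, num_nodes):
    """Return a list of human-readable problems found in CSR walk inputs."""
    problems = []
    if len(rowptr) != num_nodes + 1:
        problems.append("rowptr must have num_nodes + 1 entries")
    if rowptr and rowptr[0] != 0:
        problems.append("rowptr must start at 0")
    if any(a > b for a, b in zip(rowptr, rowptr[1:])):
        problems.append("rowptr must be non-decreasing")
    if rowptr and rowptr[-1] != len(col):
        problems.append("rowptr[-1] must equal len(col)")
    if any(not 0 <= c < num_nodes for c in col):
        problems.append("col contains an out-of-range node id")
    if any(not 0 <= s < num_nodes for s in start):
        problems.append("start contains an out-of-range node id")
    return problems


# Values printed in the report, with one walk started from every node.
rowptr = [0, 1, 2, 4, 4, 5, 6, 6]
col = [1, 2, 3, 4, 5, 2]
start = list(range(7))
problems = check_csr_inputs(rowptr, col, start, num_nodes=7)

# A deliberately corrupted column index, to show what a failure looks like.
bad_problems = check_csr_inputs(rowptr, [1, 2, 3, 4, 5, 7], start, num_nodes=7)
```

For the graph in the report, `problems` comes back empty, which is consistent with the input being well-formed and the fault lying elsewhere.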

ProfDoof commented 1 year ago

@rusty1s, just wanted to check whether you've had a chance to look at this yet.

rusty1s commented 1 year ago

Will take a look soon.

jsun57 commented 11 months ago

Wondering if there are any updates on this issue.

rusty1s commented 11 months ago

Not yet, sorry for the delay.

github-actions[bot] commented 5 months ago

This issue had no activity for 6 months. It will be closed in 2 weeks unless there is some new activity. Is this issue already resolved?