rusty1s / pytorch_sparse

PyTorch Extension Library of Optimized Autograd Sparse Matrix Operations
MIT License

RuntimeError: CUDA error: an illegal memory access was encountered #50

Closed quqxui closed 3 years ago

quqxui commented 4 years ago
  File "examples/sem_seg_sparse/train.py", line 142, in <module>
    main()
  File "examples/sem_seg_sparse/train.py", line 61, in main
    train(model, train_loader, optimizer, scheduler, criterion, opt)
  File "examples/sem_seg_sparse/train.py", line 79, in train
    out = model(data)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/content/drive/My Drive/deep_gcns_torch/examples/sem_seg_sparse/architecture.py", line 69, in forward
    feats.append(self.gunet(feats[-1],edge_index=edge_index ,batch=batch))
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch_geometric/nn/models/graph_unet.py", line 83, in forward
    x.size(0))
  File "/usr/local/lib/python3.6/dist-packages/torch_geometric/nn/models/graph_unet.py", line 120, in augment_adj
    num_nodes)
  File "/usr/local/lib/python3.6/dist-packages/torch_sparse/spspmm.py", line 30, in spspmm
    C = matmul(A, B)
  File "/usr/local/lib/python3.6/dist-packages/torch_sparse/matmul.py", line 107, in matmul
    return spspmm(src, other, reduce)
  File "/usr/local/lib/python3.6/dist-packages/torch_sparse/matmul.py", line 95, in spspmm
    return spspmm_sum(src, other)
  File "/usr/local/lib/python3.6/dist-packages/torch_sparse/matmul.py", line 83, in spspmm_sum
    rowptrA, colA, valueA, rowptrB, colB, valueB, K)
RuntimeError: CUDA error: an illegal memory access was encountered (launch_kernel at /pytorch/aten/src/ATen/native/cuda/Loops.cuh:103)

Hi, I'm integrating the Graph U-Net and other models on Google Colab, but I ran into this bug. Could you help me? Thanks.

rusty1s commented 4 years ago

I will look into it.

rusty1s commented 4 years ago

Can you show me some example code? My GraphU-Net script, for example, runs just fine. Note that you need to pass coalesced=True if your edge_index is not sorted.
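
For reference, a minimal sketch of the functional torch_sparse.spspmm call with that flag; the tiny 3x3 matrices here are purely illustrative:

import torch
from torch_sparse import spspmm

# Two small sparse 3x3 matrices in COO form (move the tensors to .cuda()
# to exercise the GPU path this issue is about).
indexA = torch.tensor([[0, 0, 1, 2], [1, 2, 0, 1]])
valueA = torch.ones(indexA.size(1))
# coalesced=True makes spspmm sort and deduplicate both inputs first,
# which is required whenever edge_index is not already sorted.
indexC, valueC = spspmm(indexA, valueA, indexA, valueA, 3, 3, 3, coalesced=True)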

vthost commented 4 years ago

I have the same problem it seems. @rusty1s Where exactly do I have to pass coalesced=True?

rusty1s commented 4 years ago

That was meant for the spspmm call inside the Graph-U-Net model, but it's actually not needed since augment_adj already takes care of that.

Do you have a similar problem?

vthost commented 4 years ago

Yes. I get exactly the same error with GraphU-Net, and I have no idea why...

rusty1s commented 4 years ago

Can you do me a favor and run the torch-sparse test suite and see if that succeeds?

vthost commented 4 years ago

The tests run (apart from the ...metis ones). After running "python setup.py test" in the torch_sparse GitHub project, with the Anaconda environment in which torch_sparse was already installed activated, the above error at first seemed to have disappeared, but now I have it again. On CPU it runs without errors.

rusty1s commented 4 years ago

Do you have any insights on why it re-occurs? Does the following work for you?

from torch_sparse import SparseTensor
from torch_geometric.datasets import Planetoid

data = Planetoid('/tmp/Planetoid', 'Cora')[0]
row, col = data.edge_index.cuda()
adj = SparseTensor(row=row, col=col)
out = adj @ adj  # sparse-sparse matmul on the GPU

vthost commented 4 years ago

No, I have no idea. But the code you posted runs without issues.

vthost commented 4 years ago

OK, so the error is actually not the same anymore (but it seems related?). I guess it changed after running the sparse tests, including the build. Now it occurs in the pooling step after augmenting the adjacency, but the latter seems to be the actual problem.

Traceback (most recent call last):
  ...
  File ".../ogb_examples/graphproppred/code/gnn2.py", line 528, in compute_message_layers
    return self.unet(feat, edge_index, batch)
  File "/opt/anaconda3/envs/dagnn/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/anaconda3/envs/dagnn/lib/python3.7/site-packages/torch_geometric/nn/models/graph_unet.py", line 86, in forward
    x, edge_index, edge_weight, batch)
  File "/opt/anaconda3/envs/dagnn/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/anaconda3/envs/dagnn/lib/python3.7/site-packages/torch_geometric/nn/pool/topk_pool.py", line 158, in forward
    num_nodes=score.size(0))
  File "/opt/anaconda3/envs/dagnn/lib/python3.7/site-packages/torch_geometric/nn/pool/topk_pool.py", line 60, in filter_adj
    row, col = row[mask], col[mask]
RuntimeError: copy_if failed to synchronize: cudaErrorAssert: device-side assert triggered

Looking into the shapes in graph_unet with

for i in range(1, self.depth + 1):
    print(i, x.shape, edge_index.shape, max(edge_index[0]), max(edge_index[1]))
    edge_index, edge_weight = self.augment_adj(edge_index, edge_weight,
                                               x.size(0))
    print(i, x.shape, edge_index.shape, max(edge_index[0]), max(edge_index[1]))
    x, edge_index, edge_weight, batch, perm, _ = self.pools[i - 1](
        x, edge_index, edge_weight, batch)

I get

1 torch.Size([22225, 300]) torch.Size([2, 61178]) tensor(22224, device='cuda:0') tensor(22224, device='cuda:0')
1 torch.Size([22225, 300]) torch.Size([2, 197014]) tensor(22224, device='cuda:0') tensor(493906172, device='cuda:0')

Then the values in edge_index are too large in filter_adj in topk_pool, line 58 (in the traceback it's line 60 because I added prints). I hope that helps. I don't think I can resolve this alone, since I am not fully sure about the meaning/internals of spspmm in augment_adj. It runs if I comment out the augment_adj call.
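
As an aside, a small sanity check like the following (a sketch, not part of the original code; check_edge_index is just an illustrative name) makes this kind of out-of-range index easy to catch right after augment_adj:

def check_edge_index(edge_index, num_nodes):
    # After augment_adj, every node index must still lie in [0, num_nodes).
    assert edge_index.min() >= 0
    assert edge_index.max() < num_nodes, (
        f'edge_index contains node {int(edge_index.max())} '
        f'but the graph only has {num_nodes} nodes')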

rusty1s commented 4 years ago

Yeah, it looks like the spspmm call does not work for you, since it does not compute edge_index[1] correctly. Its max value should also not exceed 22224. Not sure how to fix it though :(

vthost commented 4 years ago

I am using ogbg-code. The example code for that dataset adds two types of edges to the graph in utils.augment_edge, so there can be several edges between two nodes. I tried passing coalesced=True to spspmm in graph_unet.augment_adj, but the error stays the same; it seems that spspmm interprets the coalesced argument as "sorted". Just as an update: after adding the following at the beginning of graph_unet.forward (after the initialization of the edge weights), training runs for 74/143 epochs and then the error comes back. If I add it in graph_unet.augment_adj instead, training runs through, but I get the same error during evaluation in remove_self_loops because the mask does not fit edge_attr[mask].

edge_index, edge_weight = coalesce(edge_index, edge_weight, x.shape[0], x.shape[0])
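
To make the placement concrete, here is a minimal sketch of such a forward preamble, assuming the functional torch_sparse.coalesce helper; the surrounding GraphUNet code is paraphrased, not the exact PyG source:

from torch_sparse import coalesce

def forward(self, x, edge_index, batch=None):
    # Deduplicate edge_index (summing weights of duplicate edges) right after
    # the edge weights are initialized, before augment_adj / pooling use it.
    edge_weight = x.new_ones(edge_index.size(1))
    num_nodes = x.size(0)
    edge_index, edge_weight = coalesce(edge_index, edge_weight,
                                       num_nodes, num_nodes)
    # ... remainder of the GraphUNet forward pass ...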

vthost commented 4 years ago

I now have this also with ASAPool :(

[Screenshot of the ASAPool error omitted]

Flawless1202 commented 4 years ago

@vthost @rusty1s Hi, I also ran into this error when using my own dataset to train Graph-UNet. The error occurred randomly when using the GPU but never when using the CPU. I changed the augment_adj function to call remove_self_loops first, and that solved the problem, though I don't know why.

def augment_adj(self, edge_index, edge_weight, num_nodes):
    # Drop existing self-loops first so add_self_loops does not create duplicates.
    edge_index, edge_weight = remove_self_loops(edge_index, edge_weight)
    edge_index, edge_weight = add_self_loops(edge_index, edge_weight, num_nodes=num_nodes)
    # spspmm expects a sorted (coalesced) edge_index.
    edge_index, edge_weight = sort_edge_index(edge_index, edge_weight, num_nodes)
    # Square the adjacency to connect two-hop neighbors.
    edge_index, edge_weight = spspmm(edge_index, edge_weight, edge_index, edge_weight, num_nodes, num_nodes, num_nodes)
    edge_index, edge_weight = remove_self_loops(edge_index, edge_weight)
    return edge_index, edge_weight

rusty1s commented 4 years ago

The error seems to stem from the fact that cuSPARSE cannot handle duplicated edges in edge_index: it fails to compute the correct number of output edges in that case. In your case, it may well be that your graph contains some initial self-loop edges, which should be removed before calling add_self_loops. I think your fix for augment_adj is correct, and I have added it to the GraphUNet model in PyG.
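
As a quick diagnostic, one could compare the edge count before and after coalescing to detect duplicate edges; a minimal sketch, assuming the functional torch_sparse.coalesce helper (has_duplicate_edges is just an illustrative name):

import torch
from torch_sparse import coalesce

def has_duplicate_edges(edge_index: torch.Tensor, num_nodes: int) -> bool:
    # coalesce merges duplicate (row, col) pairs, so a shrinking edge count
    # means the original edge_index contained duplicates.
    value = torch.ones(edge_index.size(1), device=edge_index.device)
    index, _ = coalesce(edge_index, value, num_nodes, num_nodes)
    return index.size(1) < edge_index.size(1)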

Flawless1202 commented 4 years ago

OK, @rusty1s Thx a lot!

vthost commented 4 years ago

I have duplicate edges but no self-loops. For ASAPool, calling coalesce at the beginning of forward seems to fix the above error. But then I get a CUDA out-of-memory error after a few epochs. Since that occurs even with batch size 2, I am not sure whether it is due to the model or connected to the above?

rusty1s commented 4 years ago

I don't think that's related to the above issue. You may have a memory leak somewhere, or one of the graphs in your dataset is too large to be handled in a full-batch fashion.

github-actions[bot] commented 3 years ago

This issue had no activity for 6 months. It will be closed in 2 weeks unless there is some new activity. Is this issue already resolved?

andreimargeloiu commented 1 year ago

Similar to @vthost, I had this issue (with spspmm_sum) when using the ASAPool layer.

The solution was to call edge_index = tg.utils.coalesce(edge_index, is_sorted=False) before passing edge_index through the ASAPool layer.
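
For context, a minimal sketch of that workaround, assuming torch_geometric >= 2.0 (which provides torch_geometric.utils.coalesce) and an ASAPooling layer whose hyperparameters here are purely illustrative:

import torch_geometric as tg
from torch_geometric.nn import ASAPooling

pool = ASAPooling(in_channels=64, ratio=0.5)

def pool_step(x, edge_index, batch):
    # Sort and deduplicate edge_index before the pooling layer, which relies
    # on sparse-sparse matmul that assumes there are no duplicate edges.
    edge_index = tg.utils.coalesce(edge_index, is_sorted=False)
    return pool(x, edge_index, batch=batch)  # (x, edge_index, edge_weight, batch, perm)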