I will look into it.
Can you show me some example code? My GraphU-Net script, for example, runs just fine. Note that you need to pass coalesced=True if your edge_index is not sorted.
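For reference, a rough sketch of what that could look like, assuming the spspmm(indexA, valueA, indexB, valueB, m, k, n, coalesced=...) signature of torch_sparse (adjust to your installed version):
import torch
from torch_sparse import spspmm

# Toy 3-node graph whose edge_index is deliberately NOT sorted by row.
edge_index = torch.tensor([[2, 0, 1],
                           [0, 1, 2]])
edge_weight = torch.ones(edge_index.size(1))

# Sparse-sparse matrix multiplication A @ A; coalesced=True asks spspmm to
# sort/deduplicate the inputs first.
out_index, out_weight = spspmm(edge_index, edge_weight,
                               edge_index, edge_weight,
                               3, 3, 3, coalesced=True)
print(out_index, out_weight)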
I have the same problem it seems. @rusty1s Where exactly do I have to pass coalesced=True?
That was meant for the spspmm call inside the Graph-U-Net model, but it's actually not needed since augment_adj already takes care of that.
Do you have a similar problem?
Yes. I get exactly the same error with GraphU-Net, and I have no idea why...
Can you do me a favor and run the torch-sparse test suite to see if that succeeds?
The tests run (apart from the ...metis ones). After running "python setup.py test" in the torch_sparse GitHub project, with the anaconda environment where torch_sparse was already installed activated, the above error first seemed to have disappeared, but now I have it again. On CPU it runs without errors.
Do you have any insights on why it re-occurs? Does the following work for you?
from torch_sparse import SparseTensor
from torch_geometric.datasets import Planetoid
data = Planetoid('/tmp/Planetoid', 'Cora')[0]
row, col = data.edge_index.cuda()
adj = SparseTensor(row=row, col=col)
out = adj @ adj
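If that works, one quick follow-up check (reusing the variables from the snippet above; purely a debugging sketch) is to make sure the indices of adj @ adj stay below the number of nodes, which is exactly what goes wrong later in this thread:
row2, col2, _ = out.coo()  # SparseTensor.coo() returns (row, col, value)
num_nodes = data.num_nodes
# If spspmm misbehaves, one of these maxima exceeds the node count.
assert int(row2.max()) < num_nodes and int(col2.max()) < num_nodes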
No, I have no idea. But the code you posted runs without issues.
OK, so the error is actually not the same anymore (but it seems related?). I guess it changed after running the sparse tests, including the build. Now it occurs in the pooling after augmenting the adjacency, but the latter seems to be the actual problem.
Traceback (most recent call last):
...
File ".../ogb_examples/graphproppred/code/gnn2.py", line 528, in compute_message_layers
return self.unet(feat, edge_index, batch)
File "/opt/anaconda3/envs/dagnn/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/opt/anaconda3/envs/dagnn/lib/python3.7/site-packages/torch_geometric/nn/models/graph_unet.py", line 86, in forward
x, edge_index, edge_weight, batch)
File "/opt/anaconda3/envs/dagnn/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/opt/anaconda3/envs/dagnn/lib/python3.7/site-packages/torch_geometric/nn/pool/topk_pool.py", line 158, in forward
num_nodes=score.size(0))
File "/opt/anaconda3/envs/dagnn/lib/python3.7/site-packages/torch_geometric/nn/pool/topk_pool.py", line 60, in filter_adj
row, col = row[mask], col[mask]
RuntimeError: copy_if failed to synchronize: cudaErrorAssert: device-side assert triggered
Looking into the shapes in graph_unet with
for i in range(1, self.depth + 1):
print(i,x.shape,edge_index.shape,max(edge_index[0]),max(edge_index[1]))
edge_index, edge_weight = self.augment_adj(edge_index, edge_weight,
x.size(0))
print(i,x.shape,edge_index.shape,max(edge_index[0]),max(edge_index[1]))
x, edge_index, edge_weight, batch, perm, _ = self.pools[i - 1](
x, edge_index, edge_weight, batch)
I get
1 torch.Size([22225, 300]) torch.Size([2, 61178]) tensor(22224, device='cuda:0') tensor(22224, device='cuda:0')
1 torch.Size([22225, 300]) torch.Size([2, 197014]) tensor(22224, device='cuda:0') tensor(493906172, device='cuda:0')
Then the numbers in edge_index are too large in filter_adj in topk_pool l58 (in the message above it's l60 because I added prints). I hope that helps. I don't think I can resolve it alone since I am not fully sure about the meaning/internals of spspmm in augment_adj. It runs if I comment out the augment_adj call.
Yeah, it looks like the spspmm call does not work for you since it does not compute edge_index[1] correctly. Its max value should also not exceed 22224. Not sure how to fix it though :(
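One way to surface this earlier than the opaque device-side assert, if you are already instrumenting graph_unet.py locally as in the snippet above, is a plain range check right after the augmentation (just a debugging sketch):
edge_index, edge_weight = self.augment_adj(edge_index, edge_weight, x.size(0))
# Fail fast with a readable message instead of a later cudaErrorAssert:
assert int(edge_index.max()) < x.size(0), \
    f'augment_adj produced node index {int(edge_index.max())} >= num_nodes={x.size(0)}'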
I am using ogbg-code. The example code for that dataset adds two types of edges to the graph in utils.augment_edge, so we might have several edges between two nodes. I tried to pass coalesced=True as an argument to spspmm in graph_unet.augment_adj, but the error is still the same; it seems that spspmm interprets the coalesced argument as "sorted". After I added the coalesce call below at the beginning of graph_unet.forward (after the initialization of the edge weights), it runs for 74/143 epochs and then the error comes again. If I add it in graph_unet.augment_adj instead, the training runs through, but I get the same error during evaluation in remove_self_loops because the mask does not fit edge_attr[mask].
Just as an update...
edge_index, edge_weight = coalesce(edge_index, edge_weight, x.shape[0], x.shape[0])
I now have this also with ASAPool :(
@vthost @rusty1s Hi, I also ran into this error when using my own dataset to train Graph-UNet. The error occurred randomly when using the GPU but never when using the CPU. I changed the augment_adj function by adding a remove_self_loops call at the beginning, and the problem was solved, but I don't know why.
def augment_adj(self, edge_index, edge_weight, num_nodes):
    # Drop any pre-existing self-loops before adding fresh ones.
    edge_index, edge_weight = remove_self_loops(edge_index, edge_weight)
    edge_index, edge_weight = add_self_loops(edge_index, edge_weight, num_nodes=num_nodes)
    # Sort the indices so that spspmm receives coalesced input.
    edge_index, edge_weight = sort_edge_index(edge_index, edge_weight, num_nodes)
    # A @ A: two-hop connectivity via sparse-sparse matrix multiplication.
    edge_index, edge_weight = spspmm(edge_index, edge_weight, edge_index, edge_weight, num_nodes, num_nodes, num_nodes)
    edge_index, edge_weight = remove_self_loops(edge_index, edge_weight)
    return edge_index, edge_weight
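In case it helps others until the fix is released, one way to apply it without editing the installed package could be to subclass GraphUNet and override augment_adj; a sketch along those lines (the channel sizes and depth are illustrative, not from this thread):
from torch_geometric.nn import GraphUNet
from torch_geometric.utils import add_self_loops, remove_self_loops, sort_edge_index
from torch_sparse import spspmm

class PatchedGraphUNet(GraphUNet):
    def augment_adj(self, edge_index, edge_weight, num_nodes):
        # Same idea as above: drop existing self-loops before adding fresh ones
        # and sort the indices so spspmm sees coalesced input.
        edge_index, edge_weight = remove_self_loops(edge_index, edge_weight)
        edge_index, edge_weight = add_self_loops(edge_index, edge_weight,
                                                 num_nodes=num_nodes)
        edge_index, edge_weight = sort_edge_index(edge_index, edge_weight, num_nodes)
        edge_index, edge_weight = spspmm(edge_index, edge_weight, edge_index,
                                         edge_weight, num_nodes, num_nodes, num_nodes)
        edge_index, edge_weight = remove_self_loops(edge_index, edge_weight)
        return edge_index, edge_weight

model = PatchedGraphUNet(in_channels=300, hidden_channels=128, out_channels=128, depth=3)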
The error seems to stem from the fact that cuSPARSE cannot handle duplicated edges in edge_index: it fails to compute the correct number of output edges in that case. In your case, it might well be that you have some initial self-loop edges in your graph, which should be removed before calling add_self_loops. I think your fix for augment_adj is correct, and I added it to the GraphUNet model in PyG.
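To make the duplicated-edge point concrete, a tiny illustrative example (not taken from this thread) showing how coalescing merges parallel edges before they reach cuSPARSE:
import torch
from torch_sparse import coalesce

# Edge (0, 1) appears twice, e.g. because two edge types were merged.
edge_index = torch.tensor([[0, 0, 1],
                           [1, 1, 2]])
edge_weight = torch.ones(edge_index.size(1))

# coalesce sorts the indices and sums the weights of duplicate edges.
edge_index, edge_weight = coalesce(edge_index, edge_weight, m=3, n=3)
print(edge_index)   # tensor([[0, 1], [1, 2]])
print(edge_weight)  # tensor([2., 1.])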
OK, @rusty1s Thx a lot!
I have duplicate edges but no self-loops. For ASAPool, calling coalesce at the beginning of forward seems to resolve the above error, but then I get a CUDA out-of-memory error after a few epochs. Since the latter occurs even with batch size 2, I am not sure whether that's due to the model or connected to the above?
I don't think that's related to the above issue. You may have a memory leak somewhere, or one of the graphs in your dataset is so large that it cannot be handled in a full-batch fashion.
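If you want to rule out the single-huge-graph explanation, a quick scan of the dataset can help; a sketch, assuming the OGB loader for the ogbg-code dataset mentioned above (substitute your own dataset):
from ogb.graphproppred import PygGraphPropPredDataset

dataset = PygGraphPropPredDataset(name='ogbg-code')
# Largest graph in the dataset; full-batch memory scales with these numbers.
print(max(data.num_nodes for data in dataset))
print(max(data.num_edges for data in dataset))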
This issue had no activity for 6 months. It will be closed in 2 weeks unless there is some new activity. Is this issue already resolved?
Similar to @vthost, I had this issue (with spspmm_sum) when using the ASAPool layer. The solution was to call edge_index = tg.utils.coalesce(edge_index, is_sorted=False) before passing edge_index through the ASAPool layer.
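For completeness, a small sketch of that workaround; ASAPooling's forward is assumed to return x, edge_index, edge_weight, batch, perm (as in current PyG), and the sizes are illustrative:
import torch
import torch_geometric as tg
from torch_geometric.nn import ASAPooling

x = torch.randn(10, 16)
edge_index = torch.randint(0, 10, (2, 40))                    # may contain duplicate edges
edge_index = tg.utils.coalesce(edge_index, is_sorted=False)   # deduplicate + sort first

pool = ASAPooling(in_channels=16, ratio=0.5)
x, edge_index, edge_weight, batch, perm = pool(x, edge_index)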
Hi, I'm integrating GraphU-Net and other models on Google Colab, but there are some bugs. Could you help me? Thanks.