pyg-team / pytorch_geometric

Graph Neural Network Library for PyTorch
https://pyg.org
MIT License

Runtime error after Processing... #1576

Open hkim716 opened 4 years ago

hkim716 commented 4 years ago

I'm trying to run mnist_nn_conv.py, the MNISTSuperpixels example, but it raises an error once processing has finished. The error message is:

Processing...
Done!
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-2-e30f7779dce8> in <module>
     96 
     97 for epoch in range(1, 31):
---> 98     train(epoch)
     99     test_acc = test()
    100     print('Epoch: {:02d}, Test: {:.4f}'.format(epoch, test_acc))

<ipython-input-2-e30f7779dce8> in train(epoch)
     80         data = data.to(device)
     81         optimizer.zero_grad()
---> 82         F.nll_loss(model(data), data.y).backward()
     83         optimizer.step()
     84 

~/miniconda3/envs/torch/lib/python3.7/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
    720             result = self._slow_forward(*input, **kwargs)
    721         else:
--> 722             result = self.forward(*input, **kwargs)
    723         for hook in itertools.chain(
    724                 _global_forward_hooks.values(),

<ipython-input-2-e30f7779dce8> in forward(self, data)
     46         data.x = F.elu(self.conv1(data.x, data.edge_index, data.edge_attr))
     47         weight = normalized_cut_2d(data.edge_index, data.pos)
---> 48         cluster = graclus(data.edge_index, weight, data.x.size(0))
     49         data.edge_attr = None
     50         data = max_pool(cluster, data, transform=transform)

~/miniconda3/envs/torch/lib/python3.7/site-packages/torch_geometric/nn/pool/graclus.py in graclus(edge_index, weight, num_nodes)
     33         raise ImportError('`graclus` requires `torch-cluster`.')
     34 
---> 35     return graclus_cluster(edge_index[0], edge_index[1], weight, num_nodes)

RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
  File "/home/hkimlx/miniconda3/envs/torch/lib/python3.7/site-packages/torch_cluster/graclus.py", line 59, in graclus_cluster
    torch.cumsum(deg, 0, out=rowptr[1:])

    return torch.ops.torch_cluster.graclus(rowptr, col, weight)
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
RuntimeError: CUDA error: PTX JIT compiler library not found

My questions are:

  1. What do the "Processing..." and "Done!" messages mean?
  2. Where did the CUDA error come from? I don't understand what the PTX JIT library is.

Please help me.

tharindurmt commented 4 years ago

I have a similar error when running GATConv models (see below). This is not an issue with the model as this happens even when I run the example GATConv model.

Traceback (most recent call last):
  File "<Removed_Path>/src/main.py", line 285, in <module>
    Train_GraphSAGE()
  File "<Removed_Path>/src/main.py", line 164, in Train_GraphSAGE
    out = Embedding_Model(GraphData.x, GraphData.edge_index)
  File "~/anaconda2/envs/pygeometric/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "<Removed_Path>/src/models.py", line 21, in forward
    x = self.conv1(x, EdgeIndex)
  File "~/anaconda2/envs/pygeometric/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "~/anaconda2/envs/pygeometric/lib/python3.6/site-packages/torch_geometric/nn/conv/gat_conv.py", line 96, in forward
    return_attention_weights=return_attention_weights)
  File "~/anaconda2/envs/pygeometric/lib/python3.6/site-packages/torch_geometric/nn/conv/message_passing.py", line 263, in propagate
    out = self.message(**msg_kwargs)
  File "~/anaconda2/envs/pygeometric/lib/python3.6/site-packages/torch_geometric/nn/conv/gat_conv.py", line 120, in message
    alpha = softmax(alpha, edge_index_i, size_i)
  File "~/anaconda2/envs/pygeometric/lib/python3.6/site-packages/torch_geometric/utils/softmax.py", line 23, in softmax
    out = src - scatter_max(src, index, dim=0, dim_size=num_nodes)[0][index]
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
  File "~/anaconda2/envs/pygeometric/lib/python3.6/site-packages/torch_scatter/scatter.py", line 72, in scatter_max
                dim_size: Optional[int] = None
                ) -> Tuple[torch.Tensor, torch.Tensor]:
    return torch.ops.torch_scatter.scatter_max(src, index, dim, out, dim_size)
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
RuntimeError: CUDA error: a PTX JIT compilation failed

I did a little investigation, and it seems that when the binary you are running is not compatible with the hardware architecture, the PTX JIT compiler tries to compensate by compiling for the GPU at run time (source). However, I can successfully train SAGEConv models, so I'm not sure what the actual cause is.
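One way to test this hypothesis is to compare the GPU's compute capability with the architectures the binaries were built for. The sketch below is only an illustration: `classify_gpu_support` is a hypothetical helper, and it assumes the usual CUDA compatibility rules (an `sm_XY` binary runs on devices with the same major version and an equal or higher minor version, while `compute_XY` PTX can be JIT-compiled for any equal or newer device). In a live session the inputs could come from `torch.cuda.get_device_capability()` and `torch.cuda.get_arch_list()`; note that these describe the PyTorch build, and the extension packages (torch-scatter, torch-cluster) may have been compiled for a different set of architectures.

```python
import re
from typing import Iterable, Tuple

# Matches entries such as "sm_75" or "compute_86"; the last digit is the minor version.
_ARCH_RE = re.compile(r"^(sm|compute)_(\d+)(\d)[a-z]?$")


def classify_gpu_support(capability: Tuple[int, int], arch_list: Iterable[str]) -> str:
    """Return 'native', 'ptx-jit' or 'unsupported' for a device capability.

    Hypothetical helper, assuming the standard CUDA compatibility rules.
    """
    major, minor = capability
    can_jit = False
    for arch in arch_list:
        match = _ARCH_RE.match(arch)
        if match is None:
            continue  # skip entries this sketch does not understand
        kind = match.group(1)
        a_major, a_minor = int(match.group(2)), int(match.group(3))
        # Precompiled binary: same major version, minor not newer than the device.
        if kind == "sm" and a_major == major and a_minor <= minor:
            return "native"
        # Embedded PTX: usable on any device at least as new as the PTX target.
        if kind == "compute" and (a_major, a_minor) <= (major, minor):
            can_jit = True
    return "ptx-jit" if can_jit else "unsupported"


# Made-up inputs for illustration only:
print(classify_gpu_support((8, 6), ["sm_70", "compute_70"]))
```

A result of `ptx-jit` would mean the driver has to compile PTX at run time, which is the code path the error messages in this thread point to.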

rusty1s commented 4 years ago

Hi everyone, and thanks for linking to the PyTorch thread. I'm not yet sure how to fix this on the PyG side, but there are some options you can try yourself that should remove this error:

  1. Remove all @torch.jit.script calls in torch-scatter and torch-sparse, or
  2. Disable PyTorch JIT mode via the PYTORCH_JIT=0 environment variable (see here)

Please let me know if that fixes your issues.
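For option 2, the variable should be set before `torch` is imported, since (as far as I know) it is read at import time. A minimal sketch, assuming the training script is launched as a child process; `jit_disabled_env` and `run_without_jit` are hypothetical helpers, and the script name is the one from the original report:

```python
import os
import subprocess
import sys
from typing import Dict, Mapping, Optional


def jit_disabled_env(base: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment with PYTORCH_JIT set to "0"."""
    env = dict(os.environ if base is None else base)
    env["PYTORCH_JIT"] = "0"
    return env


def run_without_jit(script: str) -> int:
    """Run `script` in a child interpreter with PYTORCH_JIT=0; return its exit code."""
    result = subprocess.run([sys.executable, script], env=jit_disabled_env())
    return result.returncode


# Example (not executed here): run_without_jit("mnist_nn_conv.py")
```

The shell equivalent is `PYTORCH_JIT=0 python mnist_nn_conv.py`.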

tharindurmt commented 4 years ago

@rusty1s I tried both approaches to no avail. I even tried setting @torch.jit.ignore on the corresponding function, and the same issue is still there. I'm at a loss as to what might be causing this.

rusty1s commented 4 years ago

How did you perform (1)?

tharindurmt commented 4 years ago

@rusty1s I commented out all the @torch.jit.script calls (just the decorator). I was under the impression that this would prevent the function body from being compiled into a ScriptFunction.

rusty1s commented 4 years ago

That is super weird. Does the error this time occur for scatter_max or a different functionality? You may need to remove the calls in torch-cluster and torch-spline-conv too if you have those libraries installed.
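To confirm that no decorator was missed in any of these packages, the installed sources can be scanned. This is a rough sketch rather than an official tool: `find_jit_script_decorators` is a hypothetical helper, and the package directory is assumed to be looked up by the caller (for example via `os.path.dirname(torch_scatter.__file__)`).

```python
import re
from pathlib import Path
from typing import List, Tuple

# An active decorator line; commented-out lines ("# @torch.jit.script") do not match.
_DECORATOR = re.compile(r"^\s*@torch\.jit\.script\b")


def find_jit_script_decorators(package_dir: str) -> List[Tuple[str, int]]:
    """Return (file path, line number) for every active @torch.jit.script line."""
    hits = []
    for path in sorted(Path(package_dir).rglob("*.py")):
        text = path.read_text(encoding="utf-8", errors="ignore")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if _DECORATOR.match(line):
                hits.append((str(path), lineno))
    return hits
```

An empty list for each of torch-scatter, torch-sparse, torch-cluster and torch-spline-conv would indicate that all decorators in those packages are disabled.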

tharindurmt commented 4 years ago

@rusty1s It's the same error (I know, it's weird!). I have installed those two libraries as well. I'll report back after disabling all the @torch.jit.script decorators.

tharindurmt commented 4 years ago

@rusty1s Well, it's exactly the same error again. I'm going to try this on another server and see what happens.

tharindurmt commented 4 years ago

@rusty1s Sorry for the late update. The GATConv code works on the other server, even though it has the same CUDA, PyTorch and torch-geometric versions as the server I initially tried it on. Unfortunately, I have no clue what went wrong.

rusty1s commented 4 years ago

Yeah, that might be hardware dependent. Not yet sure how to fix :(

Edit: I forgot to mention that there are also some torch.jit.script ops in PyG itself which might cause trouble. You can also try removing those :(

tharindurmt commented 4 years ago

@rusty1s Will check that out as well. For what it's worth, GATConv does not work on the Titan X but does work on the RTX 8000.

netphantom commented 4 years ago

+1 The problem seems to be related to the Titan and CUDA. Only GAT seems to be affected (SAGE, GCN, CGCN work)

LingxiaoShawn commented 3 years ago

+1. It seems the error occurs in scatter_max and scatter_min; I'm not sure why these fail on some GPU devices.
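For context, the GATConv traceback earlier in the thread shows `softmax` subtracting a per-node maximum obtained from `scatter_max`, which may be why attention-based layers reach the failing kernel while other layers do not. Below is a pure-Python sketch of that computation with no GPU involved. `scatter_max_py` and `segment_softmax_py` are illustrative names, not the torch-scatter API, and everything after the max-subtraction step is the standard softmax formula rather than code taken from the traceback.

```python
import math
from typing import List, Sequence


def scatter_max_py(src: Sequence[float], index: Sequence[int], dim_size: int) -> List[float]:
    """Maximum of `src` within each group given by `index` (-inf for empty groups)."""
    out = [-math.inf] * dim_size
    for value, i in zip(src, index):
        out[i] = max(out[i], value)
    return out


def segment_softmax_py(src: Sequence[float], index: Sequence[int], dim_size: int) -> List[float]:
    """Softmax over the entries of `src` that share the same `index` value."""
    group_max = scatter_max_py(src, index, dim_size)
    # Subtracting the group maximum keeps exp() from overflowing.
    exp = [math.exp(v - group_max[i]) for v, i in zip(src, index)]
    denom = [0.0] * dim_size
    for e, i in zip(exp, index):
        denom[i] += e
    return [e / denom[i] for e, i in zip(exp, index)]
```

Each group's outputs sum to 1, and thanks to the max subtraction large inputs such as 1000.0 do not overflow.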

rusty1s commented 3 years ago

Seems to be related to https://github.com/rusty1s/pytorch_scatter/issues/225#issuecomment-899623665. Will try to quickly fix this.