hkim716 opened this issue 4 years ago
I have a similar error when running `GATConv` models (see below). This is not an issue with the model, as this happens even when I run the example `GATConv` model.
```
Traceback (most recent call last):
  File "<Removed_Path>/src/main.py", line 285, in <module>
    Train_GraphSAGE()
  File "<Removed_Path>/src/main.py", line 164, in Train_GraphSAGE
    out = Embedding_Model(GraphData.x, GraphData.edge_index)
  File "~/anaconda2/envs/pygeometric/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "<Removed_Path>/src/models.py", line 21, in forward
    x = self.conv1(x, EdgeIndex)
  File "~/anaconda2/envs/pygeometric/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "~/anaconda2/envs/pygeometric/lib/python3.6/site-packages/torch_geometric/nn/conv/gat_conv.py", line 96, in forward
    return_attention_weights=return_attention_weights)
  File "~/anaconda2/envs/pygeometric/lib/python3.6/site-packages/torch_geometric/nn/conv/message_passing.py", line 263, in propagate
    out = self.message(**msg_kwargs)
  File "~/anaconda2/envs/pygeometric/lib/python3.6/site-packages/torch_geometric/nn/conv/gat_conv.py", line 120, in message
    alpha = softmax(alpha, edge_index_i, size_i)
  File "~/anaconda2/envs/pygeometric/lib/python3.6/site-packages/torch_geometric/utils/softmax.py", line 23, in softmax
    out = src - scatter_max(src, index, dim=0, dim_size=num_nodes)[0][index]
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
  File "~/anaconda2/envs/pygeometric/lib/python3.6/site-packages/torch_scatter/scatter.py", line 72, in scatter_max
            dim_size: Optional[int] = None
            ) -> Tuple[torch.Tensor, torch.Tensor]:
        return torch.ops.torch_scatter.scatter_max(src, index, dim, out, dim_size)
               ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
RuntimeError: CUDA error: a PTX JIT compilation failed
```
I did a little bit of investigation, and it seems that if the binary you are running is not compatible with the hardware architecture, PTX JIT attempts to mitigate that (source). However, I can successfully train `SAGEConv` models, so I'm not exactly sure what the actual cause might be.
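A quick way to test the binary/architecture mismatch theory is to compare the GPU's compute capability against the CUDA architectures the installed PyTorch build was compiled for (this is only a diagnostic sketch and only covers PyTorch itself, not the `torch-scatter` wheels):

```python
import torch

# If the device's sm_XY is missing from the compiled arch list, the driver
# falls back to PTX JIT compilation, which is where the error above surfaces.
if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability(0)
    print(f"device: sm_{major}{minor}")
    print("compiled for:", torch.cuda.get_arch_list())
else:
    print("CUDA not available")
```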
Hi everyone, and thanks for linking to the PyTorch thread. I'm not yet sure how to fix that on the PyG side, but you have some options yourself that should remove this error:

1. removing the `@torch.jit.script` calls in `torch-scatter` and `torch-sparse`, or
2. setting the `PYTORCH_JIT=0` environment variable, see here.

Please let me know if that fixes your issues.
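Note that `PYTORCH_JIT=0` has to be set before the Python process imports torch, so it is simplest to set it on the command line (the script path here is illustrative):

```shell
# Disable TorchScript compilation for this run; per the suggestion above,
# the extension ops then run without the TorchScript interpreter.
PYTORCH_JIT=0 python src/main.py
```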
@rusty1s I tried both approaches to no avail. I even tried setting `@torch.jit.ignore` on the corresponding function, and the same issue is still there. I'm at a loss as to what might be causing this.
How did you perform (1)?
@rusty1s I commented out all the `@torch.jit.script` calls (just the decorator). I was under the impression that this would prevent the function body from being compiled into a `ScriptFunction`.
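For reference, a minimal sketch (not the actual `torch-scatter` code) of what removing the decorator changes: with `@torch.jit.script` the function becomes a compiled `ScriptFunction`; without it, it stays an ordinary Python function.

```python
import torch

@torch.jit.script
def double_scripted(x: torch.Tensor) -> torch.Tensor:
    # Compiled by TorchScript at definition time.
    return x * 2.0

def double_plain(x: torch.Tensor) -> torch.Tensor:
    # Same body, but executed as plain eager Python.
    return x * 2.0

print(type(double_scripted))  # a torch.jit.ScriptFunction
print(type(double_plain))     # a plain Python function
```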
That is super weird. Does the error occur for `scatter_max` this time, or for different functionality? You may need to remove the calls in `torch-cluster` and `torch-spline-conv` too if you have those libraries installed.
@rusty1s It's the same error (I know, it's weird!). I have those two libraries installed as well. I'll report back after disabling all the `@torch.jit.script` decorators.
@rusty1s Well, it's exactly the same error again. I'm going to try this on another server and see what happens.
@rusty1s Sorry for the late update. The GATConv code works on the other server, despite it having the same CUDA, PyTorch, and torch-geometric versions as the server I initially tried it on. Unfortunately, I have no clue what went wrong.
Yeah, that might be hardware-dependent. I'm not yet sure how to fix it :(

Edit: What I forgot: there are also some `torch.jit.script` ops in PyG itself which might cause trouble. You can also try removing those :(
@rusty1s Will check that out as well. For what it's worth, GATConv is not working on a Titan X but is working on an RTX 8000.
+1 The problem seems to be related to the Titan and CUDA. Only GAT seems to be affected (SAGE, GCN, CGCN work)
+1 It seems the error occurs in `scatter_max` and `scatter_min`; not sure why this doesn't work on some GPU devices.
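For anyone unfamiliar with the op that fails: `scatter_max` reduces source values into groups given by an index tensor, which is how the softmax in the traceback is stabilized per node. A naive CPU reference for the `dim=0` case (a hypothetical helper for illustration only; the real op is a fused CUDA kernel in `torch-scatter`) looks like:

```python
import torch

def scatter_max_reference(src, index, dim_size):
    # out[i] = max over all src[j] with index[j] == i.
    out = torch.full((dim_size,), float("-inf"))
    for j, i in enumerate(index.tolist()):
        out[i] = torch.maximum(out[i], src[j])
    return out

src = torch.tensor([1.0, 3.0, 2.0, 5.0])
index = torch.tensor([0, 0, 1, 1])
maxes = scatter_max_reference(src, index, dim_size=2)
print(maxes)               # tensor([3., 5.])
print(src - maxes[index])  # the stabilization step from softmax.py above
```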
Seems to be related to https://github.com/rusty1s/pytorch_scatter/issues/225#issuecomment-899623665. Will try to quickly fix this.
I'm trying to run `mnist_nn_conv.py` for the MNISTSuperpixel example, but it gives me an error after the processing is done. The error message is like this:
My questions are:
Please help me.