pyg-team / pytorch_geometric

Graph Neural Network Library for PyTorch
https://pyg.org
MIT License

When I use Transformer_conv, the program reports the "RuntimeError: The following operation failed in the TorchScript interpreter." #7909

Open psp3dcg opened 1 year ago

psp3dcg commented 1 year ago

🐛 Describe the bug

I am trying to train a GNN based on TransformerConv. It works well on an RTX 3090 GPU but fails on an RTX 4090. I also tested other GNN convolutions: GCNConv works fine, but GATConv does not. The problem seems to be the line "alpha = softmax(alpha, index, ptr, size_i)". Could anyone give me some help?

 import torch
 from torch_geometric.nn import TransformerConv

 class GraphModel(torch.nn.Module):
     def __init__(self, args):
         super().__init__()
         self.nhid = args.nhid
         self.TConv = TransformerConv(self.nhid, self.nhid)

     def forward(self, data):
         x, edge_index = data.x, data.edge_index
         x = self.TConv(x, edge_index)
         return x
-----Error Information-----
  File "/opt/conda/lib/python3.8/site-packages/torch_geometric/nn/conv/message_passing.py", line 317, in propagate
    out = self.message(**msg_kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch_geometric/nn/conv/transformer_conv.py", line 216, in message
    alpha = softmax(alpha, index, ptr, size_i)
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
RuntimeError: nvrtc: error: invalid value for --gpu-architecture (-arch)

nvrtc compilation failed:

#define NAN __int_as_float(0x7fffffff)
#define POS_INFINITY __int_as_float(0x7f800000)
#define NEG_INFINITY __int_as_float(0xff800000)

template<typename T>
__device__ T maximum(T a, T b) {
  return isnan(a) ? a : (a > b ? a : b);
}

template<typename T>
__device__ T minimum(T a, T b) {
  return isnan(a) ? a : (a < b ? a : b);
}

extern "C" __global__
void fused_sub_exp(float* tsrc_1, float* tsrc_max_9, float* aten_exp) {
{
if ((long long)(threadIdx.x) + 512ll * (long long)(blockIdx.x)<21518ll ? 1 : 0) {
    float v = __ldg(tsrc_1 + (long long)(threadIdx.x) + 512ll * (long long)(blockIdx.x));
    float v_1 = __ldg(tsrc_max_9 + (long long)(threadIdx.x) + 512ll * (long long)(blockIdx.x));
    aten_exp[(long long)(threadIdx.x) + 512ll * (long long)(blockIdx.x)] = expf(v - v_1);
  }}
}
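The nvrtc message above ("invalid value for --gpu-architecture") usually means the CUDA toolkit bundled with the installed PyTorch build predates the GPU's compute capability (sm_89 for the RTX 4090/4080, which requires CUDA 11.8 or newer), so the TorchScript fuser cannot JIT-compile the fused softmax kernel. A hedged workaround sketch, not taken from this thread: the `torch._C` switches below are internal, undocumented knobs that may change between releases, but they are a commonly suggested way to fall back to unfused eager kernels.

```python
# Sketch of a workaround: skip nvrtc kernel fusion on GPUs that the
# bundled CUDA toolkit may not know about. The torch._C calls are
# internal APIs (assumption: they exist in your PyTorch version).

def needs_cuda_118(capability):
    """Return True if a (major, minor) compute capability requires CUDA >= 11.8.

    Ada Lovelace GPUs (RTX 40xx) report (8, 9) and newer.
    """
    return capability >= (8, 9)

try:
    import torch
    if torch.cuda.is_available() and needs_cuda_118(torch.cuda.get_device_capability()):
        # Disable the tensor-expression fuser so softmax() runs through
        # the eager interpreter instead of an nvrtc-compiled fused kernel.
        torch._C._jit_set_texpr_fuser_enabled(False)
        torch._C._jit_override_can_fuse_on_gpu(False)
except ImportError:
    pass  # torch not installed; nothing to configure
```

The cleaner long-term fix is installing a PyTorch build compiled against CUDA 11.8+, which ships an nvrtc that accepts sm_89.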

Environment

rusty1s commented 1 year ago

Thanks for the issue. What exactly do you mean by failing? I can't find any TorchScript-relevant code in your example. Is this issue related to using TorchScript?

zmx1012 commented 11 months ago

Hello, I also encountered the same problem. Is there a solution? My code reported this error after two loops and then stopped running.

akihironitta commented 11 months ago

@zmx1012 Would you mind sharing a minimal repro and your env details?
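For gathering the requested environment details, a small script like the one below can help (a sketch; `torch` and `torch_geometric` are assumed to be installed, and the block degrades gracefully if they are not). PyTorch also ships `python -m torch.utils.collect_env` for a fuller report.

```python
# Collect version info for the packages relevant to this issue.
import platform

def env_report(modules=("torch", "torch_geometric")):
    """Return a list of 'name version' strings for the given modules."""
    lines = [f"python {platform.python_version()}"]
    for name in modules:
        try:
            mod = __import__(name)
            lines.append(f"{name} {getattr(mod, '__version__', 'unknown')}")
        except ImportError:
            lines.append(f"{name} not installed")
    return lines

print("\n".join(env_report()))
```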

zmx1012 commented 11 months ago

Hello, this is my environment configuration: CUDA, PyG, Python, and PyTorch versions (table not captured in the transcript), GPU: RTX 4080. Is this the problem?

akihironitta commented 11 months ago

@zmx1012 Thanks for sharing the env details. It looks like you're using a very old PyG version. I'd try again with a newer PyG and see if it works. If the issue still persists, it'd be great if you could share a minimal script and the complete error so we can reproduce the behaviour.