pyg-team / pytorch_geometric

Graph Neural Network Library for PyTorch
https://pyg.org
MIT License

Bug on 2080ti with CUDA error: CUBLAS_STATUS_INTERNAL_ERROR #2570

Closed nicolasH1027 closed 3 years ago

nicolasH1027 commented 3 years ago

🐛 Bug

To Reproduce

My model is shown below. BasicModule is just a base class with methods that wrap torch.load and torch.save.

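import torch.nn as nn
from torch import Tensor
from torch_geometric.nn import GCNConv

# BasicModule (not shown) is my own base class that wraps torch.load and torch.save.
class CovNet(BasicModule):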
    '''
    Graph Convolution Network
    '''
    def __init__(self,
        num_feature: int = 16,            # number of features
        hidden_channels: int = 16,        # number of hidden channel
        num_class: int = 7,               # number of classes
        num_cov: int = 5,                 # number of convolution layer
        act = nn.ReLU(),                  # activation function
        p: float = None,                  # dropout rate     
    ):
        super(CovNet, self).__init__() 

        self.num_cov = num_cov
        self.p = p
        self.act = act

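        for i in range(self.num_cov):
            # Only the first layer sees the raw input features; later layers take hidden_channels.
            in_channels = num_feature if i == 0 else hidden_channels
            # cached=True because this is node classification on a single fixed graph
            setattr(self, f'conv{i}', GCNConv(in_channels, hidden_channels, cached=True))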

        if self.p:
            self.drop = nn.Dropout(p = self.p)

        self.output = GCNConv(hidden_channels, num_class, cached=True)

    def forward(self, x: Tensor, edge_index) -> Tensor:     
        for i in range(self.num_cov):
            x = getattr(self, f'conv{i}')(x, edge_index)
            x = self.act(x)
            if self.p:
                x = self.drop(x)

        x = self.output(x, edge_index)
        return x

Error

If I run the code on a 1080 Ti, everything is fine. However, if I run it on a 2080 Ti, I get the following error:

Traceback (most recent call last):
  File "main.py", line 84, in <module>
    fire.Fire()
  File "/afs/crc.nd.edu/user/z/zhu4/.local/lib/python3.7/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/afs/crc.nd.edu/user/z/zhu4/.local/lib/python3.7/site-packages/fire/core.py", line 471, in _Fire
    target=component.__name__)
  File "/afs/crc.nd.edu/user/z/zhu4/.local/lib/python3.7/site-packages/fire/core.py", line 681, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "main.py", line 44, in train
    out = model(data.x, data.edge_index)  
  File "/afs/crc.nd.edu/user/z/zhu4/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/afs/crc.nd.edu/user/z/zhu4/GraphNeuralNetwork/models/net.py", line 37, in forward
    x = getattr(self, f'conv{i}')(x, edge_index)
  File "/afs/crc.nd.edu/user/z/zhu4/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/afs/crc.nd.edu/user/z/zhu4/.local/lib/python3.7/site-packages/torch_geometric/nn/conv/gcn_conv.py", line 179, in forward
    x = x @ self.weight
RuntimeError: CUDA error: CUBLAS_STATUS_INTERNAL_ERROR when calling `cublasCreate(handle)`

Environment

Additional context

It's really weird that the code runs on a 1080 Ti but not on a 2080 Ti. The dataset I used here is Cora.

rusty1s commented 3 years ago

Are you sure this is related to PyG? Does your model work as expected when swapping out GCNConv with torch.nn.Linear? I'm asking because it seems that multiple people have hit this issue with PyTorch >= 1.8.0, see https://discuss.pytorch.org/t/cuda-error-cublas-status-internal-error-when-calling-cublascreate-handle/114341/4.
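
A minimal sketch of that check, assuming the same layer layout as CovNet with torch.nn.Linear in place of GCNConv. The class name LinearNet and the random input are hypothetical stand-ins, not code from the report. If this also fails on the 2080 Ti, the problem is most likely in the PyTorch/CUDA setup rather than in PyG.

```python
import torch
import torch.nn as nn


class LinearNet(nn.Module):
    """Hypothetical stand-in for CovNet: same depth, nn.Linear instead of GCNConv."""

    def __init__(self, num_feature=16, hidden_channels=16, num_class=7, num_cov=5):
        super().__init__()
        # First layer maps the input features, the rest map hidden -> hidden.
        dims = [num_feature] + [hidden_channels] * num_cov
        self.layers = nn.ModuleList(
            [nn.Linear(d_in, d_out) for d_in, d_out in zip(dims[:-1], dims[1:])]
        )
        self.act = nn.ReLU()
        self.output = nn.Linear(hidden_channels, num_class)

    def forward(self, x):
        for layer in self.layers:
            x = self.act(layer(x))
        return self.output(x)


# Falls back to the CPU when no GPU is visible, so the script itself always runs.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = LinearNet().to(device)
x = torch.randn(32, 16, device=device)  # random stand-in for data.x

# On a CUDA device, the first matrix multiply is where the cuBLAS handle gets created.
out = model(x)
print(device, tuple(out.shape))
```

No edge_index is needed here, since nn.Linear ignores the graph structure; the point is only to exercise the same dense matrix multiply that fails in the traceback.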

nicolasH1027 commented 3 years ago

Hi, thanks for your suggestion. I think it's a PyTorch issue. By the way, does pytorch_geometric support torch 1.8.1+cu102? I'm trying to decide which version to switch to.

rusty1s commented 3 years ago

Yes, PyTorch 1.8.1 is fully supported by the 1.8.0 wheels.
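
As a rough illustration of that rule, the helper below maps an installed torch version string to the wheel build to look for. The function name wheel_tag and the tag format are hypothetical, and applying the rule to patch releases other than 1.8.1 is an assumption, not something confirmed in this thread.

```python
import re


def wheel_tag(torch_version: str) -> str:
    """Map a version string like '1.8.1+cu102' to a tag for the matching x.y.0 build."""
    match = re.fullmatch(r"(\d+)\.(\d+)\.\d+\+(\w+)", torch_version)
    if match is None:
        raise ValueError(f"unrecognized torch version: {torch_version!r}")
    major, minor, cuda = match.groups()
    # Assumption: a patch release reuses the wheels built for its x.y.0 release.
    return f"torch-{major}.{minor}.0+{cuda}"


print(wheel_tag("1.8.1+cu102"))  # torch-1.8.0+cu102
```

In practice the input would come from torch.__version__, which includes the CUDA suffix on CUDA-specific builds.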