cuDNN error: CUDNN_STATUS_INTERNAL_ERROR on cudnn.benchmark = True

moderatelyfunctional commented 5 years ago

Hi, I'm running PyTorch on an encoder/decoder architecture and am having a problem with cuDNN.

If I include

import torch.backends.cudnn as cudnn
cudnn.benchmark = True

in my Python code, then I receive an error for CUDNN_STATUS_INTERNAL_ERROR.

The full stack trace is listed here.

    main()
  File "main.py", line 152, in main
    train_model(train_dataloader, val_dataloader)
  File "main.py", line 131, in train_model
    network_output = network(network_input)
  File "/.local/lib/python3.5/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/model.py", line 133, in forward
    x = self.decoder(x)
  File "/.local/lib/python3.5/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "model.py", line 54, in forward
    x = self.layer4(x)
  File "/.local/lib/python3.5/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/.local/lib/python3.5/site-packages/torch/nn/modules/conv.py", line 757, in forward
    output_padding, self.groups, self.dilation)
RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR

If I comment out cuDNN, I can run the code without any problems.

My system configurations are listed below.

PyTorch Version: 1.0.1
OS: Ubuntu 16.04
PyTorch 1.0.1 installed from pip3
Python version: 3.5
CUDA/cuDNN version: 10.0/7.402
GPU models and configuration: Nvidia GPU Titan X

Additional context

I've tried rm -rf ~/.nv, rebuilt CUDA/cuDNN and reinstalled PyTorch but still cannot get it to work. Thanks for your help!

soumith commented 5 years ago

can you confirm that v1.0.0 works fine?

moderatelyfunctional commented 5 years ago

It also fails for v1.0.0. I installed it via pip3.

soumith commented 5 years ago

in that case, this is pretty weird. It shouldn't be failing on a Titan-X. I see that you have CUDA10 installed on the system. Can you try installing our CUDA10 wheel? The command is at https://pytorch.org (use the selector wizard in get started and select pip and cuda10)

soumith commented 5 years ago

I uploaded new 1.0.1.post2 binaries, can you give those a shot?

dmenig commented 5 years ago

Same problem on current official 1.0.1 when pushing batch size to more than 3 per gpu (on multi gpu context). Works when batch_size/ngpu <= 3. Problem only with Group norm layers

soumith commented 5 years ago

I'd like to investigate this, any chance you can give me a small code snippet that's failing?

FabianIsensee commented 5 years ago

I have the exact same problem. It works on pytorch 0.4.1 (cuDNN 7104) It does not work on pytorch 1.0.1.post2 (current stable release, cuDNN 7402) It does not work on pytorch 1.0.0.dev20190220 (current nightly, cuDNN 7402)

My model is a very large 3D UNet that takes 224x224x224 shaped inputs. If you want to try it you will need 32GB of VRAM. It will not crash with smaller inputs such as 128x128x128 (which you could fit on your regular 12GB card).

Run this snippet to reprocude:

from torch.backends import cudnn
from copy import deepcopy
from torch import nn
import torch
import numpy as np
import torch.nn.functional

class ConvDropoutNormNonlin(nn.Module):
    def __init__(self, input_channels, output_channels,
                 conv_op=nn.Conv2d, conv_kwargs=None,
                 norm_op=nn.BatchNorm2d, norm_op_kwargs=None,
                 dropout_op=nn.Dropout2d, dropout_op_kwargs=None,
                 nonlin=nn.LeakyReLU, nonlin_kwargs=None):
        super(ConvDropoutNormNonlin, self).__init__()
        if nonlin_kwargs is None:
            nonlin_kwargs = {'negative_slope': 1e-2, 'inplace': True}
        if dropout_op_kwargs is None:
            dropout_op_kwargs = {'p': 0.5, 'inplace': True}
        if norm_op_kwargs is None:
            norm_op_kwargs = {'eps': 1e-5, 'affine': True, 'momentum': 0.1}
        if conv_kwargs is None:
            conv_kwargs = {'kernel_size': 3, 'stride': 1, 'padding': 1, 'dilation': 1, 'bias': True}

        self.nonlin_kwargs = nonlin_kwargs
        self.nonlin = nonlin
        self.dropout_op = dropout_op
        self.dropout_op_kwargs = dropout_op_kwargs
        self.norm_op_kwargs = norm_op_kwargs
        self.conv_kwargs = conv_kwargs
        self.conv_op = conv_op
        self.norm_op = norm_op

        self.conv = self.conv_op(input_channels, output_channels, **self.conv_kwargs)
        if self.dropout_op is not None and self.dropout_op_kwargs['p'] is not None and self.dropout_op_kwargs[
            'p'] > 0:
            self.dropout = self.dropout_op(**self.dropout_op_kwargs)
        else:
            self.dropout = None
        self.instnorm = self.norm_op(output_channels, **self.norm_op_kwargs)
        self.lrelu = nn.LeakyReLU(**self.nonlin_kwargs)

    def forward(self, x):
        x = self.conv(x)
        if self.dropout is not None:
            x = self.dropout(x)
        return self.lrelu(self.instnorm(x))

class StackedConvLayers(nn.Module):
    def __init__(self, input_feature_channels, output_feature_channels, num_convs,
                 conv_op=nn.Conv2d, conv_kwargs=None,
                 norm_op=nn.BatchNorm2d, norm_op_kwargs=None,
                 dropout_op=nn.Dropout2d, dropout_op_kwargs=None,
                 nonlin=nn.LeakyReLU, nonlin_kwargs=None, first_stride=None):
        self.input_channels = input_feature_channels
        self.output_channels = output_feature_channels

        if nonlin_kwargs is None:
            nonlin_kwargs = {'negative_slope': 1e-2, 'inplace': True}
        if dropout_op_kwargs is None:
            dropout_op_kwargs = {'p': 0.5, 'inplace': True}
        if norm_op_kwargs is None:
            norm_op_kwargs = {'eps': 1e-5, 'affine': True, 'momentum': 0.1}
        if conv_kwargs is None:
            conv_kwargs = {'kernel_size': 3, 'stride': 1, 'padding': 1, 'dilation': 1, 'bias': True}

        self.nonlin_kwargs = nonlin_kwargs
        self.nonlin = nonlin
        self.dropout_op = dropout_op
        self.dropout_op_kwargs = dropout_op_kwargs
        self.norm_op_kwargs = norm_op_kwargs
        self.conv_kwargs = conv_kwargs
        self.conv_op = conv_op
        self.norm_op = norm_op

        if first_stride is not None:
            self.conv_kwargs_first_conv = deepcopy(conv_kwargs)
            self.conv_kwargs_first_conv['stride'] = first_stride
        else:
            self.conv_kwargs_first_conv = conv_kwargs

        super(StackedConvLayers, self).__init__()
        self.blocks = nn.Sequential(
            *([ConvDropoutNormNonlin(input_feature_channels, output_feature_channels, self.conv_op,
                                     self.conv_kwargs_first_conv,
                                     self.norm_op, self.norm_op_kwargs, self.dropout_op, self.dropout_op_kwargs,
                                     self.nonlin, self.nonlin_kwargs)] +
              [ConvDropoutNormNonlin(output_feature_channels, output_feature_channels, self.conv_op,
                                     self.conv_kwargs,
                                     self.norm_op, self.norm_op_kwargs, self.dropout_op, self.dropout_op_kwargs,
                                     self.nonlin, self.nonlin_kwargs) for _ in range(num_convs - 1)]))

    def forward(self, x):
        return self.blocks(x)

class Generic_UNet(nn.Module):
    def __init__(self, input_channels, base_num_features, num_classes, num_pool):
        super(Generic_UNet, self).__init__()

        self.nonlin_kwargs = {'negative_slope':1e-2, 'inplace':True}
        self.dropout_op_kwargs = {'p':0.0, 'inplace':True}
        self.norm_op_kwargs = {'eps':1e-5, 'affine':True}
        self.conv_kwargs = {'kernel_size':3,'padding':1, 'stride':1, 'dilation':1, 'bias':True}
        self.nonlin = nn.ReLU
        self.conv_op = nn.Conv3d
        self.norm_op = nn.InstanceNorm3d
        self.dropout_op = nn.Dropout3d
        self.num_classes = num_classes

        transpconv = nn.ConvTranspose3d

        self.conv_blocks_encoder = []
        self.conv_blocks_decoder = []
        self.transpConvs = []

        # encoder
        output_features = base_num_features
        input_features = input_channels
        for d in range(num_pool):
            # determine the first stride
            if d != 0:
                first_stride = 2
            else:
                first_stride = 1

            self.conv_blocks_encoder.append(StackedConvLayers(input_features, output_features, 2,
                                                              self.conv_op, self.conv_kwargs, self.norm_op,
                                                              self.norm_op_kwargs, self.dropout_op,
                                                              self.dropout_op_kwargs, self.nonlin, self.nonlin_kwargs,
                                                              first_stride))
            input_features = output_features
            output_features = int(np.round(output_features * 2))
            output_features = min(output_features, 480) # no more filters, otherwise we explode in num parameters

        # now the bottleneck.
        first_stride = 2
        final_num_features = output_features
        self.conv_blocks_encoder.append(nn.Sequential(
            StackedConvLayers(input_features, output_features, 2 - 1, self.conv_op, self.conv_kwargs,
                              self.norm_op, self.norm_op_kwargs, self.dropout_op, self.dropout_op_kwargs, self.nonlin,
                              self.nonlin_kwargs, first_stride),
            StackedConvLayers(output_features, final_num_features, 1, self.conv_op, self.conv_kwargs,
                              self.norm_op, self.norm_op_kwargs, self.dropout_op, self.dropout_op_kwargs, self.nonlin,
                              self.nonlin_kwargs)))

        # now lets build the decoder pathway
        for u in range(num_pool):
            nfeatures_from_down = final_num_features
            nfeatures_from_skip = self.conv_blocks_encoder[-(2 + u)].output_channels # self.conv_blocks_context[-1] is bottleneck, so start with -2
            n_features_after_tu_and_concat = nfeatures_from_skip * 2

            final_num_features = nfeatures_from_skip

            self.transpConvs.append(transpconv(nfeatures_from_down, nfeatures_from_skip, 2, 2, bias=False))

            self.conv_blocks_decoder.append(nn.Sequential(
                StackedConvLayers(n_features_after_tu_and_concat, nfeatures_from_skip, 2 - 1,
                                  self.conv_op, self.conv_kwargs, self.norm_op, self.norm_op_kwargs, self.dropout_op,
                                  self.dropout_op_kwargs, self.nonlin, self.nonlin_kwargs),
                StackedConvLayers(nfeatures_from_skip, final_num_features, 1, self.conv_op, self.conv_kwargs,
                                  self.norm_op, self.norm_op_kwargs, self.dropout_op, self.dropout_op_kwargs,
                                  self.nonlin, self.nonlin_kwargs)
            ))

        self.seg_output = self.conv_op(base_num_features, num_classes, 1, 1, 0, 1, 1, False)

        # register all modules properly
        self.conv_blocks_decoder = nn.ModuleList(self.conv_blocks_decoder)
        self.conv_blocks_encoder = nn.ModuleList(self.conv_blocks_encoder)
        self.transpConvs = nn.ModuleList(self.transpConvs)

    def forward(self, x):
        skips = []
        for d in range(len(self.conv_blocks_encoder) - 1):
            x = self.conv_blocks_encoder[d](x)
            skips.append(x)

        x = self.conv_blocks_encoder[-1](x)

        for u in range(len(self.transpConvs)):
            x = self.transpConvs[u](x)
            x = torch.cat((x, skips[-(u + 1)]), dim=1)
            x = self.conv_blocks_decoder[u](x)

        seg_output = self.seg_output(x)

        return seg_output

if __name__ == "__main__":
    cudnn.benchmark = True

    net = Generic_UNet(1, 30, 3, 5).cuda()

    a = torch.rand((1, 1, 224, 224, 224)).pin_memory().cuda()

    res = net(a)
    loss = res.sum()

    loss.backward()

Set cudnn.benchmark = False and this example will run on all pytorch versions. I hope you can help. This seems to be a tricky one... :-)

Best, Fabian

Edit: I printed the filter map sizes for your convenience:

encoder 0 x.shape: torch.Size([1, 30, 224, 224, 224]) encoder 1 x.shape: torch.Size([1, 60, 112, 112, 112]) encoder 2 x.shape: torch.Size([1, 120, 56, 56, 56]) encoder 3 x.shape: torch.Size([1, 240, 28, 28, 28]) encoder 4 x.shape: torch.Size([1, 480, 14, 14, 14]) bottleneck, x.shape torch.Size([1, 480, 7, 7, 7]) transpconv 0 x.shape torch.Size([1, 480, 14, 14, 14]) decoder 0 x.shape torch.Size([1, 480, 14, 14, 14]) transpconv 1 x.shape torch.Size([1, 240, 28, 28, 28]) decoder 1 x.shape torch.Size([1, 240, 28, 28, 28]) transpconv 2 x.shape torch.Size([1, 120, 56, 56, 56]) decoder 2 x.shape torch.Size([1, 120, 56, 56, 56]) transpconv 3 x.shape torch.Size([1, 60, 112, 112, 112]) decoder 3 x.shape torch.Size([1, 60, 112, 112, 112]) transpconv 4 x.shape torch.Size([1, 30, 224, 224, 224]) decoder 4 x.shape torch.Size([1, 30, 224, 224, 224]) segmentation_output.shape torch.Size([1, 3, 224, 224, 224])

I also did some additional experiments:

In [2]: torch.__version__                                                                                           
Out[2]: '1.0.0.dev20190220

32x32x32 -> works
64x64x64 -> works
96x96x96 -> works
128x128x128 -> works
160x160x160 -> works
192x192x192 -> works
224x224x224 -> CUDNN_STATUS_INTERNAL_ERROR
256x256x256 -> works (you need to use 20 base_num_features, otherwise it's too large)

224x224x224 with 20 base_num_features -> CUDNN_STATUS_INTERNAL_ERROR
224x224x224 with 24 base_num_features -> CUDNN_STATUS_INTERNAL_ERROR
224x224x224 with 4 base_num_features -> CUDNN_STATUS_INTERNAL_ERROR
224x224x224 with 32 base_num_features -> CUDNN_STATUS_INTERNAL_ERROR

cudnn.benchmark = False -> works

Interestingly the error only appears on 224x224x224 and not on smaller or larger inputs. I am running the 410.79 driver and Cuda-10.0. All experiments done on a V100 32GB card.

The error message is this one:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-1-2db53cb1edfd> in <module>
    205     loss = res.sum()
    206 
--> 207     loss.backward()
    208 
    209     """

~/dl_venv/lib/python3.6/site-packages/torch/tensor.py in backward(self, gradient, retain_graph, create_graph)
    104                 products. Defaults to ``False``.
    105         """
--> 106         torch.autograd.backward(self, gradient, retain_graph, create_graph)
    107 
    108     def register_hook(self, hook):

~/dl_venv/lib/python3.6/site-packages/torch/autograd/__init__.py in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables)
     91     Variable._execution_engine.run_backward(
     92         tensors, grad_tensors, retain_graph, create_graph,
---> 93         allow_unreachable=True)  # allow_unreachable flag
     94 
     95 

RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR

ngimel commented 5 years ago

Thank you for your investigation and repro @FabianIsensee. It's a cudnn bug present at least in cudnn 7.4 and 7.5. I'm trying to get information on workarounds. The failing call is most likely for decoder 4 convolution weight gradient (input size 1,30,224,224,224, output features = 3, filter size =1) - I'm certain on sizes and it being weight gradient, not exactly sure if it's decoder 4 or not.

FabianIsensee commented 5 years ago

Thank you @ngimel for investigating this! If there is anything I can do to help, I'd be happy to do so. This bug is hindering some experiments and it would also be in my interest to get it solved relatively quickly

Olloxan commented 5 years ago

Hi, I get this error, when I want to call the .cuda() method on a network which contains a GRU. I have a Windows system with Pytorch 1.0.1, CUDA 10 and cudnn 7.401

mruberry commented 5 years ago

We are working on a fix for this. Will update ASAP.

FabianIsensee commented 5 years ago

That's great to hear, thank you so much!

CyFeng16 commented 5 years ago

Hi, I'm facing the same error. I receive RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR error running on a multi-gpus server via NGC PyTorch docker image, while running the same code in my personal computer via PyTorch installed by conda gets success.

mruberry commented 5 years ago

@CyFeng16 can you post a code snippet that reproduces the error you're finding as well as instructions to run the snippet?

Also, which NGC PyTorch docker image are you using?

CyFeng16 commented 5 years ago

@mruberry Some additional info:

NGC docker within PyTorch version 1.1.0a0+be364ac failed.(on server) REPOSITORY TAG IMAGE ID CREATED SIZE nvcr.io/nvidia/pytorch 19.03-py3 697cd637fb1b 3 weeks ago 7.57GB
Conda env within PyTorch version 1.0.1.post2 run perfectly.(on PC)

Apologize that could not share whole codes for now. Code starts as follows:

# Random seed setting
torch.manual_seed(16)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

mruberry commented 5 years ago

Thank you for reporting the issue, but without a snippet to reproduce the error you're seeing I don't know if we'll be able to fix your issue. While you're seeing the same error, the cause may be different.

CyFeng16 commented 5 years ago

:label: will report when I find an alternative way to solve the problem.

[updates]

@mruberry After changing the docker image from 19.03 to 19.01 which contains PyTorch ver 1.0.0a0+056cfaf, the code works fine. Cheers.

[configurations]

OS: Ubuntu 18.10(PC)
CUDA Version: 10.0.130(PC)
cuDNN Version: 7.5(PC)
GPU models: 1080 Ti && 2080 Ti(PC) V100(DGX Server)

[summary]

1.0.0a0+056cfaf used via NGC image 19.01 worked.
1.0.1.post2 installed via conda worked.
1.1.0a0+be364ac used via NGC image 19.03 failed.

It's a pleasure if this could help you work on the fix.

mruberry commented 5 years ago

We believe we have identified the root cause of this issue and are working on a fix in cuDNN.

code-de commented 5 years ago

Hey @mruberry, any updates so far?

mruberry commented 5 years ago

We are testing a fix that we expect will ship in a future version of cuDNN. Unfortunately I cannot be more specific than that at the moment.

FingerRec commented 5 years ago

In pytorch1.1, code works well

soumith commented 5 years ago

thanks for reporting @FingerRec

CyFeng16 commented 5 years ago

@FingerRec glad to hear!

ksarma commented 5 years ago

I'm still able to reproduce what @FabianIsensee posted using pytorch 1.1.0a0+9eb0f43 (NGC 19.04) (using his snippet)

This NGC image only uses cuDNN 7.5.0, not sure if 7.5.1 or 7.6 would fix

ngimel commented 5 years ago

@ksarma please use official pytorch 1.1 conda/pip packages or image from dockerhub/pytorch.

ksarma commented 5 years ago

@ngimel Apologies, was using what I thought was the latest pytorch image from NGC, but it turns out there was a new one yesterday (19.05) and the issue seems to be fixed :) This image has cuDNN 7.6.0 and pytorch 1.1.0a0+828a6a3

JZPeterPan commented 3 years ago

could also be a compatible problem of torch and cudnn, this could help: https://github.com/SwinTransformer/Swin-Transformer-Semantic-Segmentation/issues/12

rave974 commented 3 years ago

can you confirm that v1.0.0 works fine?

WHAT IS THE REASON OF THIS ERROR?

Charlyo commented 3 years ago

Still happening on torch==1.9.0+cu111 and cuda 11.4.

niceblue88 commented 3 years ago

Yes still broken, at least for me. Pytorch 1.9, Cuda 11 and using Titan X in Windows 10. Benchmark = false works no problem reliably. Benchmark = true works for a bit (with double the speed of = false, which is great), but then randomly breaks for no seeming reason with CUDA ERROR. The fact that is works very well for a bit shows it can work very well (with 2x speed up), but clearly there is a bug somewhere that's been in there for at least 2 years now.

cgsaxner commented 3 years ago

Yes still broken, at least for me. Pytorch 1.9, Cuda 11 and using Titan X in Windows 10. Benchmark = false works no problem reliably. Benchmark = true works for a bit (with double the speed of = false, which is great), but then randomly breaks for no seeming reason with CUDA ERROR. The fact that is works very well for a bit shows it can work very well (with 2x speed up), but clearly there is a bug somewhere that's been in there for at least 2 years now.

I'm running into the same issue using PyTorch 1.9.1, CUDA 10.2 and CuDNN 7.6.5 on Quadro RTX 8000.

ShoufaChen commented 2 years ago

The latest pytorch version: 1.10.1+cu111, cudnn8.0.5 still happens on both V100 and A100 GPUs. It works well on 3090.

Setting torch.backends.cudnn.benchmark = False is a workarond.

ngimel commented 2 years ago

Please provide a self contained reproducible script triggering the problem. Pytorch 1.10.1 comes with cudnn 8.2, so I'm not sure why you are listing cudnn8.0.5

niceblue88 commented 2 years ago

Please also have a look at my response on this problem in a related thread https://github.com/pytorch/pytorch/issues/45769#issuecomment-936316009

ngimel commented 2 years ago

It doesn't have self contained reproduucible script. Without it, the bug is not actionable for us.

ShoufaChen commented 2 years ago

@ngimel Thanks for your reply. For cudnn version, I use the https://anaconda.org/pytorch/pytorch/1.10.1/download/linux-64/pytorch-1.10.1-py3.8_cuda11.1_cudnn8.0.5_0.tar.bz2 . It seems that Pytorch1.10.1 also comes with cudnn 8.0.5.

I encounter this issue when using this repo. The related issue is https://github.com/lulutang0608/Point-BERT/issues/14#issuecomment-1003677580.

rui-ren commented 2 years ago

@ShoufaChen Hi, I am running some experiments to test the K80 gpu and got the same error as you posted here https://github.com/lulutang0608/Point-BERT/issues/14#issuecomment-1003677580. Did you solve this error? Thanks.

rave974 commented 2 years ago

the solution is to run inference in a loop, when error comes, run program again, resume until finish

ShoufaChen commented 2 years ago

@Rui-Ren I am sorry that I didn't fix the problem. I just changed the pytorch version.

rui-ren commented 2 years ago

Ok, thanks, I will add some retries there and see how it goes.

@Rui-Ren I am sorry that I didn't fix the problem. I just changed the pytorch version.

Thanks you! I am using pytorch 1.11.0, and will try different pytorch version.

GuoleiSun commented 2 years ago

Hello, I faced the problem. It is very strange.

I faced the problem when my code is running on A100 with a specific batch size (2) and with 4 GPUs training. The code runs well on RTX 6000 and V100. It also works well if I use a different batch size (1,3,4) or different number of GPUs (1,2,3) on A100.
Here is the shown error "runner.outputs['loss'].backward() File "/cluster/apps/nss/gcc-6.3.0/python_gpu/3.8.5/torch/tensor.py", line 221, in backward torch.autograd.backward(self, gradient, retain_graph, create_graph) File "/cluster/apps/nss/gcc-6.3.0/python_gpu/3.8.5/torch/autograd/init.py", line 130, in backward Variable._execution_engine.run_backward( RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED"
Setting torch.backends.cudnn.benchmark = False works for me as well.

pytorch / pytorch

cuDNN error: CUDNN_STATUS_INTERNAL_ERROR on cudnn.benchmark = True #16831

Additional context