can you confirm that v1.0.0 works fine?
It also fails for v1.0.0. I installed it via pip3.
In that case, this is pretty weird. It shouldn't be failing on a Titan X. I see that you have CUDA 10 installed on the system. Can you try installing our CUDA 10 wheel? The command is at https://pytorch.org (use the selector wizard in Get Started and select pip and CUDA 10).
I uploaded new 1.0.1.post2 binaries, can you give those a shot?
Same problem on the current official 1.0.1 when pushing the batch size to more than 3 per GPU (in a multi-GPU context). It works when batch_size / n_gpus <= 3. The problem only appears with GroupNorm layers.
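I can't post the real model, but the failing setup is roughly the following sketch (placeholder layer sizes, and it assumes 4 GPUs; not a verified standalone repro):

import torch
from torch import nn

# Placeholder layer sizes; the point is GroupNorm + DataParallel with
# batch_size / n_gpus > 3 on the 1.0.1 release.
model = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1),
    nn.GroupNorm(num_groups=8, num_channels=32),
    nn.ReLU(inplace=True),
    nn.Conv2d(32, 3, kernel_size=3, padding=1),
)
model = nn.DataParallel(model).cuda()

x = torch.rand(16, 3, 256, 256).cuda()   # 16 samples / 4 GPUs = 4 per GPU (> 3)
model(x).sum().backward()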
I'd like to investigate this, any chance you can give me a small code snippet that's failing?
I have the exact same problem.
It works on pytorch 0.4.1 (cuDNN 7104).
It does not work on pytorch 1.0.1.post2 (current stable release, cuDNN 7402).
It does not work on pytorch 1.0.0.dev20190220 (current nightly, cuDNN 7402).
My model is a very large 3D UNet that takes 224x224x224 shaped inputs. If you want to try it you will need 32GB of VRAM. It will not crash with smaller inputs such as 128x128x128 (which you could fit on your regular 12GB card).
Run this snippet to reproduce:
from torch.backends import cudnn
from copy import deepcopy
from torch import nn
import torch
import numpy as np
import torch.nn.functional


class ConvDropoutNormNonlin(nn.Module):
    def __init__(self, input_channels, output_channels,
                 conv_op=nn.Conv2d, conv_kwargs=None,
                 norm_op=nn.BatchNorm2d, norm_op_kwargs=None,
                 dropout_op=nn.Dropout2d, dropout_op_kwargs=None,
                 nonlin=nn.LeakyReLU, nonlin_kwargs=None):
        super(ConvDropoutNormNonlin, self).__init__()
        if nonlin_kwargs is None:
            nonlin_kwargs = {'negative_slope': 1e-2, 'inplace': True}
        if dropout_op_kwargs is None:
            dropout_op_kwargs = {'p': 0.5, 'inplace': True}
        if norm_op_kwargs is None:
            norm_op_kwargs = {'eps': 1e-5, 'affine': True, 'momentum': 0.1}
        if conv_kwargs is None:
            conv_kwargs = {'kernel_size': 3, 'stride': 1, 'padding': 1, 'dilation': 1, 'bias': True}
        self.nonlin_kwargs = nonlin_kwargs
        self.nonlin = nonlin
        self.dropout_op = dropout_op
        self.dropout_op_kwargs = dropout_op_kwargs
        self.norm_op_kwargs = norm_op_kwargs
        self.conv_kwargs = conv_kwargs
        self.conv_op = conv_op
        self.norm_op = norm_op
        self.conv = self.conv_op(input_channels, output_channels, **self.conv_kwargs)
        if self.dropout_op is not None and self.dropout_op_kwargs['p'] is not None and self.dropout_op_kwargs['p'] > 0:
            self.dropout = self.dropout_op(**self.dropout_op_kwargs)
        else:
            self.dropout = None
        self.instnorm = self.norm_op(output_channels, **self.norm_op_kwargs)
        self.lrelu = nn.LeakyReLU(**self.nonlin_kwargs)

    def forward(self, x):
        x = self.conv(x)
        if self.dropout is not None:
            x = self.dropout(x)
        return self.lrelu(self.instnorm(x))


class StackedConvLayers(nn.Module):
    def __init__(self, input_feature_channels, output_feature_channels, num_convs,
                 conv_op=nn.Conv2d, conv_kwargs=None,
                 norm_op=nn.BatchNorm2d, norm_op_kwargs=None,
                 dropout_op=nn.Dropout2d, dropout_op_kwargs=None,
                 nonlin=nn.LeakyReLU, nonlin_kwargs=None, first_stride=None):
        self.input_channels = input_feature_channels
        self.output_channels = output_feature_channels
        if nonlin_kwargs is None:
            nonlin_kwargs = {'negative_slope': 1e-2, 'inplace': True}
        if dropout_op_kwargs is None:
            dropout_op_kwargs = {'p': 0.5, 'inplace': True}
        if norm_op_kwargs is None:
            norm_op_kwargs = {'eps': 1e-5, 'affine': True, 'momentum': 0.1}
        if conv_kwargs is None:
            conv_kwargs = {'kernel_size': 3, 'stride': 1, 'padding': 1, 'dilation': 1, 'bias': True}
        self.nonlin_kwargs = nonlin_kwargs
        self.nonlin = nonlin
        self.dropout_op = dropout_op
        self.dropout_op_kwargs = dropout_op_kwargs
        self.norm_op_kwargs = norm_op_kwargs
        self.conv_kwargs = conv_kwargs
        self.conv_op = conv_op
        self.norm_op = norm_op
        if first_stride is not None:
            self.conv_kwargs_first_conv = deepcopy(conv_kwargs)
            self.conv_kwargs_first_conv['stride'] = first_stride
        else:
            self.conv_kwargs_first_conv = conv_kwargs
        super(StackedConvLayers, self).__init__()
        self.blocks = nn.Sequential(
            *([ConvDropoutNormNonlin(input_feature_channels, output_feature_channels, self.conv_op,
                                     self.conv_kwargs_first_conv,
                                     self.norm_op, self.norm_op_kwargs, self.dropout_op, self.dropout_op_kwargs,
                                     self.nonlin, self.nonlin_kwargs)] +
              [ConvDropoutNormNonlin(output_feature_channels, output_feature_channels, self.conv_op,
                                     self.conv_kwargs,
                                     self.norm_op, self.norm_op_kwargs, self.dropout_op, self.dropout_op_kwargs,
                                     self.nonlin, self.nonlin_kwargs) for _ in range(num_convs - 1)]))

    def forward(self, x):
        return self.blocks(x)


class Generic_UNet(nn.Module):
    def __init__(self, input_channels, base_num_features, num_classes, num_pool):
        super(Generic_UNet, self).__init__()
        self.nonlin_kwargs = {'negative_slope': 1e-2, 'inplace': True}
        self.dropout_op_kwargs = {'p': 0.0, 'inplace': True}
        self.norm_op_kwargs = {'eps': 1e-5, 'affine': True}
        self.conv_kwargs = {'kernel_size': 3, 'padding': 1, 'stride': 1, 'dilation': 1, 'bias': True}
        self.nonlin = nn.ReLU
        self.conv_op = nn.Conv3d
        self.norm_op = nn.InstanceNorm3d
        self.dropout_op = nn.Dropout3d
        self.num_classes = num_classes
        transpconv = nn.ConvTranspose3d
        self.conv_blocks_encoder = []
        self.conv_blocks_decoder = []
        self.transpConvs = []

        # encoder
        output_features = base_num_features
        input_features = input_channels
        for d in range(num_pool):
            # determine the first stride
            if d != 0:
                first_stride = 2
            else:
                first_stride = 1
            self.conv_blocks_encoder.append(StackedConvLayers(input_features, output_features, 2,
                                                              self.conv_op, self.conv_kwargs, self.norm_op,
                                                              self.norm_op_kwargs, self.dropout_op,
                                                              self.dropout_op_kwargs, self.nonlin, self.nonlin_kwargs,
                                                              first_stride))
            input_features = output_features
            output_features = int(np.round(output_features * 2))
            output_features = min(output_features, 480)  # no more filters, otherwise we explode in num parameters

        # now the bottleneck.
        first_stride = 2
        final_num_features = output_features
        self.conv_blocks_encoder.append(nn.Sequential(
            StackedConvLayers(input_features, output_features, 2 - 1, self.conv_op, self.conv_kwargs,
                              self.norm_op, self.norm_op_kwargs, self.dropout_op, self.dropout_op_kwargs, self.nonlin,
                              self.nonlin_kwargs, first_stride),
            StackedConvLayers(output_features, final_num_features, 1, self.conv_op, self.conv_kwargs,
                              self.norm_op, self.norm_op_kwargs, self.dropout_op, self.dropout_op_kwargs, self.nonlin,
                              self.nonlin_kwargs)))

        # now lets build the decoder pathway
        for u in range(num_pool):
            nfeatures_from_down = final_num_features
            nfeatures_from_skip = self.conv_blocks_encoder[-(2 + u)].output_channels  # self.conv_blocks_context[-1] is bottleneck, so start with -2
            n_features_after_tu_and_concat = nfeatures_from_skip * 2
            final_num_features = nfeatures_from_skip
            self.transpConvs.append(transpconv(nfeatures_from_down, nfeatures_from_skip, 2, 2, bias=False))
            self.conv_blocks_decoder.append(nn.Sequential(
                StackedConvLayers(n_features_after_tu_and_concat, nfeatures_from_skip, 2 - 1,
                                  self.conv_op, self.conv_kwargs, self.norm_op, self.norm_op_kwargs, self.dropout_op,
                                  self.dropout_op_kwargs, self.nonlin, self.nonlin_kwargs),
                StackedConvLayers(nfeatures_from_skip, final_num_features, 1, self.conv_op, self.conv_kwargs,
                                  self.norm_op, self.norm_op_kwargs, self.dropout_op, self.dropout_op_kwargs,
                                  self.nonlin, self.nonlin_kwargs)
            ))

        self.seg_output = self.conv_op(base_num_features, num_classes, 1, 1, 0, 1, 1, False)

        # register all modules properly
        self.conv_blocks_decoder = nn.ModuleList(self.conv_blocks_decoder)
        self.conv_blocks_encoder = nn.ModuleList(self.conv_blocks_encoder)
        self.transpConvs = nn.ModuleList(self.transpConvs)

    def forward(self, x):
        skips = []
        for d in range(len(self.conv_blocks_encoder) - 1):
            x = self.conv_blocks_encoder[d](x)
            skips.append(x)
        x = self.conv_blocks_encoder[-1](x)
        for u in range(len(self.transpConvs)):
            x = self.transpConvs[u](x)
            x = torch.cat((x, skips[-(u + 1)]), dim=1)
            x = self.conv_blocks_decoder[u](x)
        seg_output = self.seg_output(x)
        return seg_output


if __name__ == "__main__":
    cudnn.benchmark = True
    net = Generic_UNet(1, 30, 3, 5).cuda()
    a = torch.rand((1, 1, 224, 224, 224)).pin_memory().cuda()
    res = net(a)
    loss = res.sum()
    loss.backward()
Set cudnn.benchmark = False and this example will run on all PyTorch versions.
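If you want to keep benchmarking enabled for the rest of the run, the workaround can also be scoped with a small helper along these lines (a sketch; the flag is a process-wide global that is simply restored afterwards, and whether it helps depends on where the failing call happens):

import contextlib
from torch.backends import cudnn

# Hypothetical helper (not part of torch): temporarily turn off cudnn
# benchmarking, e.g. only around the backward pass, then restore the setting.
@contextlib.contextmanager
def no_cudnn_benchmark():
    prev = cudnn.benchmark
    cudnn.benchmark = False
    try:
        yield
    finally:
        cudnn.benchmark = prev

# usage:
# with no_cudnn_benchmark():
#     loss.backward()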
I hope you can help. This seems to be a tricky one... :-)
Best, Fabian
Edit: I printed the filter map sizes for your convenience:
encoder 0 x.shape: torch.Size([1, 30, 224, 224, 224])
encoder 1 x.shape: torch.Size([1, 60, 112, 112, 112])
encoder 2 x.shape: torch.Size([1, 120, 56, 56, 56])
encoder 3 x.shape: torch.Size([1, 240, 28, 28, 28])
encoder 4 x.shape: torch.Size([1, 480, 14, 14, 14])
bottleneck, x.shape torch.Size([1, 480, 7, 7, 7])
transpconv 0 x.shape torch.Size([1, 480, 14, 14, 14])
decoder 0 x.shape torch.Size([1, 480, 14, 14, 14])
transpconv 1 x.shape torch.Size([1, 240, 28, 28, 28])
decoder 1 x.shape torch.Size([1, 240, 28, 28, 28])
transpconv 2 x.shape torch.Size([1, 120, 56, 56, 56])
decoder 2 x.shape torch.Size([1, 120, 56, 56, 56])
transpconv 3 x.shape torch.Size([1, 60, 112, 112, 112])
decoder 3 x.shape torch.Size([1, 60, 112, 112, 112])
transpconv 4 x.shape torch.Size([1, 30, 224, 224, 224])
decoder 4 x.shape torch.Size([1, 30, 224, 224, 224])
segmentation_output.shape torch.Size([1, 3, 224, 224, 224])
I also did some additional experiments:
In [2]: torch.__version__
Out[2]: '1.0.0.dev20190220'

32x32x32 -> works
64x64x64 -> works
96x96x96 -> works
128x128x128 -> works
160x160x160 -> works
192x192x192 -> works
224x224x224 -> CUDNN_STATUS_INTERNAL_ERROR
256x256x256 -> works (you need to use 20 base_num_features, otherwise it's too large)
224x224x224 with 20 base_num_features -> CUDNN_STATUS_INTERNAL_ERROR
224x224x224 with 24 base_num_features -> CUDNN_STATUS_INTERNAL_ERROR
224x224x224 with 4 base_num_features -> CUDNN_STATUS_INTERNAL_ERROR
224x224x224 with 32 base_num_features -> CUDNN_STATUS_INTERNAL_ERROR
cudnn.benchmark = False -> works
Interestingly, the error only appears on 224x224x224 and not on smaller or larger inputs. I am running the 410.79 driver and CUDA 10.0. All experiments were done on a V100 32GB card.
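For completeness, the sweep was produced with a loop roughly like the following (not my exact script; it reuses the Generic_UNet from the snippet above, and since an internal cuDNN error can leave the CUDA context unusable, each size is really better tested in a fresh process):

import torch
from torch.backends import cudnn

# Rough sketch of the size sweep: try a forward/backward pass at each edge
# length with benchmarking on and report whether cuDNN throws.
cudnn.benchmark = True
for edge in [32, 64, 96, 128, 160, 192, 224]:
    net = Generic_UNet(1, 30, 3, 5).cuda()
    a = torch.rand((1, 1, edge, edge, edge)).cuda()
    try:
        net(a).sum().backward()
        print(edge, "works")
    except RuntimeError as e:
        print(edge, "fails:", e)
    del net, a
    torch.cuda.empty_cache()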
The error message is this one:
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-1-2db53cb1edfd> in <module>
205 loss = res.sum()
206
--> 207 loss.backward()
208
209 """
~/dl_venv/lib/python3.6/site-packages/torch/tensor.py in backward(self, gradient, retain_graph, create_graph)
104 products. Defaults to ``False``.
105 """
--> 106 torch.autograd.backward(self, gradient, retain_graph, create_graph)
107
108 def register_hook(self, hook):
~/dl_venv/lib/python3.6/site-packages/torch/autograd/__init__.py in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables)
91 Variable._execution_engine.run_backward(
92 tensors, grad_tensors, retain_graph, create_graph,
---> 93 allow_unreachable=True) # allow_unreachable flag
94
95
RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR
Thank you for your investigation and repro @FabianIsensee. It's a cuDNN bug present at least in cuDNN 7.4 and 7.5. I'm trying to get information on workarounds. The failing call is most likely the convolution weight gradient for decoder 4 (input size 1,30,224,224,224, output features = 3, filter size = 1). I'm certain about the sizes and that it is the weight gradient; I'm not exactly sure whether it's decoder 4 or not.
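If that analysis is right, a much smaller repro might look like the following untested sketch, which isolates just that convolution; benchmark-mode algorithm selection depends on available workspace, so it may or may not fail outside the full network:

import torch
from torch import nn
from torch.backends import cudnn

# Untested sketch: the suspected failing call in isolation - the weight gradient
# of a 1x1x1 Conv3d (30 -> 3 channels) on a 1x30x224x224x224 input, benchmark on.
cudnn.benchmark = True
conv = nn.Conv3d(30, 3, kernel_size=1).cuda()
x = torch.rand(1, 30, 224, 224, 224, device='cuda', requires_grad=True)
conv(x).sum().backward()  # the weight gradient is computed here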
Thank you @ngimel for investigating this! If there is anything I can do to help, I'd be happy to do so. This bug is hindering some experiments and it would also be in my interest to get it solved relatively quickly
Hi, I get this error when I call the .cuda() method on a network that contains a GRU. I have a Windows system with PyTorch 1.0.1, CUDA 10 and cuDNN 7.4.1.
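Roughly what I'm doing is sketched below (trimmed down, with placeholder layer sizes rather than my real network):

import torch
from torch import nn

# Trimmed-down sketch: the error is raised as soon as the module containing
# the GRU is moved to the GPU.
class GRUNet(nn.Module):
    def __init__(self):
        super(GRUNet, self).__init__()
        self.gru = nn.GRU(input_size=128, hidden_size=256, num_layers=2, batch_first=True)
        self.fc = nn.Linear(256, 10)

    def forward(self, x):
        out, _ = self.gru(x)          # out: (batch, seq, hidden)
        return self.fc(out[:, -1])    # classify from the last time step

net = GRUNet().cuda()  # <- this is where the cuDNN error appears for me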
We are working on a fix for this. Will update ASAP.
That's great to hear, thank you so much!
Hi, I'm facing the same error.
I receive RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR when running on a multi-GPU server via an NGC PyTorch docker image, while running the same code on my personal computer with PyTorch installed via conda succeeds.
@CyFeng16 can you post a code snippet that reproduces the error you're finding as well as instructions to run the snippet?
Also, which NGC PyTorch docker image are you using?
@mruberry Some additional info:

PyTorch 1.1.0a0+be364ac failed (on the server). The server uses this NGC image:

REPOSITORY               TAG         IMAGE ID       CREATED       SIZE
nvcr.io/nvidia/pytorch   19.03-py3   697cd637fb1b   3 weeks ago   7.57GB

PyTorch 1.0.1.post2 runs perfectly (on my PC).

Apologies that I cannot share the whole code for now. It starts as follows:
# Random seed setting
torch.manual_seed(16)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
Thank you for reporting the issue, but without a snippet to reproduce the error you're seeing I don't know if we'll be able to fix your issue. While you're seeing the same error, the cause may be different.
I will report back when I find an alternative way to solve the problem.

[updates]
@mruberry After changing the docker image from 19.03 to 19.01, which contains PyTorch 1.0.0a0+056cfaf, the code works fine. Cheers.

[summary]
- 1.0.0a0+056cfaf, used via NGC image 19.01: worked.
- 1.0.1.post2, installed via conda: worked.
- 1.1.0a0+be364ac, used via NGC image 19.03: failed.

It would be a pleasure if this could help you work on the fix.
We believe we have identified the root cause of this issue and are working on a fix in cuDNN.
Hey @mruberry, any updates so far?
We are testing a fix that we expect will ship in a future version of cuDNN. Unfortunately I cannot be more specific than that at the moment.
In PyTorch 1.1, the code works well.
thanks for reporting @FingerRec
@FingerRec glad to hear!
I'm still able to reproduce what @FabianIsensee posted (using his snippet) on pytorch 1.1.0a0+9eb0f43 (NGC 19.04).
This NGC image only uses cuDNN 7.5.0; not sure if 7.5.1 or 7.6 would fix it.
@ksarma please use official pytorch 1.1 conda/pip packages or image from dockerhub/pytorch.
@ngimel Apologies, was using what I thought was the latest pytorch image from NGC, but it turns out there was a new one yesterday (19.05) and the issue seems to be fixed :) This image has cuDNN 7.6.0 and pytorch 1.1.0a0+828a6a3
It could also be a compatibility problem between torch and cuDNN; this could help: https://github.com/SwinTransformer/Swin-Transformer-Semantic-Segmentation/issues/12
What is the reason for this error?
Still happening on torch==1.9.0+cu111 and cuda 11.4.
Yes, still broken, at least for me: PyTorch 1.9, CUDA 11, and a Titan X on Windows 10. benchmark = False works reliably with no problem. benchmark = True works for a while (at double the speed of False, which is great), but then randomly breaks with a CUDA error for no apparent reason. The fact that it works very well for a while shows it can work (with a 2x speed-up), but clearly there is a bug somewhere that has been in there for at least two years now.
I'm running into the same issue using PyTorch 1.9.1, CUDA 10.2 and CuDNN 7.6.5 on Quadro RTX 8000.
With the latest PyTorch version (1.10.1+cu111, cuDNN 8.0.5), this still happens on both V100 and A100 GPUs. It works well on a 3090. Setting torch.backends.cudnn.benchmark = False is a workaround.
Please provide a self-contained reproducible script triggering the problem. PyTorch 1.10.1 comes with cuDNN 8.2, so I'm not sure why you are listing cuDNN 8.0.5.
Please also have a look at my response on this problem in a related thread https://github.com/pytorch/pytorch/issues/45769#issuecomment-936316009
It doesn't have a self-contained reproducible script. Without one, the bug is not actionable for us.
@ngimel Thanks for your reply. Regarding the cuDNN version, I used https://anaconda.org/pytorch/pytorch/1.10.1/download/linux-64/pytorch-1.10.1-py3.8_cuda11.1_cudnn8.0.5_0.tar.bz2. It seems that PyTorch 1.10.1 also comes with cuDNN 8.0.5.
I encounter this issue when using this repo. The related issue is https://github.com/lulutang0608/Point-BERT/issues/14#issuecomment-1003677580.
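For reference, this is a quick way to check at runtime which CUDA and cuDNN build the installed torch actually uses (cuDNN is reported as an integer):

import torch

print(torch.__version__)               # e.g. 1.10.1
print(torch.version.cuda)              # CUDA version torch was built against, e.g. 11.1
print(torch.backends.cudnn.version())  # cuDNN as an integer, e.g. 8005 for 8.0.5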
@ShoufaChen Hi, I am running some experiments to test the K80 GPU and got the same error as you posted here: https://github.com/lulutang0608/Point-BERT/issues/14#issuecomment-1003677580. Did you solve this error? Thanks.
My workaround is to run inference in a loop: when the error comes, run the program again and resume until it finishes.
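A rough sketch of that, where run_inference.py stands in for your own script and the script is assumed to skip work it has already finished (e.g. by checking for existing output files):

import subprocess
import sys

# Sketch of the "run again until it finishes" workaround: relaunch the inference
# script in a fresh process whenever it dies, until it exits cleanly.
while True:
    ret = subprocess.call([sys.executable, "run_inference.py"])  # placeholder script name
    if ret == 0:
        break
    print("inference crashed with exit code %d, restarting..." % ret)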
@Rui-Ren I am sorry that I didn't fix the problem. I just changed the pytorch version.
Ok, thanks, I will add some retries there and see how it goes.
Thank you! I am using PyTorch 1.11.0 and will try a different PyTorch version.
Hello, I am facing this problem too, and it is very strange.
It occurs when my code runs on an A100 with a specific batch size (2) and 4-GPU training. The code runs well on RTX 6000 and V100. It also works well if I use a different batch size (1, 3, 4) or a different number of GPUs (1, 2, 3) on the A100.
Here is the error shown:

runner.outputs['loss'].backward()
  File "/cluster/apps/nss/gcc-6.3.0/python_gpu/3.8.5/torch/tensor.py", line 221, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/cluster/apps/nss/gcc-6.3.0/python_gpu/3.8.5/torch/autograd/__init__.py", line 130, in backward
    Variable._execution_engine.run_backward(
RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED
Setting torch.backends.cudnn.benchmark = False works for me as well.
Hi, I'm running PyTorch on an encoder/decoder architecture and am having a problem with cuDNN.
If I include the cuDNN line in my Python code, then I receive a CUDNN_STATUS_INTERNAL_ERROR error.
The full stack trace is listed here.
If I comment out the cuDNN line, I can run the code without any problems.
My system configuration is listed below.
Additional context
I've tried rm -rf ~/.nv, rebuilt CUDA/cuDNN, and reinstalled PyTorch, but still cannot get it to work. Thanks for your help!