tsurumeso / vocal-remover

Vocal Remover using Deep Neural Networks
MIT License

RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR #128

Closed by RuolinZheng08 1 year ago

RuolinZheng08 commented 1 year ago

I tried to run inference.py and hit RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR. I'm using Python 3.7 per #93 on a Linux machine. My GPU has 8 GB of VRAM, so I'm not sure whether this is a memory error.

Full error log:

loading model... done
loading wave source... done
stft of wave source... done
  0%|          | 0/19 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "inference.py", line 184, in <module>
    main()
  File "inference.py", line 156, in main
    y_spec, v_spec = sp.separate(X_spec)
  File "inference.py", line 77, in separate
    mask = self._separate(X_mag_pad, roi_size)
  File "inference.py", line 44, in _separate
    pred = self.model.predict_mask(X_batch)
  File "/home/user/vocal-remover/lib/nets.py", line 115, in predict_mask
    mask = self.forward(x)
  File "/home/user/vocal-remover/lib/nets.py", line 89, in forward
    h2 = self.stg2_high_band_net(h2_in)
  File "/home/user/vocal-remover/lib/nets.py", line 33, in __call__
    h = self.aspp(e5)
  File "/scratch/user/conda/envs/vocal-remover/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/user/vocal-remover/lib/layers.py", line 121, in forward
    feat1 = F.interpolate(self.conv1(x), size=(h, w), mode='bilinear', align_corners=True)
  File "/scratch/user/conda/envs/vocal-remover/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/scratch/user/conda/envs/vocal-remover/lib/python3.7/site-packages/torch/nn/modules/container.py", line 119, in forward
    input = module(input)
  File "/home/user/vocal-remover/lib/layers.py", line 26, in __call__
    return self.conv(x)
  File "/scratch/user/conda/envs/vocal-remover/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/scratch/user/conda/envs/vocal-remover/lib/python3.7/site-packages/torch/nn/modules/container.py", line 119, in forward
    input = module(input)
  File "/scratch/user/conda/envs/vocal-remover/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/scratch/user/conda/envs/vocal-remover/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 399, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/scratch/user/conda/envs/vocal-remover/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 396, in _conv_forward
    self.padding, self.dilation, self.groups)
RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR
You can try to repro this exception using the following code snippet. If that doesn't trigger the error, please include your original repro script when reporting this issue.

import torch
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = False
torch.backends.cudnn.allow_tf32 = True
data = torch.randn([4, 128, 1, 16], dtype=torch.float, device='cuda', requires_grad=True)
net = torch.nn.Conv2d(128, 128, kernel_size=[1, 1], padding=[0, 0], stride=[1, 1], dilation=[1, 1], groups=1)
net = net.cuda().float()
out = net(data)
out.backward(torch.randn_like(out))
torch.cuda.synchronize()

ConvolutionParams 
    data_type = CUDNN_DATA_FLOAT
    padding = [0, 0, 0]
    stride = [1, 1, 0]
    dilation = [1, 1, 0]
    groups = 1
    deterministic = false
    allow_tf32 = true
input: TensorDescriptor 0x641b5d0
    type = CUDNN_DATA_FLOAT
    nbDims = 4
    dimA = 4, 128, 1, 16, 
    strideA = 2048, 16, 16, 1, 
output: TensorDescriptor 0x641b560
    type = CUDNN_DATA_FLOAT
    nbDims = 4
    dimA = 4, 128, 1, 16, 
    strideA = 2048, 16, 16, 1, 
weight: FilterDescriptor 0x597c900
    type = CUDNN_DATA_FLOAT
    tensor_format = CUDNN_TENSOR_NCHW
    nbDims = 4
    dimA = 128, 128, 1, 1, 
Pointer addresses: 
    input: 0x7f00593e8e00
    output: 0x7f00593f0e00
    weight: 0x7f0058d18000
Forward algorithm: 5

Per the suggestion in the error log, I ran this snippet in Python and got no error, so my torch installation should be all right.

import torch
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = False
torch.backends.cudnn.allow_tf32 = True
data = torch.randn([4, 128, 1, 16], dtype=torch.float, device='cuda', requires_grad=True)
net = torch.nn.Conv2d(128, 128, kernel_size=[1, 1], padding=[0, 0], stride=[1, 1], dilation=[1, 1], groups=1)
net = net.cuda().float()
out = net(data)
out.backward(torch.randn_like(out))
torch.cuda.synchronize()

Is there anything I'm missing? Thanks!

RuolinZheng08 commented 1 year ago

The problem seems to be solved after installing the latest version of torch supported by my specific CUDA version.
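
For anyone hitting the same error, here is a minimal sketch for checking which versions are in play before reinstalling torch. It is not part of vocal-remover; the function name report_torch_environment is hypothetical, and it assumes only that torch may or may not be installed in the current environment. It prints the torch version, the CUDA version that torch build was compiled against, and the cuDNN version torch loads.

```python
import importlib


def report_torch_environment():
    """Collect the torch, CUDA and cuDNN versions visible to this interpreter."""
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        # torch is not installed in this environment; nothing else to report.
        return {"torch": None}

    info = {
        "torch": torch.__version__,
        # CUDA version this torch build was compiled against
        # (None for CPU-only builds).
        "cuda_build": torch.version.cuda,
        "cuda_available": torch.cuda.is_available(),
        "cudnn": None,
        "device": None,
    }
    if info["cuda_available"]:
        # cuDNN version as an integer, or None if cuDNN is unavailable.
        info["cudnn"] = torch.backends.cudnn.version()
        info["device"] = torch.cuda.get_device_name(0)
    return info


if __name__ == "__main__":
    for key, value in report_torch_environment().items():
        print(f"{key}: {value}")
```

Comparing the reported cuda_build with the CUDA version your driver supports (for example, as shown by nvidia-smi) may help in picking a matching torch build, which is the kind of mismatch the fix above suggests.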