[Help]: RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR

mysxs commented 2 months ago

Problem Overview

Hello, I modified the Amphion code to add a new feature and a corresponding encoder. The input feature dimension is 1583 and the output feature dimension is 384. The encoder is as follows:

import torch
import torch.nn as nn

class EmotionEncoder(nn.Module):
    def __init__(self, cfg):
        super(EmotionEncoder, self).__init__()
        self.input_dim = cfg.input_emotion_feature_dim  # 1583
        self.output_dim = cfg.output_emotion_feature_dim  # 384
        # nn.Embedding(num_embeddings, embedding_dim): valid ids are 0 .. num_embeddings - 1
        self.embedding = nn.Embedding(self.input_dim, self.output_dim, padding_idx=None)

    def forward(self, x):
        # Embedding lookups need integer (long) indices
        x = x.to(torch.long)
        embedded = self.embedding(x)
        print('embedded.shape:', embedded.shape)  # torch.Size([64, 136, 384])
        return embedded

Training then fails with the error below. From what I checked, the problem seems to be with the ids passed to the embedding. How can I solve it?

../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [609,0,0], thread: [0,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [609,0,0], thread: [1,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [609,0,0], thread: [2,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [609,0,0], thread: [3,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [609,0,0], thread: [4,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [609,0,0], thread: [5,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [609,0,0], thread: [6,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [609,0,0], thread: [7,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [609,0,0], thread: [8,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [609,0,0], thread: [9,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [609,0,0], thread: [10,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [609,0,0], thread: [11,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [609,0,0], thread: [12,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [609,0,0], thread: [13,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [609,0,0], thread: [14,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [609,0,0], thread: [15,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [609,0,0], thread: [16,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [609,0,0], thread: [17,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [609,0,0], thread: [18,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [609,0,0], thread: [19,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [609,0,0], thread: [20,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [609,0,0], thread: [21,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [609,0,0], thread: [22,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [609,0,0], thread: [23,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [609,0,0], thread: [24,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [609,0,0], thread: [25,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [609,0,0], thread: [26,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [609,0,0], thread: [27,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [609,0,0], thread: [28,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [609,0,0], thread: [29,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [609,0,0], thread: [30,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [609,0,0], thread: [31,0,0] Assertion `srcIndex < srcSelectDimSize` failed.

Traceback (most recent call last):
  File "/home/sxs/workspace/code/Amphion/bins/svc/train.py", line 120, in <module>
    main()
  File "/home/sxs/workspace/code/Amphion/bins/svc/train.py", line 116, in main
    trainer.train_loop()
  File "/home/sxs/workspace/code/Amphion/models/base/new_trainer.py", line 257, in train_loop
    train_loss = self._train_epoch()
  File "/home/sxs/workspace/code/Amphion/models/base/new_trainer.py", line 373, in _train_epoch
    loss = self._train_step(batch)
  File "/home/sxs/workspace/code/Amphion/models/base/new_trainer.py", line 429, in _train_step
    return self._forward_step(batch)
  File "/home/sxs/workspace/code/Amphion/models/svc/diffusion/diffusion_trainer.py", line 132, in _forward_step
    y_pred = self.acoustic_mapper(noisy_mel, timesteps, conditioner)
  File "/home/sxs/anaconda3/envs/amphion/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/sxs/workspace/code/Amphion/models/svc/diffusion/diffusion_wrapper.py", line 144, in forward
    h = self.neural_network(x, t, c)
  File "/home/sxs/anaconda3/envs/amphion/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/sxs/workspace/code/Amphion/modules/diffusion/bidilconv/bidilated_conv.py", line 107, in forward
    h = self.input(x)
  File "/home/sxs/anaconda3/envs/amphion/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/sxs/anaconda3/envs/amphion/lib/python3.9/site-packages/torch/nn/modules/container.py", line 217, in forward
    input = module(input)
  File "/home/sxs/anaconda3/envs/amphion/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/sxs/anaconda3/envs/amphion/lib/python3.9/site-packages/torch/nn/modules/conv.py", line 313, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/home/sxs/anaconda3/envs/amphion/lib/python3.9/site-packages/torch/nn/modules/conv.py", line 309, in _conv_forward
    return F.conv1d(input, weight, bias, self.stride,
RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR

Screenshots

Amphion/modules/diffusion/bidilconv/bidilated_conv.py：

RMSnow commented 2 months ago

Hi @mysxs, thanks for considering Amphion as your codebase!

I have also run into RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR. This error is usually caused by an invalid input value to an Embedding layer or to CrossEntropyLoss (see here). I recommend checking the range of x in your EmotionEncoder to see whether any vocab ids fall outside [0, 1582].

By the way, for nn.Embedding, the first parameter should be the vocabulary size rather than the input dimension. Is 1583 (the value currently in your code) really your vocabulary size?
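
As a rough illustration of the range check suggested here, the sketch below validates the ids before they reach the embedding. The helper name check_embedding_indices and the tensor shapes are assumptions made for this example; they are not part of Amphion.

```python
import torch
import torch.nn as nn


def check_embedding_indices(x: torch.Tensor, embedding: nn.Embedding) -> None:
    """Hypothetical helper: raise a readable error if any id is out of range."""
    x = x.to(torch.long)
    lo, hi = int(x.min()), int(x.max())
    if lo < 0 or hi >= embedding.num_embeddings:
        raise ValueError(
            f"index range [{lo}, {hi}] is outside "
            f"[0, {embedding.num_embeddings - 1}]"
        )


embedding = nn.Embedding(1583, 384)       # valid ids: 0 .. 1582
good = torch.randint(0, 1583, (64, 136))  # assumed batch shape, all ids valid
check_embedding_indices(good, embedding)  # passes silently

bad = good.clone()
bad[0, 0] = 1583                          # one past the last valid id
try:
    check_embedding_indices(bad, embedding)
    message = ""
except ValueError as err:
    message = str(err)
    print(message)
```

Calling such a check at the top of forward() while debugging should surface the offending ids as a normal Python exception instead of a device-side assertion.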

The best way to debug a cuDNN error is to rerun your code on the CPU. On the GPU, the traceback is not reliable, so the actual location of the bug cannot be found accurately.
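
A minimal sketch of the CPU approach, assuming the same embedding sizes as in the question (the tensor shape is made up for this example): on the CPU, an out-of-range id raises an IndexError at the embedding call itself, so the traceback points at the real culprit rather than at a later convolution.

```python
import torch
import torch.nn as nn

embedding = nn.Embedding(1583, 384)             # modules live on the CPU by default
x = torch.full((2, 3), 1583, dtype=torch.long)  # every id is one past the valid range

try:
    embedding(x)
    raised = False
except IndexError as err:
    raised = True
    print("CPU reports:", err)
```

If moving the whole training run to the CPU is impractical, setting the environment variable CUDA_LAUNCH_BLOCKING=1 before starting the script is another commonly used option: it makes CUDA kernel launches synchronous, so the GPU traceback is more likely to point at the failing operation.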

mysxs commented 2 months ago

Thank you, the problem is solved. You all work really hard and reply so promptly. Thanks for your effort!

open-mmlab / Amphion

[Help]: RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR #193