jaywalnut310 / vits

VITS: Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech
https://jaywalnut310.github.io/vits-demo/index.html
MIT License

When I train a multi-speaker model I get RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR #107

Closed: cliuxinxin closed this issue 1 year ago

cliuxinxin commented 1 year ago

Hello:

Thank you so much for your great work.

When I trained the single-speaker model, everything was fine. But when I train a multi-speaker model, an error is displayed.

===============================================

../aten/src/ATen/native/cuda/Indexing.cu:1088: indexSelectSmallIndex: block: [0,0,0], thread: [18,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1088: indexSelectSmallIndex: block: [0,0,0], thread: [19,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
[the same assertion repeats for threads [20,0,0] through [31,0,0]]

Traceback (most recent call last):
  File "train_ms.py", line 296, in <module>
    main()
  File "train_ms.py", line 50, in main
    mp.spawn(run, nprocs=n_gpus, args=(n_gpus, hps,))
  File "/root/liuxinxin/v_env/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/root/liuxinxin/v_env/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/root/liuxinxin/v_env/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/root/liuxinxin/v_env/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/root/liuxinxin/vits/train_ms.py", line 120, in run
    train_and_evaluate(rank, epoch, hps, [net_g, net_d], [optim_g, optim_d], [scheduler_g, scheduler_d], scaler, [train_loader, eval_loader], logger, [writer, writer_eval])
  File "/root/liuxinxin/vits/train_ms.py", line 148, in train_and_evaluate
    (z, z_p, m_p, logs_p, m_q, logs_q) = net_g(x, x_lengths, spec, spec_lengths, speakers)
  File "/root/liuxinxin/v_env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/liuxinxin/v_env/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1040, in forward
    output = self._run_ddp_forward(*inputs, **kwargs)
  File "/root/liuxinxin/v_env/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1000, in _run_ddp_forward
    return module_to_run(*inputs[0], **kwargs[0])
  File "/root/liuxinxin/v_env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/liuxinxin/vits/models.py", line 467, in forward
    z, m_q, logs_q, y_mask = self.enc_q(y, y_lengths, g=g)
  File "/root/liuxinxin/v_env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/liuxinxin/vits/models.py", line 236, in forward
    x = self.pre(x) * x_mask
  File "/root/liuxinxin/v_env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/liuxinxin/v_env/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 313, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/root/liuxinxin/v_env/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 309, in _conv_forward
    return F.conv1d(input, weight, bias, self.stride,
RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR
You can try to repro this exception using the following code snippet. If that doesn't trigger the error, please include your original repro script when reporting this issue.

import torch
torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cudnn.benchmark = True
torch.backends.cudnn.deterministic = False
torch.backends.cudnn.allow_tf32 = True
data = torch.randn([1, 513, 1, 526], dtype=torch.half, device='cuda', requires_grad=True)
net = torch.nn.Conv2d(513, 192, kernel_size=[1, 1], padding=[0, 0], stride=[1, 1], dilation=[1, 1], groups=1)
net = net.cuda().half()
out = net(data)
out.backward(torch.randn_like(out))
torch.cuda.synchronize()

ConvolutionParams
    memory_format = Contiguous
    data_type = CUDNN_DATA_HALF
    padding = [0, 0, 0]
    stride = [1, 1, 0]
    dilation = [1, 1, 0]
    groups = 1
    deterministic = false
    allow_tf32 = true
input: TensorDescriptor 0xa6494bc0
    type = CUDNN_DATA_HALF
    nbDims = 4
    dimA = 1, 513, 1, 526,
    strideA = 269838, 526, 526, 1,
output: TensorDescriptor 0xa6495e50
    type = CUDNN_DATA_HALF
    nbDims = 4
    dimA = 1, 192, 1, 526,
    strideA = 100992, 526, 526, 1,
weight: FilterDescriptor 0xa648fa90
    type = CUDNN_DATA_HALF
    tensor_format = CUDNN_TENSOR_NCHW
    nbDims = 4
    dimA = 192, 513, 1, 1,
Pointer addresses:
    input: 0x7f4112800000
    output: 0x7f41129b4a00
    weight: 0x7f4112984800

======================================
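
Note: the CUDNN_STATUS_INTERNAL_ERROR is likely a secondary symptom here. The indexSelectSmallIndex assertions earlier in the log mean a device-side assert already fired, and once that happens any later CUDA call (including the cuDNN convolution) can fail. A minimal sketch (not from the original report) for surfacing the first failing kernel instead of a downstream error:

```python
# Force synchronous kernel launches so the op that actually failed is the one
# reported in the traceback. Must be set before CUDA is initialized, e.g. at
# the very top of train_ms.py or in the shell before launching training.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # import torch (and make any .cuda() calls) only after the flag is set
```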

I only have a 3090 GPU.

Any relevant advice would be greatly appreciated.

I have tried:

  1. Replacing the driver
  2. Replacing the CUDA version
  3. Replacing the PyTorch version
  4. Lowering num_workers

None of these worked.

CookiePPP commented 1 year ago

What is your n_speakers parameter set to?

https://github.com/jaywalnut310/vits/blob/1eef52ed50743f77fca9ff6773ba673497f6bf9d/configs/vctk_base.json#L32

Check that the number is high enough for your dataset.
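
For context, a minimal sketch (mine, not from the thread) of this failure mode: in the multi-speaker model, speaker IDs are row indices into an embedding table of size n_speakers, so every ID must satisfy 0 <= id < n_speakers. An ID equal to n_speakers trips exactly the indexSelect assertion shown in the log:

```python
import torch

# Hypothetical stand-in for the model's speaker embedding; n_speakers sets
# the table size and speaker IDs are zero-based row indices into it.
n_speakers = 3
emb_g = torch.nn.Embedding(n_speakers, 256).cuda()

emb_g(torch.tensor([0, 1, 2], device="cuda"))  # fine: all IDs < n_speakers

emb_g(torch.tensor([1, 2, 3], device="cuda"))  # ID 3 is out of range for a
torch.cuda.synchronize()                       # table of size 3: the device
# asserts `srcIndex < srcSelectDimSize`, and because launches are async the
# error may only surface in a later op, e.g. as a cuDNN INTERNAL_ERROR.
```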

cliuxinxin commented 1 year ago

@CookiePPP

Thank you for your valuable advice.

I have three speakers in my data.

I'm using the IDs 1, 2, and 3.

And I set n_speakers to 3, but since the IDs index a zero-based table, ID 3 needs n_speakers of at least 4.

Following your suggestion, I changed it to 5.

It started to work.
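
For anyone hitting the same assertion: a quick sanity check is to compare the largest speaker ID in your filelists against n_speakers in the config. A sketch, assuming the repo's pipe-delimited multi-speaker filelist format (path|speaker_id|text) and hypothetical file paths you should adjust:

```python
import json

# Hypothetical paths; point these at your own config and filelists.
with open("configs/vctk_base.json") as f:
    n_speakers = json.load(f)["data"]["n_speakers"]

max_sid = max(
    int(line.split("|")[1])  # second field of each line is the speaker ID
    for filelist in ("filelists/train.txt", "filelists/val.txt")
    for line in open(filelist, encoding="utf-8")
    if line.strip()
)

# IDs are zero-based indices into the speaker embedding table, so the
# largest ID must be strictly less than n_speakers.
assert max_sid < n_speakers, f"need n_speakers >= {max_sid + 1}, got {n_speakers}"
```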