jaywalnut310 / vits

VITS: Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech
MIT License

Multi speaker training error #58

Open kumdori88 opened 2 years ago

kumdori88 commented 2 years ago

Hi, I am trying to train a multi-speaker model, but when I run "train_ms.py" I get the following error:

[INFO] {'train': {'log_interval': 200, 'eval_interval': 1000, 'seed': 1234, 'epochs': 10000, 'learning_rate': 0.0002, 'betas': [0.8, 0.99], 'eps': 1e-09, 'batch_size': 32, 'fp16_run': True, 'lr_decay': 0.999875, 'segment_size': 8192, 'init_lr_ratio': 1, 'warmup_epochs': 0, 'c_mel': 45, 'c_kl': 1.0}, 'data': {'training_files': 'filelists/korean_audio_text_train_filelist_suffle.txt.cleaned', 'validation_files': 'filelists/korean_audio_text_val_filelist_suffle.txt.cleaned', 'text_cleaners': ['korean_cleaners'], 'max_wav_value': 32768.0, 'sampling_rate': 22050, 'filter_length': 1024, 'hop_length': 256, 'win_length': 1024, 'n_mel_channels': 80, 'mel_fmin': 0.0, 'mel_fmax': None, 'add_blank': True, 'n_speakers': 12, 'cleaned_text': True}, 'model': {'inter_channels': 192, 'hidden_channels': 192, 'filter_channels': 768, 'n_heads': 2, 'n_layers': 6, 'kernel_size': 3, 'p_dropout': 0.1, 'resblock': '1', 'resblock_kernel_sizes': [3, 7, 11], 'resblock_dilation_sizes': [[1, 3, 5], [1, 3, 5], [1, 3, 5]], 'upsample_rates': [8, 8, 2, 2], 'upsample_initial_channel': 512, 'upsample_kernel_sizes': [16, 16, 4, 4], 'n_layers_q': 3, 'use_spectral_norm': False, 'gin_channels': 256}, 'model_dir': './logs/korean_base'}
[WARNING] /data/tts/vits_modify_ver2 is not a git repository, therefore hash value comparison will be ignored.
./logs/korean_base/G_0.pth
./logs/korean_base/G_0.pth
THCudaCheck FAIL file=/pytorch/aten/src/THC/THCCachingHostAllocator.cpp line=278 error=710 : device-side assert triggered
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [4,0,0], thread: [32,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [4,0,0], thread: [33,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [4,0,0], thread: [34,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [4,0,0], thread: [35,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [4,0,0], thread: [36,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [4,0,0], thread: [37,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [4,0,0], thread: [38,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [4,0,0], thread: [39,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [4,0,0], thread: [40,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [4,0,0], thread: [41,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [4,0,0], thread: [42,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [4,0,0], thread: [43,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [4,0,0], thread: [44,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [4,0,0], thread: [45,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [4,0,0], thread: [46,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [4,0,0], thread: [47,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [4,0,0], thread: [48,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [4,0,0], thread: [49,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [4,0,0], thread: [50,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [4,0,0], thread: [51,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [4,0,0], thread: [52,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [4,0,0], thread: [53,0,0] Assertion srcIndex < srcSelectDimSize failed.
(skip)

/opt/conda/lib/python3.7/site-packages/torch/functional.py:516: UserWarning: stft will require the return_complex parameter be explicitly specified in a future PyTorch release. Use return_complex=False to preserve the current behavior or return_complex=True to return a complex output. (Triggered internally at /pytorch/aten/src/ATen/native/SpectralOps.cpp:653.)
  normalized, onesided, return_complex)
/opt/conda/lib/python3.7/site-packages/torch/functional.py:516: UserWarning: The function torch.rfft is deprecated and will be removed in a future PyTorch release. Use the new torch.fft module functions, instead, by importing torch.fft and calling torch.fft.fft or torch.fft.rfft. (Triggered internally at /pytorch/aten/src/ATen/native/SpectralOps.cpp:590.)
  normalized, onesided, return_complex)
terminate called after throwing an instance of 'std::runtime_error'
  what(): NCCL error in: /pytorch/torch/lib/c10d/../c10d/NCCLUtils.hpp:136, unhandled cuda error, NCCL version 2.7.8
Traceback (most recent call last):
  File "train_ms.py", line 294, in main()
  File "train_ms.py", line 50, in main
    mp.spawn(run, nprocs=n_gpus, args=(n_gpus, hps,))
  File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 199, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
    while not context.join():
  File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 118, in join
    raise Exception(msg)
Exception:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, args)
  File "/data/tts/vits_modify_ver2/train_ms.py", line 118, in run
    train_and_evaluate(rank, epoch, hps, [net_g, net_d], [optim_g, optim_d], [scheduler_g, scheduler_d], scaler, [train_loader, eval_loader], logger, [writer, writer_eval])
  File "/data/tts/vits_modify_ver2/train_ms.py", line 146, in train_and_evaluate
    (z, z_p, m_p, logs_p, m_q, logs_q) = net_g(x, x_lengths, spec, spec_lengths, speakers)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(input, kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 619, in forward
    output = self.module(*inputs[0], *kwargs[0])
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(input, kwargs)
  File "/data/tts/vits_modify_ver2/models.py", line 467, in forward
    z, m_q, logs_q, y_mask = self.enc_q(y, y_lengths, g=g)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, *kwargs)
  File "/data/tts/vits_modify_ver2/models.py", line 235, in forward
    x_mask = torch.unsqueeze(commons.sequence_mask(x_lengths, x.size(2)), 1).to(x.dtype)
  File "/data/tts/vits_modify_ver2/commons.py", line 125, in sequence_mask
    return x.unsqueeze(0) < length.unsqueeze(1)
  File "/opt/conda/lib/python3.7/site-packages/torch/tensor.py", line 27, in wrapped
    return f(args, **kwargs)
RuntimeError: CUDA error: device-side assert triggered

/opt/conda/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 32 leaked semaphores to clean up at shutdown len(cache))

Single-speaker training works fine, but when I add "gin_channels" and start multi-speaker training, I get this error. My setup is GPU: RTX 3090 x2, CUDA: 11.1, PyTorch: 1.7.1+cu110.

How can I solve this?

cantabile-kwok commented 2 years ago

Probably you started the speaker index with 1 instead of 0
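One way to check this before training is to scan the filelist for speaker IDs that fall outside the range the model can look up. The sketch below assumes each filelist line has the form `audio_path|speaker_id|text` and that valid IDs are `0` through `n_speakers - 1`; the helper name `find_bad_speaker_ids` and the example paths are hypothetical, not part of the repository.

```python
from collections import Counter

def find_bad_speaker_ids(lines, n_speakers, sep="|"):
    """Return a Counter of speaker IDs outside the valid range [0, n_speakers)."""
    bad = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Assumed column order: audio path, speaker ID, text.
        sid = int(line.split(sep)[1])
        if not 0 <= sid < n_speakers:
            bad[sid] += 1
    return bad

# Hypothetical filelist entries with 1-based speaker IDs (1..12).
example = [f"wavs/spk{i}_0001.wav|{i}|some text" for i in range(1, 13)]
bad = find_bad_speaker_ids(example, n_speakers=12)
print(bad)
```

With 1-based IDs and `n_speakers` set to 12, only ID 12 is flagged: IDs 1 through 11 are still inside the range, which would explain why the assertion fires only for batches that contain the last speaker.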

JohnHerry commented 1 year ago

Probably you started the speaker index with 1 instead of 0

Same problem here. Yes, speaker IDs should start at 0. Not a good program design.

daniilrobnikov commented 1 year ago

I faced the same problem; it seems that speakers must be indexed starting from 0.
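If the filelist already uses 1-based or non-contiguous speaker IDs, they can be rewritten instead of edited by hand. This is a minimal sketch, again assuming the `audio_path|speaker_id|text` line format; `remap_speaker_ids` and the example entries are hypothetical.

```python
def remap_speaker_ids(lines, sep="|"):
    """Rewrite the speaker-ID column so IDs are contiguous and start at 0."""
    # Split into at most three fields so a separator inside the text is preserved.
    rows = [line.rstrip("\n").split(sep, 2) for line in lines if line.strip()]
    # Sort the distinct original IDs numerically so the mapping is deterministic.
    originals = sorted({int(row[1]) for row in rows})
    mapping = {old: new for new, old in enumerate(originals)}
    remapped = [sep.join([row[0], str(mapping[int(row[1])]), row[2]]) for row in rows]
    return remapped, mapping

# Hypothetical entries with speaker IDs 1 and 3.
example = ["wavs/a.wav|1|first", "wavs/b.wav|3|second", "wavs/c.wav|1|third"]
new_lines, mapping = remap_speaker_ids(example)
print(mapping)
print(new_lines)
```

The same mapping should be applied to both the training and validation filelists, and `n_speakers` in the config should be at least the number of distinct speakers after remapping.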