p0p4k / vits2_pytorch

unofficial vits2-TTS implementation in pytorch
https://arxiv.org/abs/2307.16430
MIT License

Training code error #9

Closed · WendongGan closed this issue 1 year ago

WendongGan commented 1 year ago

Hi p0p4k, thanks for sharing the code. It's a great project. I have been following it for a while and have tried it many times, but when I run the training code I still get the following error. I suspect some parameters are being passed incorrectly; the actual parameters do not all come from vits2_ljs_base.json. I tried to debug and fix it, but it didn't work. Looking forward to your review and reply.

When I run: python train.py -c configs/vits2_ljs_base.json -m ljs_base

I get the following error:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/data7/ganwendong/vits2_pytorch/test4-0817/vits2_pytorch/train.py", line 157, in run
    train_and_evaluate(rank, epoch, hps, [net_g, net_d], [optim_g, optim_d], [scheduler_g, scheduler_d], scaler, [train_loader, eval_loader], logger, [writer, writer_eval])
  File "/data7/ganwendong/vits2_pytorch/test4-0817/vits2_pytorch/train.py", line 191, in train_and_evaluate
    (z, z_p, m_p, logs_p, m_q, logs_q) = net_g(x, x_lengths, spec, spec_lengths)
  File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1040, in forward
    output = self._run_ddp_forward(*inputs, **kwargs)
  File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1000, in _run_ddp_forward
    return module_to_run(*inputs[0], **kwargs[0])
  File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/data7/ganwendong/vits2_pytorch/test4-0817/vits2_pytorch/models.py", line 748, in forward
    z, m_q, logs_q, y_mask = self.enc_q(y, y_lengths, g=g)
  File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/data7/ganwendong/vits2_pytorch/test4-0817/vits2_pytorch/models.py", line 495, in forward
    x = self.pre(x) * x_mask
  File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/nn/modules/conv.py", line 313, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/nn/modules/conv.py", line 309, in _conv_forward
    return F.conv1d(input, weight, bias, self.stride,
RuntimeError: Given groups=1, weight of size [192, 80, 1], expected input[32, 513, 298] to have 80 channels, but got 513 channels instead

WendongGan commented 1 year ago

If any of you have solved this problem, I look forward to sharing your solutions. Thank you very much!

p0p4k commented 1 year ago

Hello, I made a really silly mistake. Please try the latest patch and let me know. In train.py, I was supposed to modify hps.data.use_mel_posterior_encoder based on hps.model.use_mel_posterior_encoder before passing hps.data to the dataloader. However, I loaded the dataloader first, which generates linear spectrograms with 513 channels; the model parameters then build a model that expects mel-spectrograms with 80 channels; and only after that did I modify the hps.data params, which were never used since the dataloader was already constructed. I fixed the order and also added an additional flag in hps.data just to be sure for now. I will do a cleanup later to avoid this kind of model/data parameter mismatch (minor stuff). Thanks.
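
(For reference, a minimal sketch of the intended order described above, assuming this fork keeps the original VITS layout with utils.get_hparams() and data_utils.TextAudioLoader; the exact code in train.py may differ:)

```python
import utils
from data_utils import TextAudioLoader

hps = utils.get_hparams()

# Propagate the model-level flag into the data config *before* the dataset is
# built, so the loader produces 80-channel mel-spectrograms when the posterior
# encoder expects mel input (instead of 513-channel linear spectrograms).
if getattr(hps.model, "use_mel_posterior_encoder", False):
    hps.data.use_mel_posterior_encoder = True

# Only now construct the dataset from hps.data.
train_dataset = TextAudioLoader(hps.data.training_files, hps.data)
```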

WendongGan commented 1 year ago

Thank you very much. I will try this latest code and report back the results.

p0p4k commented 1 year ago

@UESTCgan ~wait i think there is minor bug yet. Fixing it now.~ Fixed multi-speaker loader as well. Should be good to go.

WendongGan commented 1 year ago

I tried the latest code, pushed just an hour ago (ca7e41d).

When I run: python train.py -c configs/vits2_ljs_base.json -m ljs_base

I get the error:

Traceback (most recent call last):
  File "/data7/ganwendong/vits2_pytorch/test4-0817/vits2_pytorch/train.py", line 338, in <module>
    main()
  File "/data7/ganwendong/vits2_pytorch/test4-0817/vits2_pytorch/train.py", line 51, in main
    mp.spawn(run, nprocs=n_gpus, args=(n_gpus, hps,))
  File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/data7/ganwendong/vits2_pytorch/test4-0817/vits2_pytorch/train.py", line 158, in run
    train_and_evaluate(rank, epoch, hps, [net_g, net_d], [optim_g, optim_d], [scheduler_g, scheduler_d], scaler, [train_loader, eval_loader], logger, [writer, writer_eval])
  File "/data7/ganwendong/vits2_pytorch/test4-0817/vits2_pytorch/train.py", line 194, in train_and_evaluate
    mel = spec_to_mel_torch(
  File "/data7/ganwendong/vits2_pytorch/test4-0817/vits2_pytorch/mel_processing.py", line 85, in spec_to_mel_torch
    spec = torch.matmul(mel_basis[fmax_dtype_device], spec)
RuntimeError: mat1 and mat2 shapes cannot be multiplied (9536x80 and 513x80)

p0p4k commented 1 year ago

Haha, of course. I am making so many silly mistakes. Fixing it right now.

p0p4k commented 1 year ago

Fixed. @UESTCgan Thanks a lot for letting me know about the errors. This feedback is really helpful! Let's get the model working ASAP!

p0p4k commented 1 year ago

Explanation of the bug: after generating the waveform output (wav_pred), the model converts wav_pred to a mel-spectrogram to compare it with the mel-spectrogram of wav_real. In VITS-1 that reference mel-spectrogram is computed from the linear spectrogram used as the model input. In VITS-2, however, we feed mel-spectrograms directly, so the bug came from trying to convert a mel-spectrogram into a mel-spectrogram. We must instead use the mel-spectrogram that was input to the model and compare it with wav_pred's mel-spectrogram.
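
(A minimal sketch of what that fixed branch in train_and_evaluate could look like, assuming the usual VITS helpers in mel_processing.py and the use_mel_posterior_encoder flag; slicing and other details are omitted, and the actual code may differ:)

```python
from mel_processing import spec_to_mel_torch, mel_spectrogram_torch

if hps.data.use_mel_posterior_encoder:
    # VITS-2 style: `spec` from the dataloader is already a mel-spectrogram.
    mel = spec
else:
    # VITS-1 style: `spec` is a linear spectrogram, so convert it to mel first.
    mel = spec_to_mel_torch(
        spec,
        hps.data.filter_length,
        hps.data.n_mel_channels,
        hps.data.sampling_rate,
        hps.data.mel_fmin,
        hps.data.mel_fmax,
    )

# The generated waveform is then converted to mel and compared against `mel`.
y_hat_mel = mel_spectrogram_torch(
    y_hat.squeeze(1),
    hps.data.filter_length,
    hps.data.n_mel_channels,
    hps.data.sampling_rate,
    hps.data.hop_length,
    hps.data.win_length,
    hps.data.mel_fmin,
    hps.data.mel_fmax,
)
```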

WendongGan commented 1 year ago

I tried the latest code (ee1c94d).

When I run: python train.py -c configs/vits2_ljs_base.json -m ljs_base

I get the error:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/data7/ganwendong/vits2_pytorch/test4-0817/vits2_pytorch/train.py", line 158, in run
    train_and_evaluate(rank, epoch, hps, [net_g, net_d], [optim_g, optim_d], [scheduler_g, scheduler_d], scaler, [train_loader, eval_loader], logger, [writer, writer_eval])
  File "/data7/ganwendong/vits2_pytorch/test4-0817/vits2_pytorch/train.py", line 192, in train_and_evaluate
    (z, z_p, m_p, logs_p, m_q, logs_q) = net_g(x, x_lengths, spec, spec_lengths)
  File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1026, in forward
    if torch.is_grad_enabled() and self.reducer._rebuild_buckets():
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel, and by making sure all forward function outputs participate in calculating loss. If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's forward function. Please include the loss function and the structure of the return value of forward of your module when reporting this issue (e.g. list, dict, iterable).
Parameter indices which did not receive grad for rank 0: 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 776 777 778 779 ...
In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this error

WendongGan commented 1 year ago

Currently I am using pytorch==1.13. Do I have to use Pytorch version 2.0?

p0p4k commented 1 year ago

Can you try "use_noise_scaled_mas=False" in the config and run the training? Thanks.

WendongGan commented 1 year ago

When I set "use_noise_scaled_mas=False" in the config, I get the error:

Traceback (most recent call last):
  File "/data7/ganwendong/vits2_pytorch/test4-0817/vits2_pytorch/train.py", line 344, in <module>
    main()
  File "/data7/ganwendong/vits2_pytorch/test4-0817/vits2_pytorch/train.py", line 51, in main
    mp.spawn(run, nprocs=n_gpus, args=(n_gpus, hps,))
  File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/data7/ganwendong/vits2_pytorch/test4-0817/vits2_pytorch/train.py", line 126, in run
    mas_noise_scale_initial = mas_noise_scale_initial,
UnboundLocalError: local variable 'mas_noise_scale_initial' referenced before assignment

p0p4k commented 1 year ago

Updated. Thanks.

p0p4k commented 1 year ago

I am downloading the data and will try to train one step to check the previous error regarding the loss.

WendongGan commented 1 year ago

> Currently I am using pytorch==1.13. Do I have to use Pytorch version 2.0?

This issue does not seem to be related to the PyTorch version. I still have this problem with PyTorch 2.0.

p0p4k commented 1 year ago

But after the latest update and use_noise_scaled_mas=False ?

WendongGan commented 1 year ago

> But after the latest update and use_noise_scaled_mas=False ?

Traceback (most recent call last):
  File "/data7/ganwendong/vits2_pytorch/test4-0817/vits2_pytorch/train.py", line 345, in <module>
    main()
  File "/data7/ganwendong/vits2_pytorch/test4-0817/vits2_pytorch/train.py", line 51, in main
    mp.spawn(run, nprocs=n_gpus, args=(n_gpus, hps,))
  File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/data7/ganwendong/vits2_pytorch/test4-0817/vits2_pytorch/train.py", line 128, in run
    noise_scale_delta = noise_scale_delta,
UnboundLocalError: local variable 'noise_scale_delta' referenced before assignment
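
(Both UnboundLocalError reports point at the same pattern: the MAS noise-scale variables are only assigned inside the use_noise_scaled_mas branch, so they do not exist when the flag is off. A minimal sketch of the kind of defaulting that avoids it, using the names from the tracebacks; the values shown are assumptions and the actual fix in train.py may differ:)

```python
# Make sure both variables exist before they are passed to the model,
# whether or not noise-scaled MAS is enabled in the config.
if getattr(hps.model, "use_noise_scaled_mas", False):
    mas_noise_scale_initial = 0.01  # assumed initial noise scale
    noise_scale_delta = 2e-6        # assumed per-step decay
else:
    mas_noise_scale_initial = 0.0
    noise_scale_delta = 0.0
```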

p0p4k commented 1 year ago

Check again. Thanks.

WendongGan commented 1 year ago

> Check again. Thanks.

When I try 30adb2d, I get the error:

Traceback (most recent call last):
  File "/data7/ganwendong/vits2_pytorch/test4-0817/vits2_pytorch/train.py", line 346, in <module>
    main()
  File "/data7/ganwendong/vits2_pytorch/test4-0817/vits2_pytorch/train.py", line 51, in main
    mp.spawn(run, nprocs=n_gpus, args=(n_gpus, hps,))
  File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/data7/ganwendong/vits2_pytorch/test4-0817/vits2_pytorch/train.py", line 160, in run
    train_and_evaluate(rank, epoch, hps, [net_g, net_d], [optim_g, optim_d], [scheduler_g, scheduler_d], scaler, [train_loader, eval_loader], logger, [writer, writer_eval])
  File "/data7/ganwendong/vits2_pytorch/test4-0817/vits2_pytorch/train.py", line 194, in train_and_evaluate
    (z, z_p, m_p, logs_p, m_q, logs_q) = net_g(x, x_lengths, spec, spec_lengths)
  File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1026, in forward
    if torch.is_grad_enabled() and self.reducer._rebuild_buckets():
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel, and by making sure all forward function outputs participate in calculating loss. If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's forward function. Please include the loss function and the structure of the return value of forward of your module when reporting this issue (e.g. list, dict, iterable).
Parameter indices which did not receive grad for rank 0: 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 776 777 778 779 ...
In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this error

p0p4k commented 1 year ago

Okay. Then give me 2 hours, I will fix the bug and let you know. Thanks.

WendongGan commented 1 year ago

> Okay. Then give me 2 hours, I will fix the bug and let you know. Thanks.

Thank you very much! Looking forward to your update.

p0p4k commented 1 year ago

I added the find_unused_parameters flag. Tell me what the error says now; that can help me update the code. Thanks.
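
(For reference, the change the DDP error message asks for is just an extra keyword argument when wrapping the models; a minimal sketch, assuming the usual net_g/net_d DDP wrapping in train.py:)

```python
from torch.nn.parallel import DistributedDataParallel as DDP

# Allow DDP to tolerate parameters that receive no gradient in a given step
# (e.g. sub-modules whose outputs do not contribute to the loss).
net_g = DDP(net_g, device_ids=[rank], find_unused_parameters=True)
net_d = DDP(net_d, device_ids=[rank], find_unused_parameters=True)
```

Note that this is more of a workaround than a root-cause fix: the underlying issue, per the error message, is that some parameters are not used in producing the loss on that iteration.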

WendongGan commented 1 year ago

After training for some steps:

[screenshot]

I get the error:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/data7/ganwendong/vits2_pytorch/test4-0817/vits2_pytorch/train.py", line 161, in run
    train_and_evaluate(rank, epoch, hps, [net_g, net_d], [optim_g, optim_d], [scheduler_g, scheduler_d], scaler, [train_loader, eval_loader], logger, [writer, writer_eval])
  File "/data7/ganwendong/vits2_pytorch/test4-0817/vits2_pytorch/train.py", line 244, in train_and_evaluate
    scaler.scale(loss_gen_all).backward()
  File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/_tensor.py", line 488, in backward
    torch.autograd.backward(
  File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/autograd/__init__.py", line 197, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR
You can try to repro this exception using the following code snippet. If that doesn't trigger the error, please include your original repro script when reporting this issue.

import torch
torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cudnn.benchmark = True
torch.backends.cudnn.deterministic = False
torch.backends.cudnn.allow_tf32 = True
data = torch.randn([16, 192, 1, 123], dtype=torch.half, device='cuda', requires_grad=True)
net = torch.nn.Conv2d(192, 768, kernel_size=[1, 3], padding=[0, 0], stride=[1, 1], dilation=[1, 1], groups=1)
net = net.cuda().half()
out = net(data)
out.backward(torch.randn_like(out))
torch.cuda.synchronize()

ConvolutionParams
  memory_format = Contiguous
  data_type = CUDNN_DATA_HALF
  padding = [0, 0, 0]
  stride = [1, 1, 0]
  dilation = [1, 1, 0]
  groups = 1
  deterministic = false
  allow_tf32 = true
input: TensorDescriptor 0x562a218aa350
  type = CUDNN_DATA_HALF
  nbDims = 4
  dimA = 16, 192, 1, 123,
  strideA = 23616, 123, 123, 1,
output: TensorDescriptor 0x7f544aefb2e0
  type = CUDNN_DATA_HALF
  nbDims = 4
  dimA = 16, 768, 1, 121,
  strideA = 92928, 121, 121, 1,
weight: FilterDescriptor 0x7f544aefaa60
  type = CUDNN_DATA_HALF
  tensor_format = CUDNN_TENSOR_NCHW
  nbDims = 4
  dimA = 768, 192, 1, 3,
Pointer addresses:
  input: 0x7f546e400000
  output: 0x7f5685917000
  weight: 0x7f54676b5800

WendongGan commented 1 year ago

I am trying PyTorch 2.0; maybe it will work.

[screenshot]

WendongGan commented 1 year ago

I'm training on LJSpeech. If I have some results tomorrow, I'll report back. Thanks again for the update!

p0p4k commented 1 year ago

Hello, does the training work well now? And can you post me your config file? Do you have discord? (add me on discord : p0p4k)

p0p4k commented 1 year ago

Hi, I tried using Pytorch==1.13.1 and the training worked for me. I suggest using the same version.

WendongGan commented 1 year ago

> Hello, does the training work well now? And can you post me your config file? Do you have discord? (add me on discord : p0p4k)

With https://github.com/p0p4k/vits2_pytorch/commit/3c5b155d97468f8e38e44e39ca42ede41f85da3f, I trained with PyTorch 2.0.1 and there was no problem. I did not change the config.

WendongGan commented 1 year ago

> p0p4k

ok

WendongGan commented 1 year ago

> Hello, does the training work well now? And can you post me your config file? Do you have discord? (add me on discord : p0p4k)

My discord is hepanqingge#8740; I have added you.