p0p4k / vits2_pytorch

unofficial vits2-TTS implementation in pytorch
https://arxiv.org/abs/2307.16430
MIT License
471 stars, 84 forks

Two errors in using noise MAS #4

Closed: KdaiP closed this issue 1 year ago

KdaiP commented 1 year ago

Thank you for your implementation of VITS2! I copied your noise MAS part into the original VITS and tried to train a multi-speaker model on 2 GPUs. However, I ran into two errors along the way:

The first error arises when using DDP (DistributedDataParallel), with the message: "AttributeError: 'DistributedDataParallel' object has no attribute 'net_g.mas_noise_scale_initial'". Solutions found online suggest using model.module instead of model, because DDP wraps the original model and exposes it through its module attribute. The code in train.py (lines 181-182) may need to be modified as follows:

current_mas_noise_scale = net_g.module.mas_noise_scale_initial - net_g.module.noise_scale_delta * global_step
net_g.module.current_mas_noise_scale = max(current_mas_noise_scale, 0.0)
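
The fix above can be sketched in plain Python without PyTorch. This is only an illustration: unwrap_model, update_mas_noise_scale, FakeGenerator, FakeDDP, and the numeric values are hypothetical stand-ins, not code from the repository. The sketch assumes the wrapper exposes the real model through a module attribute, as DistributedDataParallel does.

```python
def unwrap_model(model):
    """Return the wrapped model if `model` has a `module` attribute, else `model`."""
    return getattr(model, "module", model)


def update_mas_noise_scale(model, global_step):
    """Linearly decay the MAS noise scale with the step count, clamped at zero."""
    core = unwrap_model(model)
    scale = core.mas_noise_scale_initial - core.noise_scale_delta * global_step
    core.current_mas_noise_scale = max(scale, 0.0)
    return core.current_mas_noise_scale


# Stand-ins for the real generator and its DDP wrapper (illustration only).
class FakeGenerator:
    def __init__(self):
        self.mas_noise_scale_initial = 0.01  # assumed value
        self.noise_scale_delta = 2e-6        # assumed value
        self.current_mas_noise_scale = self.mas_noise_scale_initial


class FakeDDP:
    def __init__(self, module):
        self.module = module


net_g = FakeDDP(FakeGenerator())
print(update_mas_noise_scale(net_g, global_step=1000))    # partially decayed
print(update_mas_noise_scale(net_g, global_step=10_000))  # clamped at 0.0
```

Going through a helper like this keeps the same training code working with and without DDP, since an unwrapped model is returned unchanged.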

The second error originates in models.py (line 703):

epsilon = torch.sum(logs_p, dim=1).exp() * torch.randn_like(neg_cent) * self.current_mas_noise_scale

The error message is: "RuntimeError: The size of tensor 'a' must match the size of tensor 'b' at non-singleton dimension 1". The code in notebooks/MAS_with_noise.ipynb appears to run correctly, but changing the batch size there from 1 to any other value triggers the same error. My guess is that when the batch size is not 1, the two tensors no longer satisfy the broadcasting rules. Adding a dimension with unsqueeze(1) resolves the error (though I am unsure whether it is correct). The adjusted code could be:

epsilon = torch.sum(logs_p, dim=1).exp().unsqueeze(1) * torch.randn_like(neg_cent) * self.current_mas_noise_scale
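
The broadcasting failure can be reproduced on shapes alone. The sketch below is a minimal pure-Python check of the broadcasting rule, not PyTorch itself; broadcast_shape is a hypothetical helper. It assumes neg_cent has shape (b, t_t, t_s) and logs_p has shape (b, d, t_s), so the sum over dim=1 has shape (b, t_s). These shapes are assumptions and are not stated in the thread.

```python
def broadcast_shape(a, b):
    """Return the broadcast shape of two tensor shapes, or raise ValueError.

    Shapes are aligned from the trailing dimension; each pair of sizes must be
    equal, or one of them must be 1.
    """
    # Pad the shorter shape with leading 1s, then compare dimension by dimension.
    ndim = max(len(a), len(b))
    a = (1,) * (ndim - len(a)) + tuple(a)
    b = (1,) * (ndim - len(b)) + tuple(b)
    result = []
    for dim, (x, y) in enumerate(zip(a, b)):
        if x != y and x != 1 and y != 1:
            raise ValueError(f"size {x} does not match size {y} at dimension {dim}")
        result.append(y if x == 1 else x)
    return tuple(result)


b, t_t, t_s = 4, 50, 20   # illustrative batch size, frame count, text length
neg_cent = (b, t_t, t_s)  # assumed shape of neg_cent
summed = (b, t_s)         # assumed shape of torch.sum(logs_p, dim=1)

print(broadcast_shape((1, t_s), (1, t_t, t_s)))  # batch size 1: compatible
print(broadcast_shape((b, 1, t_s), neg_cent))    # after unsqueeze(1): compatible
try:
    broadcast_shape(summed, neg_cent)            # batch size 4, no unsqueeze
except ValueError as err:
    print("mismatch:", err)
```

Under these assumptions, the (b, t_s) tensor is aligned as (1, b, t_s), so its batch axis is compared with the t_t axis of neg_cent. That passes only when b is 1 or happens to equal t_t, which would explain why the notebook works with a batch size of 1.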

After introducing the noise, I observed that the MAS result deteriorated as the training steps increased. In addition, the generated audio consisted mostly of silence and an electric-hum-like sound. There may be a problem in the noise generation formula. [4 images attached]

p0p4k commented 1 year ago

Thanks a lot for the detailed bug report. It would be helpful if you could send a PR for it, and I will fix the code according to your suggestions. I will go through the training script once again and try to debug and fix the batch errors. Maybe we can remove torch.sum(logs_p, dim=1).exp().unsqueeze(1) and try. I will run some experiments around that and update the README as well.

p0p4k commented 1 year ago

Also, @KdaiP, would it be possible for you to train with the use_noise_scaled_mas=False flag in the config? It might help to see how much each part of the model contributes to the improvement over base VITS-1. Thank you very much for your GPU resources!

p0p4k commented 1 year ago

Now that I think about it, we might have to use torch.std: https://pytorch.org/docs/stable/generated/torch.std.html. I will update the code soon.
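
One possible reading of this suggestion is that the noise would be scaled by the standard deviation of the alignment scores instead of by the summed log-variances. The thread does not show the updated code, so the sketch below is only a guess written in plain Python; std_scaled_noise and its arguments are hypothetical.

```python
import random
import statistics


def std_scaled_noise(values, noise_scale, rng=random):
    """Return Gaussian noise scaled by the sample std of `values` and by `noise_scale`.

    statistics.stdev uses Bessel's correction, which matches the default
    behaviour of torch.std.
    """
    sigma = statistics.stdev(values)
    return [sigma * rng.gauss(0.0, 1.0) * noise_scale for _ in values]


scores = [-3.2, -1.5, -4.8, -2.1]  # made-up alignment scores for illustration
noise = std_scaled_noise(scores, noise_scale=0.01, rng=random.Random(0))
noisy_scores = [s + n for s, n in zip(scores, noise)]
print(noisy_scores)
```

Because the noise is multiplied by noise_scale, it vanishes once the scheduled scale has decayed to zero.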

p0p4k commented 1 year ago

@KdaiP You may now try the new update and let me know about any training issues. Thanks.

KdaiP commented 1 year ago

Thanks for your prompt response! I will try it later.