Closed. KdaiP closed this issue 1 year ago.
Thanks a lot for the detailed bug report.
It would be helpful if you could send a PR for it; otherwise I will fix the code according to your suggestions.
I will go through the training script once again and try to debug and fix the batch errors.
Maybe we can remove torch.sum(logs_p, dim=1).exp().unsqueeze(1)
and try. I will run some experiments around that and update the README as well.
Also, @KdaiP, would it be possible for you to train with the use_noise_scaled_mas=False
flag in the config? It would be helpful to see how much improvement each part of the model contributes over base VITS-1. Thank you very much for your GPU resources!
Now that I think about it, we might have to use torch.std (https://pytorch.org/docs/stable/generated/torch.std.html). I will update the code soon.
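If the scaling does switch to a standard deviation, a minimal NumPy sketch of the shape bookkeeping (the tensor shapes here and the choice to reduce over the channel dimension are my assumptions, not quoted from the repo; NumPy follows the same broadcasting rules as PyTorch):

```python
import numpy as np

b, d, t_s, t_t = 4, 192, 50, 120       # illustrative shapes, not the real config
logs_p = np.random.randn(b, d, t_s)    # stand-in for the prior log-stds

# analogue of torch.std(logs_p.exp(), dim=1, keepdim=True); torch.std is
# unbiased by default, which corresponds to ddof=1 in NumPy
scale = np.std(np.exp(logs_p), axis=1, ddof=1, keepdims=True)  # (b, 1, t_s)

# keepdims already produces the singleton dim, so the scale broadcasts
# against a neg_cent-shaped tensor without an extra unsqueeze
epsilon = scale * np.random.randn(b, t_t, t_s)
```

Using keepdims=True would sidestep the unsqueeze(1) workaround entirely, since the reduced dimension is kept as size 1.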
@KdaiP You may now try the new update and let me know about any training issues. Thanks.
Thanks for your prompt response! I will try it later.
Thank you for your implementation of VITS2! I copied your noise MAS part into the original VITS and attempted to train it with multiple speakers across 2 GPUs. However, I encountered two errors in the process:
The first error arises when using DDP (DistributedDataParallel): "AttributeError: 'DistributedDataParallel' object has no attribute 'net_g.mas_noise_scale_initial'". Solutions found online suggest using model.module instead of model, so the code in train.py (lines 181-182) may need to be modified accordingly.
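Since the train.py lines themselves are not quoted above, here is only an illustrative sketch of the model.module pattern; the attribute names follow the error message in this report, while update_mas_noise_scale and noise_scale_delta are hypothetical names for the surrounding logic:

```python
def unwrap(model):
    # DistributedDataParallel exposes the wrapped network as `.module`;
    # a plain (non-wrapped) model is returned unchanged
    return getattr(model, "module", model)

def update_mas_noise_scale(net_g, global_step):
    # decay the MAS noise scale each step, clamped at zero; the decay
    # rule here is an assumption, not quoted from train.py
    model = unwrap(net_g)
    current = model.mas_noise_scale_initial - model.noise_scale_delta * global_step
    model.current_mas_noise_scale = max(current, 0.0)
```

The getattr fallback keeps the same code path working for single-GPU (unwrapped) and DDP (wrapped) training.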
The second error originates in models.py (line 703):
epsilon = torch.sum(logs_p, dim=1).exp() * torch.randn_like(neg_cent) * self.current_mas_noise_scale
The error message is: "RuntimeError: The size of tensor a must match the size of tensor b at non-singleton dimension 1". Upon examining notebooks/MAS_with_noise.ipynb, the code appears to work correctly, yet changing the batch size from 1 to any other value triggers the same error. I suspect that when the batch size is not 1, the two tensors fail to meet the broadcasting condition. Introducing a dimension with unsqueeze(1) resolves the error (though I am unsure whether it is the right fix). The adjustment could be as follows:
epsilon = torch.sum(logs_p, dim=1).exp().unsqueeze(1) * torch.randn_like(neg_cent) * self.current_mas_noise_scale
After introducing the noise, I observed the MAS alignment deteriorating as the training steps increased. Additionally, the generated audio consisted mostly of silence and an electric-current-like sound. There might be some problem in the noise-generation formula.