jaywalnut310 / vits

VITS: Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech
https://jaywalnut310.github.io/vits-demo/index.html
MIT License

Korean Multi-speaker Model Convergence Failure #166

Closed (heesuju closed this issue 1 year ago)

heesuju commented 1 year ago

Hello, I'm currently training a Korean multi-speaker model with 5 speakers. The batch size is 50, and I have 8 hours of data at a 22050 Hz sampling rate. However, I'm not sure how to interpret the following TensorBoard results.

[screenshot: TensorBoard loss curves for the generator and discriminator]

From what I understand, VITS uses a GAN, so it has a generator and a discriminator. If the total discriminator loss goes down, the discriminator is getting better at spotting generated audio, which should push the generator to improve.
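For reference, this is roughly how I picture the adversarial part (a minimal least-squares-GAN sketch in PyTorch, simplified to a single discriminator output; as far as I can tell, losses.py in this repo does something similar summed over the sub-discriminators):

```python
import torch

def discriminator_loss(d_real, d_fake):
    # Least-squares GAN objective: push scores for real audio toward 1
    # and scores for generated audio toward 0. A falling discriminator
    # loss means it separates real from generated audio more easily.
    loss_real = torch.mean((1.0 - d_real) ** 2)
    loss_fake = torch.mean(d_fake ** 2)
    return loss_real + loss_fake

def generator_loss(d_fake):
    # The generator is rewarded when the discriminator scores its
    # output close to 1, i.e. mistakes it for real audio.
    return torch.mean((1.0 - d_fake) ** 2)
```

So, if I understand correctly, these two terms pull in opposite directions by design.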

However, whenever I train a new model, the two losses diverge, with the discriminator overpowering the generator. Another issue posted here suggests this is not necessarily a problem: https://github.com/jaywalnut310/vits/issues/13#issuecomment-904314885

I was hoping to get a more definite answer to the following questions:

  1. Is this convergence failure normal for VITS?
  2. If not, is my dataset the problem? Are there any possible solutions?

Any help would be appreciated. Thank you.

p0p4k commented 1 year ago

I think you just need to train longer. Once your duration predictor loss is around 0.2 and the mel loss is between 12 and 15, you can expect good results.

heesuju commented 1 year ago

> I think you just need to train longer. Once your duration predictor loss is around 0.2 and the mel loss is between 12 and 15, you can expect good results.

Thank you for your help!

I tried training again with the same dataset after changing the following:

  1. Batch size: 50 -> 32
  2. Fixed a problem where some parts of the audio in the dataset were cut off.

With these changes, I got the following TensorBoard results after training for 1594 epochs:

[screenshot: TensorBoard loss curves]

The mel loss did get lower, and the audio quality improved considerably, sounding more natural.

Could you tell me if I'm understanding this correctly?

  1. Duration predictor loss shows how well the model can generate varied, natural rhythm.
  2. Mel loss shows how well the generator can recreate the target mel-spectrograms.

In other words, is it correct to assume that these two values have the most impact when assessing the quality of the generated speech?

Again, thank you for your help!!

p0p4k commented 1 year ago
  1. During training, the duration predictor is not actually used to generate the output. It is just trained separately (in parallel) so that, at inference time, you can get durations close to the real ones. The duration here is the number of spectrogram frames assigned to each text token; we are trying to match the text length to the spectrogram length. During training we simply use the real length from the ground-truth wav's spectrogram and don't force the model to go through the duration predictor (see the sketch after this list).
  2. The mel loss is an L1 loss, i.e. the mean absolute difference |y_hat - y| over a sliced mel spectrogram. We conserve memory during training by slicing out a small part of the output and comparing only that part. After a few experiments, I think switching the L1 loss to an L2 loss is slightly better, but it should not make much difference. -- Batch size does make a difference, because it is the number of data points the model tries to fit in one forward step. But in my experience, even with a bigger batch size, everything eventually converges given longer training (I am not sure about this). You can also adjust the learning rate to match the batch size, but you might need to read a bit more about that in other literature.
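To make both points a bit more concrete, here is a minimal PyTorch sketch (made-up tensor names, not the repo's exact code) of where the duration targets come from during training and of the sliced mel reconstruction loss, including the L2 variant mentioned above:

```python
import torch
import torch.nn.functional as F

def duration_targets(attn):
    # attn: hard monotonic alignment with shape [batch, text_len, spec_len].
    # Summing over spectrogram frames gives, for each text token, the number
    # of frames it is aligned to -- these real durations are what the
    # duration predictor is trained to match, while the rest of the model
    # keeps using the real lengths directly.
    return attn.sum(dim=2)  # [batch, text_len]

def sliced_mel_loss(mel_real, mel_fake, start, segment_len, use_l2=False):
    # Compare only a short slice of the spectrograms to save memory.
    # mel_*: [batch, n_mels, frames]; start and segment_len define the slice.
    real = mel_real[:, :, start:start + segment_len]
    fake = mel_fake[:, :, start:start + segment_len]
    if use_l2:
        return F.mse_loss(fake, real)  # L2 variant
    return F.l1_loss(fake, real)       # L1 default (the repo scales this by a weight, c_mel)
```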

-- I am trying to implement VITS2, and it would be helpful if you could try that model and give me some feedback as well, if possible. Thanks.

heesuju commented 1 year ago

Thank you! That certainly cleared a lot of things up for me. I've been wanting to try VITS2, especially with all the improvements to multi-speaker training. I will absolutely try it as soon as it's out! Best of luck, and thank you for all your help!