jaywalnut310 / vits

VITS: Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech
https://jaywalnut310.github.io/vits-demo/index.html
MIT License
6.67k stars 1.23k forks

Anybody having luck fine tuning? #14

Closed TaoTeCha closed 3 years ago

TaoTeCha commented 3 years ago

I'm using a clean 40-hour dataset (female, American English) that I used with Tacotron with good results. I've trained VITS twice now and it starts overfitting around 70k steps. The output is definitely intelligible and in the correct tone, but the prosody is way off. The first run used the default configs. For the second run I decreased the learning rate and LR decay, which helped somewhat with overall loss, but it still started overfitting around 70k.

alexpeattie commented 3 years ago

When you say fine tuning, do you mean you're initializing the weights from the pretrained LJSpeech model, as opposed to training from scratch? I've trained on a ~15 hour single-speaker dataset from scratch and gotten good results 🙂 .

When you say it's overfitting, what do you mean exactly? That it's able to generate sentences in your training set well, but can't generate novel sentences with correct prosody? Or are you referring to the learning curves you're seeing?

TaoTeCha commented 3 years ago

I am using '!python train.py -c configs/ljs_base.json -m ljs_base' with a config file that points to my own datasets. So I assumed that was using the pretrained weights. How would I start from complete scratch?

To be clear, I'm getting decent results. But my model isn't nearly as natural as the examples, and that is the only thing I'm looking for right now. The total loss starts to level out and even increase after around 70k and the examples in the tensorboard start to sound slightly worse. I decided to just keep training and it does seem to oscillate from increasing slightly to decreasing, increasing slightly then decreasing. Overall I think it's decreasing very slowly still and I'm at 104k. I'll just keep running and see what happens.

alexpeattie commented 3 years ago

Ah, in that case what you're doing is training from scratch rather than fine tuning. You'd have to download the pretrained models; but I think only the generator is provided, not the discriminator, so I'm not sure the linked pretrained models are suitable for fine tuning.

Yes I think it's definitely worth continuing training, even if the loss seems to be plateauing. I believe the paper says the LJSpeech model was trained to 800k steps.
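For anyone unsure what "fine tuning" means mechanically: it's loading pretrained weights into the model before training, rather than starting from random initialization. A minimal PyTorch sketch, with an illustrative model and file name (the real entry point in this repo is train.py, and its checkpoints also carry optimizer state):

```python
# Hedged sketch of warm-starting ("fine tuning") from a pretrained
# checkpoint in PyTorch. TinyGenerator and the file name are purely
# illustrative, not from this repo.
import torch
import torch.nn as nn

class TinyGenerator(nn.Module):
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(4, 4)

# Pretend this is a published pretrained generator checkpoint.
pretrained = TinyGenerator()
torch.save({"model": pretrained.state_dict()}, "g_pretrained.pth")

# Fine tuning: build a fresh model, then load the pretrained weights
# before training on the new dataset, instead of random initialization.
model = TinyGenerator()
ckpt = torch.load("g_pretrained.pth", map_location="cpu")
model.load_state_dict(ckpt["model"])
```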

Liujingxiu23 commented 3 years ago

@alexpeattie Could you share your TensorBoard loss curves? After how many steps (G_*.pth) of training were the synthesized wavs understandable or of decent quality? I tried to train the model on my own Chinese dataset, but training seems abnormal, and the synthesized wavs are still bad at G_180k.pth with the same settings as vctk_base.json, except the sampling rate is 16000.

alexpeattie commented 3 years ago

Here you go:

[Two TensorBoard screenshots of training loss curves]

It seems the discriminator loss is improving steadily, and the mel loss is improving on the generator side while the other generator losses are climbing. However, generated samples seem to be getting slightly better.

The synthesized samples sound very good in terms of spectrogram inversion after only a few tens of thousands of steps (e.g. no tinniness/reverb/warbling), and the prosody seems to get better as training goes on. For me, by 180k it already sounded better than Tacotron, so I suspect something might have gone wrong with your training.

Liujingxiu23 commented 3 years ago

@alexpeattie Thank you very much for your reply. I also suspect I made a mistake in one of the steps; I will try to figure it out. Your TensorBoard curves are a very good reference!

TaoTeCha commented 3 years ago

Can someone explain why all these losses are going up? My graphs look similar.

LG-SS commented 3 years ago

> Here you go:
>
> [Two TensorBoard screenshots of training loss curves]
>
> It seems the discriminator loss is improving steadily, and the mel loss is improving on the generator side while the other generator losses are climbing. However, generated samples seem to be getting slightly better.
>
> The synthesized samples sound very good in terms of spectrogram inversion after only a few tens of thousands of steps (e.g. no tinniness/reverb/warbling), and the prosody seems to get better as training goes on. For me, by 180k it already sounded better than Tacotron, so I suspect something might have gone wrong with your training.

The discriminator gets better through training. In other words, the discriminator correctly classifies generated samples as fake, so the g_loss increases, and that loss is backpropagated to force the generator to produce more realistic samples.

TaoTeCha commented 3 years ago

I understand what you're saying in theory, but wouldn't the generator then get better at generating as the discriminator improves, resulting in a smaller generator loss? I guess I don't know exactly how the generator loss is calculated.

This is my first time training a GAN. I'm not saying I'm right, just that it seems counterintuitive that the losses are designed such that the total loss goes up while the model keeps improving.

I'm at 180k now and I guess it sounds a bit better than at 80k, but probably not by much. Total loss has been increasing slightly or staying constant since about 80k, but the mel, dur, and d_total losses are still decreasing; maybe those are a better indicator of improvement. I'm just going to keep going.

CookiePPP commented 3 years ago

@TaoTeCha https://github.com/jaywalnut310/vits/blob/2e561ba58618d021b5b8323d3765880f7e0ecfdb/losses.py#L25-L26 The GAN losses are MSE, so the total loss goes up when one network is doing better than the other, even though their relative performance doesn't say anything about the quality of the overall model. I guess I'm just saying: ignore the generator and discriminator losses. As long as both stay above 0, the model should still be improving.


In this case, the total loss is going up because the discriminator has a higher loss than the generator. If the GAN losses were calculated with MAE, the total loss wouldn't change when the discriminator or generator got better than the other: as one got better by X, the other would get worse by X, and the total would be unchanged. MSE is used (I believe) because it learns faster, and we want the gradients to be larger in whichever network is performing worse, so the two networks balance themselves naturally. Otherwise you would use MAE to check performance, since it doesn't make the total loss look like it's going up.

https://www.desmos.com/calculator/scnamb2cys is a pretty graph of the idea: the total loss is at its lowest when the discriminator gives 50% confidence on every sample it sees, even though during training you actually want the discriminator to be better than the generator so that useful gradients can be propagated to the generator.
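That behaviour can be sketched numerically. A toy example with scalar discriminator outputs in NumPy (the repo operates on tensors, but the least-squares losses in losses.py have this shape):

```python
# Toy sketch of VITS's least-squares (MSE) GAN losses, using scalar
# discriminator outputs for illustration.
import numpy as np

def discriminator_loss(d_real, d_fake):
    # The discriminator wants d_real -> 1 and d_fake -> 0.
    return np.mean((1 - d_real) ** 2) + np.mean(d_fake ** 2)

def generator_loss(d_fake):
    # The generator wants the discriminator to output 1 on fakes.
    return np.mean((1 - d_fake) ** 2)

# A maximally confused discriminator (0.5 everywhere) minimises the
# combined objective...
balanced = (discriminator_loss(np.array([0.5]), np.array([0.5]))
            + generator_loss(np.array([0.5])))  # ≈ 0.75
# ...while a sharper discriminator lowers its own loss but raises the
# generator's, so the *total* climbs even as samples improve.
sharp = (discriminator_loss(np.array([0.9]), np.array([0.1]))
         + generator_loss(np.array([0.1])))     # ≈ 0.83
```

So a rising total is exactly what you'd expect from a discriminator that is pulling ahead, not a sign the model is getting worse.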


edit again: to define the terms:

- d_g values are the discriminator's MSE loss at predicting that a fake sample is fake
- d_r values are the discriminator's MSE loss at predicting that a real sample is real
- g_{1...5} values are the discriminator's MSE loss at predicting that a fake sample is real (those gradients are then sent to the generator, so the generator updates its weights in a way that makes the discriminator output "real" when it sees a fake)

I believe the other loss values are the ones that matter, e.g. dur, kl, and mel should be going down in an improving model.


edit3: @alexpeattie ~~Did you get an improvement from g/fm loss? I ended up removing it from my models, but I'm curious if it works better for Waveforms than the random crap I've been playing with.~~

alexpeattie commented 3 years ago

@alexpeattie Did you get an improvement from g/fm loss?

Hey @CookiePPP, sorry, I'm not 100% sure what you're asking? Do you mean did the g/fm loss improve as I continued training?

CookiePPP commented 3 years ago

@alexpeattie Ah sorry, ignore that

TaoTeCha commented 3 years ago

Well, I ran to 300k and it sounds great: definitely better in audio quality, prosody, and pronunciation than Tacotron 2. Some pronunciation or pacing issues come up in a few samples, but overall it sounds awesome. I think part of the problem was that I was only listening to the TensorBoard examples for the later checkpoints, which sounded a lot worse than anything I hear at inference. Don't judge model quality by what you hear in TensorBoard.

And thanks @CookiePPP for the detailed response. You always show up out of nowhere to help those in need.

Liujingxiu23 commented 3 years ago

@alexpeattie After I found and fixed the mistakes in my code, training works fine and the synthesized wavs are excellent! Could you please help me figure out this problem? https://github.com/jaywalnut310/vits/issues/13

nartes commented 3 years ago

@alexpeattie What hardware did you use? I'm trying to squeeze the model into an 8 GiB GPU and have reduced some model parameters, but I get a negative loss value for the duration predictor. I've tried training for as long as 200k and 500k steps.

xinghua-qu commented 2 years ago

@CookiePPP I strongly agree with you on "I believe the other loss values are ones that matter e.g: dur, kl and mel should be going down in an improving model".

However, the training curves show that these discriminator-independent losses (say, kl and fm) do not decrease over the course of training. For me this is indeed quite counterintuitive. Any deeper insight? Thanks ahead.

CookiePPP commented 2 years ago

@Teddy-QU loss/fm can be ignored, as its numerical value doesn't actually mean anything on its own. loss/kl is linked to loss/mel. Read up on autoencoders and VAEs if you're not familiar with the concept, but basically the model can trade between KL and mel loss as it trains (e.g. the model can get a better mel loss by increasing its KL loss; the model has a natural balancing point, but that might shift over training). loss/kl going up isn't automatically bad; it's only bad if loss/mel increases at the same time.
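A toy illustration of that trade-off, with made-up numbers (no relation to real VITS values): if loss/kl rises while loss/mel falls by more, the combined VAE objective still improves.

```python
# Made-up checkpoint values, only to illustrate the kl/mel trade-off:
# kl rising is fine as long as the combined objective keeps falling.
checkpoints = [
    {"step": 100_000, "mel": 22.0, "kl": 1.2},
    {"step": 200_000, "mel": 20.5, "kl": 1.6},  # kl up, mel down more
]

# Proxy for the reconstruction + KL part of the VAE objective.
combined = [c["mel"] + c["kl"] for c in checkpoints]  # ≈ [23.2, 22.1]
```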


So uh, summary of how to read the loss terms is as follows;

xinghua-qu commented 2 years ago

@CookiePPP Thanks for the awesome sharing. I understand that loss/kl and loss/mel together contribute to the evidence lower bound of the VAE. The loss plot below indeed aligns with that trend.

[TensorBoard loss plot]

Although loss/fm is designed to be a reconstruction loss measured in the hidden layers of the discriminator, given the adversarial training its value indeed loses its practical meaning.

ToiYeuTien commented 11 months ago

Hi everybody! I have trained a Vietnamese female voice model for 500k steps, and I found the voice quite clear. I now want to train another, male Vietnamese voice. I've learned there is a training method that starts from a previously trained model, which shortens the training time. Can someone help me with that method? Thank you!

ylacombe commented 8 months ago

I've created a repo to fine-tune VITS (and MMS, Meta's multilingual version of VITS) using the Hugging Face implementation; feel free to take a look: https://github.com/ylacombe/finetune-hf-vits

You can try some finetuned models in this demo and in that one.