Mikefizzy opened this issue 3 years ago
Okay, I just noticed something strange.
https://i.gyazo.com/753e21a859ae494ad265d08efb6e76ac.png At training step 0, starting from the universal model, it outputs a spectrogram that somewhat resembles the original. The quality is bad, but you can hear muffled speech.
Then, 100 steps later, it just goes blank: https://i.gyazo.com/3110f6e0a5f08d78eae096419ffe142c.png
Okay, I've concluded that fine-tuning is a really bad idea, because I just trained for 100 steps without it and I'm getting clean, crisp audio.
Taking such large steps is not a good idea with a pretrained model, because the updates are probably big enough to throw away everything it learned before.
Something to think about.
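The overshoot effect described above can be sketched with a toy gradient descent on f(x) = x², where the starting point plays the role of the pretrained weights sitting near an optimum. This is an illustration of the step-size argument, not the actual HiFi-GAN training loop; the learning-rate values are arbitrary:

```python
def gd(lr, steps=100, x0=0.01):
    """Gradient descent on f(x) = x^2, starting near the optimum x* = 0.

    A pretrained model is like x0 already close to the optimum: a small
    step size refines it, while an overly large step size moves it
    further away on every update (here the error doubles each step).
    """
    x = x0
    for _ in range(steps):
        x -= lr * 2 * x  # gradient of x^2 is 2x
    return abs(x)

small = gd(lr=0.1)  # contraction factor |1 - 2*lr| = 0.8, error shrinks
large = gd(lr=1.5)  # contraction factor |1 - 2*lr| = 2.0, error explodes
print(small, large)
```

With the small step the iterate stays essentially at the optimum; with the large step the same starting point is driven astronomically far away, which is the "throws away everything it learned" behavior in miniature.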
Something else worth mentioning: the generator loss was 6 when it was outputting the empty spectrograms,
and when it output the good spectrograms (https://i.gyazo.com/d93e91db946ee7688fca02aa72309176.png) the loss was around 28 to 30,
which shows that the generator loss isn't a direct indicator of quality.
Hi, I just wanted to confirm: by
"fine-tuning is a really bad idea"
do you mean that you trained HiFi-GAN from scratch on only 1 hour of the user's recordings? Could you share the hyperparameter settings (especially the number of epochs, since the dataset is so small)?
Thanks.
I'm an idiot: I thought fine-tuning meant resetting the learning rate.
Now I realize it wasn't training on the intended mel spectrograms.
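For anyone confused by the same thing: if I understand the HiFi-GAN fine-tuning mode correctly, it does not reset the learning rate; it swaps the training inputs so the vocoder trains on mels predicted by the acoustic model instead of mels computed from ground-truth audio. Here is a minimal sketch of that data-selection logic; the function and path names are hypothetical, not HiFi-GAN's actual code:

```python
import os

def pick_mel_source(wav_path, fine_tuning, input_mels_dir):
    """Sketch of fine-tuning-style data selection (names hypothetical).

    Normal training: the mel is computed from the ground-truth wav.
    Fine-tuning: the mel is loaded from the acoustic model's predicted
    spectrograms on disk, so the vocoder learns to correct that
    model's artifacts rather than invert clean mels.
    """
    if fine_tuning:
        name = os.path.splitext(os.path.basename(wav_path))[0] + ".npy"
        return os.path.join(input_mels_dir, name)  # predicted mel file
    return wav_path  # compute the mel from the ground-truth audio

print(pick_mel_source("data/clip1.wav", True, "ft_dataset"))
```

So "fine-tuning" here is about *what* the model trains on, not *how fast* it trains, which explains why enabling it while expecting a learning-rate reset gave the blank spectrograms above.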
I'm going to try training MaskCycleGAN-VC on the kind of mel spectrograms HiFi-GAN expects, and then training the vocoder on the converted output spectrograms.
For context: I'm using this with MaskCycleGAN voice conversion, and I only have 1 hour of data from the speaker.