How much data is enough?

jik876 / hifi-gan

HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis

MIT License

How much data is enough? #84

Open Mikefizzy opened 3 years ago

Mikefizzy commented 3 years ago

So I'm using this with MaskCycleGAN-VC voice conversion, and I only have 1 hour of data from the speaker.

Mikefizzy commented 3 years ago

Okay just noticed something strange

https://i.gyazo.com/753e21a859ae494ad265d08efb6e76ac.png At training step 0, starting from the universal model, it outputs a spectrogram that somewhat resembles the original. The quality is bad, but you can hear muffled speech.

Then 100 steps later it just goes blank https://i.gyazo.com/3110f6e0a5f08d78eae096419ffe142c.png

Mikefizzy commented 3 years ago

Okay, I figured that fine tuning is a really bad idea, because I just trained for 100 steps without it and I'm getting clean, crisp audio.

Taking such large steps is not a good idea with a pretrained model because it probably makes such big changes that it throws away everything that it learned before.

something to think about
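
To illustrate the step-size point above, here is a toy sketch, not HiFi-GAN itself. It assumes a single weight that starts close to the optimum of a simple quadratic loss and standing in for a pretrained model. The function name, the starting value, and both learning rates are made up for illustration.

```python
def descend(w, lr, steps, target=1.0):
    """Plain gradient descent on the toy loss (w - target) ** 2."""
    for _ in range(steps):
        grad = 2.0 * (w - target)  # derivative of the toy loss
        w -= lr * grad
    return w


# Hypothetical "pretrained" weight, already close to the optimum at 1.0.
pretrained_w = 0.9

small = descend(pretrained_w, lr=0.1, steps=100)  # small steps
large = descend(pretrained_w, lr=1.1, steps=100)  # steps too large for this loss

print("distance from optimum, small lr:", abs(small - 1.0))
print("distance from optimum, large lr:", abs(large - 1.0))
```

With the small rate the weight stays near the optimum. With the large rate each update overshoots by more than the previous error, so the weight ends up much further away than where it started.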

Mikefizzy commented 3 years ago

Something else worth mentioning: the generator loss was 6 when it was outputting the empty spectrograms,

and when it output the good spectrograms https://i.gyazo.com/d93e91db946ee7688fca02aa72309176.png the loss was around 28 to 30,

which shows that the generator loss isn't a direct indication of quality.

Megh-Thakkar commented 3 years ago

Hi, just wanted to confirm: by "finetuning is a really bad idea", do you mean that you trained hifi-gan from scratch on only 1 hour of user recordings? Can you share the hyperparameter settings (especially the number of epochs, since the dataset is so small)?

Thanks.

Mikefizzy commented 3 years ago

I'm a stupid idiot. I thought fine tuning meant resetting the learning rate.

Now I know it wasn't training on the desired mel spectrograms.

I'm gonna try training MaskCycleGAN-VC on the mel spectrograms that HiFi-GAN generates, and then training the vocoder on the output spectrograms.
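
Before training the vocoder on precomputed spectrograms, it may help to check that every training wav actually has one. The sketch below assumes that fine-tuning mode reads one saved mel file per wav, matched by base name with a `.npy` extension. That is my reading of the repo's `meldataset.py` and worth verifying there. The function name `find_missing_mels` and the directory names in the usage comment are hypothetical.

```python
from pathlib import Path


def find_missing_mels(wav_dir, mel_dir):
    """Return sorted base names of .wav files with no matching .npy file in mel_dir."""
    wav_stems = {p.stem for p in Path(wav_dir).glob("*.wav")}
    mel_stems = {p.stem for p in Path(mel_dir).glob("*.npy")}
    return sorted(wav_stems - mel_stems)


# Example usage (placeholder directory names):
#   missing = find_missing_mels("wavs", "mels")
#   if missing:
#       print("no precomputed mel for:", missing)
```

An empty result means every wav has a matching mel file. It does not check that the mel contents are the spectrograms you actually intended to train on.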