jik876 / hifi-gan

HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis

Some questions about training #36

Closed: Alexey322 closed this 3 years ago

Alexey322 commented 3 years ago

Hey. Why does the README say that you need to use GTA mels for fine-tuning? I trained WaveGlow and Parallel WaveGAN on real spectrograms, and the authors of those models indicated that this achieves acceptable quality in combination with Tacotron 2. Is the GTA training procedure mandatory for this vocoder to achieve the best possible quality?
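For anyone landing here: GTA (ground-truth-aligned) mels are the spectrograms the acoustic model predicts while being fed the real frames via teacher forcing, so the predictions stay time-aligned with the recorded audio. A minimal extraction sketch, assuming a Tacotron 2-style model whose forward pass accepts the ground-truth mels for teacher forcing; the function name, batch layout, and `utt_ids` field are illustrative, not the repo's actual API:

```python
import numpy as np
import torch

@torch.no_grad()
def extract_gta_mels(tacotron2, dataloader, out_dir):
    tacotron2.eval()
    for text, text_lengths, mel, mel_lengths, utt_ids in dataloader:
        # Teacher forcing: the decoder consumes the ground-truth mel at
        # each step, so the predicted frames stay aligned with the audio.
        _, mel_postnet, *_ = tacotron2(text, text_lengths, mel, mel_lengths)
        for i, utt_id in enumerate(utt_ids):
            length = int(mel_lengths[i])
            gta = mel_postnet[i, :, :length].cpu().numpy()
            # One .npy mel per utterance, named to match its wav file,
            # which is what vocoder fine-tuning setups typically expect.
            np.save(f"{out_dir}/{utt_id}.npy", gta)
```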

And another question, about the multi-speaker model. Do I need any modifications to train a multi-speaker model, or is it enough to generate spectrograms of different speakers and train on them? Also, what is the minimum number of speakers required for the model to reproduce, in good quality, speakers unseen during training?

jik876 commented 3 years ago

Hi. You can find more details in Section 4.4 of our paper: HiFi-GAN scored better than WaveGlow even before fine-tuning, and higher still after fine-tuning. Fine-tuning is not mandatory, but I would recommend it if you want the best quality.
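To make the difference concrete, fine-tuning only changes the input side of each training pair: the vocoder still reconstructs the real waveform, but it is conditioned on the acoustic model's GTA mel instead of a mel computed from the audio. A rough sketch of that pairing (the loader below is illustrative and uses librosa rather than the repo's own audio code):

```python
import numpy as np
import librosa

def load_training_pair(utt_id, wav_dir, gta_dir, fine_tuning):
    # The target is always the real recorded waveform.
    wav, sr = librosa.load(f"{wav_dir}/{utt_id}.wav", sr=22050)
    if fine_tuning:
        # Input condition: the acoustic model's GTA prediction, so the
        # vocoder learns to compensate for its systematic errors.
        mel = np.load(f"{gta_dir}/{utt_id}.npy")
    else:
        # Input condition: a mel computed directly from the real audio.
        mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_mels=80)
    return mel, wav
```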

We made no modifications to the model in our multi-speaker experiments. We experimented with the VCTK dataset (109 speakers) and got very good results; you can find more details in our paper. It's difficult to comment on the minimum dataset size needed for a model to generate high-quality audio for unseen speakers. Because generalization to unseen speakers is significantly affected by recording quality as well as by the speakers' speech characteristics, quality can vary greatly depending on the training dataset and on the quality of the input condition at inference time.
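In other words, multi-speaker training here is purely a data question: pool all speakers' utterances and train as usual, since the vocoder conditions only on the mel-spectrogram and uses no speaker ID. A minimal sketch of building such a pooled filelist (the directory layout and one-ID-per-line format are assumptions, not prescribed by the repo):

```python
# Build one training filelist that pools every speaker's utterances.
# The model is unchanged; only the data mix differs.
from pathlib import Path

def build_pooled_filelist(wav_root, out_path):
    wavs = sorted(Path(wav_root).rglob("*.wav"))
    with open(out_path, "w") as f:
        for wav in wavs:
            # One utterance ID per line (format assumed; adjust to
            # whatever your training filelist parser expects).
            f.write(wav.stem + "\n")
```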

jik876 commented 3 years ago

I'm closing this as there have been no recent updates. Please reopen if anyone needs additional comments.