Closed: Alexey322 closed this issue 3 years ago.
Hi. You can find more details in Section 4.4 of our paper: HiFi-GAN scored better than WaveGlow even before fine-tuning, and higher still after fine-tuning. Fine-tuning is not mandatory, but I would recommend it if you want the best quality.
We made no modifications to the model in our multi-speaker experiments. We experimented with the VCTK dataset (109 speakers) and got very good results; you can find more details in our paper. It's difficult to comment on a lower bound on dataset size for a model that generates high-quality audio for unseen speakers. Since such a model is significantly affected by recording quality as well as by the speech characteristics of the speakers, quality can vary greatly depending on the training dataset and on the quality of the input condition at inference time.
I'm closing this as there are no recent updates. Please reopen if anyone needs additional comments.
Hey. Why does the README say that you need to use GTA (ground-truth-aligned) mels for fine-tuning? I trained WaveGlow and Parallel WaveGAN on real spectrograms, and their authors indicated that this achieves acceptable quality in conjunction with Tacotron 2. Is the GTA training procedure mandatory for this vocoder to achieve the best possible quality?
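For readers unfamiliar with the term: GTA mels are produced by running the acoustic model (e.g. Tacotron 2) in teacher-forcing mode, so its predictions stay frame-aligned with the ground-truth audio while still carrying the model's own prediction errors. The toy sketch below illustrates only that idea; the linear `decoder_step` is a hypothetical stand-in, not Tacotron 2's actual decoder.

```python
# Toy illustration of GTA ("ground-truth aligned") mel generation.
# The real pipeline runs Tacotron 2 with teacher forcing; the linear
# decoder_step here is purely illustrative.
import numpy as np

rng = np.random.default_rng(0)

N_MELS, T = 80, 50
real_mels = rng.standard_normal((T, N_MELS))  # mels extracted from real audio

# Hypothetical one-step decoder: predicts the next frame from the
# previous *ground-truth* frame (teacher forcing), so the output stays
# time-aligned with the reference mels.
W = rng.standard_normal((N_MELS, N_MELS)) * 0.01

def decoder_step(prev_gt_frame):
    return prev_gt_frame @ W

gta_frames = [np.zeros(N_MELS)]           # initial "go" frame
for t in range(T - 1):
    gta_frames.append(decoder_step(real_mels[t]))  # feed GT, keep prediction
gta_mels = np.stack(gta_frames)

# GTA mels match the reference in shape and alignment, but contain the
# acoustic model's errors -- fine-tuning the vocoder on them narrows the
# mismatch between training inputs and the mels seen at inference.
assert gta_mels.shape == real_mels.shape
```

The point of fine-tuning on these, rather than on real mels, is that at synthesis time the vocoder only ever sees the acoustic model's (imperfect) output, so training on that same distribution tends to help.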
And another question about the multi-speaker model. Do I need any modifications to train a multi-speaker model, or is it enough to generate spectrograms of different speakers and train on them? Also, what is the minimum number of speakers required for the model to reproduce, in good quality, speakers unseen during training?
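As the answer above notes, no model changes were needed: the vocoder is conditioned only on the mel spectrogram, so multi-speaker training reduces to pooling utterances from all speakers into one training filelist. A minimal sketch, assuming a VCTK-style directory layout (speaker IDs and paths below are illustrative, not the repo's actual filelist format):

```python
# Sketch: building a pooled multi-speaker training filelist.
# The directory layout and speaker IDs are assumptions for illustration.
import os
import random
import tempfile

root = tempfile.mkdtemp()
speakers = ["p225", "p226", "p227"]  # e.g. VCTK-style speaker IDs
for spk in speakers:
    os.makedirs(os.path.join(root, spk))
    for i in range(3):
        # create empty placeholder wav files for the sketch
        open(os.path.join(root, spk, f"{spk}_{i:03d}.wav"), "wb").close()

# Pool utterances across all speakers and shuffle them together, so each
# training batch mixes speakers.
files = [os.path.join(root, spk, name)
         for spk in speakers
         for name in sorted(os.listdir(os.path.join(root, spk)))]
random.Random(42).shuffle(files)

with open(os.path.join(root, "train_files.txt"), "w") as fh:
    fh.write("\n".join(files))
```

Mel extraction itself must use identical STFT/mel parameters for every speaker, and at training time the vocoder never needs to know which speaker an utterance came from.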