jik876 / hifi-gan

HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis
MIT License

Sound Quality of Trained Models #113

Closed · takashin3391 closed this issue 2 years ago

takashin3391 commented 2 years ago

Thank you for your support.

Do the sample voices on the following page (e.g., the Real Demo for a Ted Talk) use any of the publicly available trained models? https://daps.cs.princeton.edu/projects/HiFi-GAN/index.php?env-pairs=VCTK&speaker=p257&src-env=all

I input reverberant speech into the trained model, but the sound quality was not as good as the demo speech; there is jittery noise mixed in. What could be the cause of this?

takashin3391 commented 2 years ago

I fed the original input audio from the Real Demo for the Ted Talk on the following sample page into the trained model. https://daps.cs.princeton.edu/projects/HiFi-GAN/index.php?env-pairs=VCTK&speaker=p232&src-env=all

However, the output did not match the HiFi-GAN enhanced result on the sample page: the initial clapping can still be heard, and a jittery sound is mixed in.

Is the model used on the sample page different from the publicly available trained model? Why are the results different?

evrrn commented 2 years ago

This repo is devoted to the speech-synthesis model from the paper HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis. The examples you referred to are for a denoising model from a completely different paper: HiFi-GAN: High-Fidelity Denoising and Dereverberation Based on Speech Deep Features in Adversarial Networks.

They are two separate models that happen to share the name HiFi-GAN by coincidence.

takashin3391 commented 2 years ago

Thank you very much for your answer. I now understand that the model behind the sample page is different from the one in this repository.