jik876 / hifi-gan

HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis
MIT License

Buzzing sound when using Tacotron2+HiFi-GAN #41

Closed: tulasiram58827 closed this issue 3 years ago

tulasiram58827 commented 3 years ago

Hi @jik876 @Edresson

I have been trying to integrate Tacotron2 and HiFi-GAN to build a fully end-to-end TTS pipeline. But when I feed the Tacotron2 output to your fine-tuned HiFi-GAN model, the output audio is just a buzzing sound. To make sure the Tacotron2 output is correct, I fed the same output to the Parallel WaveGAN model, and it works as expected. So I believe there is some incompatibility when feeding the Tacotron2 output to HiFi-GAN. I created a Colab notebook that reproduces the problem; you can reproduce the output with it.

Also, I created an end-to-end Colab notebook to run the HiFi-GAN model. If you want me to add it to your repo, I will go ahead and create a PR.

Also, I plan to convert this HiFi-GAN model to TFLite format to help mobile developers. We (with @sayakpaul) have already converted a few models to TFLite format. You can find more details about our TFLite repo here.

Miralan commented 3 years ago

Have you made sure the mel spectrogram used for training matches the one HiFi-GAN expects? In the spectrogram picture, the energy looks too high, so I think you are using the wrong mel. To check your extracted mel, you can synthesize audio directly from the mel extracted for Tacotron2 training. [image]

tulasiram58827 commented 3 years ago

I am using the mel spectrogram extracted from Tacotron2 to synthesize audio with HiFi-GAN. If you look closely at the provided notebook, the same mel works fine with the Parallel WaveGAN model, which is also fine-tuned on LJSpeech. Do you want me to add any pre-processing of the mel spectrogram before feeding it to the HiFi-GAN model?

Miralan commented 3 years ago

Here is the core code of HiFi-GAN's mel-spectrogram function. If you plan to use the pretrained HiFi-GAN model, make sure the mel spectrograms of Tacotron2, PWGAN, and HiFi-GAN are computed the same way. [image]

tulasiram58827 commented 3 years ago

I understand the code that generates the mel spectrogram. But I am using a pre-trained Tacotron2 model from the Mozilla TTS repo, and I cannot find how their training pipeline generated the mel spectrogram.

Can you suggest any alternative to make this work, such as a pre-processing step before feeding the mel into the HiFi-GAN model? I have worked with Tacotron2, FastSpeech2, PWGAN, MelGAN, and MB-MelGAN, and all of these work as expected.

Miralan commented 3 years ago

I think the best idea is to retrain either HiFi-GAN or Tacotron2. In my opinion, retraining Tacotron2 from the pretrained model with HiFi-GAN features may cost less time and fewer GPU resources.

tulasiram58827 commented 3 years ago

@jik876 @Edresson any other suggestions?

CookiePPP commented 3 years ago

@tulasiram58827

edit:

Nevermind, you could just retrain the hifi-gan model using the MozillaTTS spectrograms, which would take way less effort 😄


To start with, these spectrogram functions are massively different.

https://github.com/jik876/hifi-gan/blob/master/meldataset.py#L28
https://github.com/mozilla/TTS/blob/master/TTS/utils/audio.py#L192

You can try multiplying the Tacotron outputs by 2.3026 if you aren't using normalization or spec_gain on the spectrograms. That converts log10() magnitudes to natural-log magnitudes, since ln(10) ≈ 2.3026. If it sounds too loud or too quiet, you can add or subtract a constant for testing. The audio function in MozillaTTS looks way too complicated to adapt or convert with a simple function. At minimum, an example config file would be nice.
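As a rough illustration of the log10-to-natural-log conversion suggested above, here is a minimal sketch in plain Python. The function name `log10_mel_to_ln`, the `offset` parameter, and the toy values are hypothetical and not taken from either repository. It only applies when the first-stage model outputs plain log10 magnitudes with no normalization or gain.

```python
import math

LN10 = math.log(10.0)  # approximately 2.3026


def log10_mel_to_ln(mel, offset=0.0):
    """Convert log10 magnitudes to natural-log magnitudes.

    `mel` is a list of frames, each a list of log10 values (hypothetical layout).
    `offset` is an optional constant added afterwards, for loudness experiments.
    """
    return [[value * LN10 + offset for value in frame] for frame in mel]


# Toy linear magnitudes, used only to check the conversion.
linear = [[1.0, 0.5, 0.01], [0.2, 0.05, 1e-5]]
log10_mel = [[math.log10(v) for v in frame] for frame in linear]

# After conversion, the values should equal the natural log of the magnitudes.
ln_mel = log10_mel_to_ln(log10_mel)
```

If the first-stage model also normalizes or rescales its spectrograms, a single multiplication like this is not enough, and that processing would have to be undone first.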

tulasiram58827 commented 3 years ago

Thanks, @CookiePPP. Your suggestion didn't work out. I will wait a few days in case anyone else has other suggestions. Otherwise I will retrain.

ysujiang commented 3 years ago

If you use a mel spectrogram extracted with HiFi-GAN's own function (from other wavs that are not in the training dataset) to synthesize audio with HiFi-GAN, does the synthesized wav sound good?

tulasiram58827 commented 3 years ago

Yes, the audio sounds fine, and one can understand it well with prior knowledge of the text used. But I think the pitch is too high. You can go ahead and generate the audio file with this notebook.

Text used : "I am an avenger"

jik876 commented 3 years ago

Publicly available speech synthesis implementations use several different methods to generate mel-spectrograms. The implementation we provided is compatible with NVIDIA Tacotron2 and the Glow-TTS author's implementation. Therefore, depending on how the 1st stage model you want to use generates its mel-spectrograms, the pre-trained model we provided may not be compatible. If the only difference is the log scale, it can easily be converted by multiplication or division, but if other complex pre-processing is involved, I think it would be good to retrain the model. Additionally, I think the simpler way would be best if we could get the same performance.
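One way to spot such incompatibilities before synthesizing is to compare the feature-extraction settings of the two models side by side. The sketch below is only an illustration: the helper `diff_mel_settings`, the key names (including `log_base`), and all values are placeholders chosen for the example, not read from any real config file of either project.

```python
def diff_mel_settings(first_stage, vocoder):
    """Return {key: (first_stage_value, vocoder_value)} for settings that differ."""
    keys = sorted(set(first_stage) | set(vocoder))
    return {
        key: (first_stage.get(key), vocoder.get(key))
        for key in keys
        if first_stage.get(key) != vocoder.get(key)
    }


# Placeholder settings for illustration only (hypothetical keys and values).
first_stage = {
    "sampling_rate": 22050, "n_fft": 1024, "hop_size": 256, "win_size": 1024,
    "num_mels": 80, "fmin": 0, "fmax": 7600, "log_base": "10",
}
vocoder = {
    "sampling_rate": 22050, "n_fft": 1024, "hop_size": 256, "win_size": 1024,
    "num_mels": 80, "fmin": 0, "fmax": 8000, "log_base": "e",
}

mismatches = diff_mel_settings(first_stage, vocoder)
for key, (a, b) in mismatches.items():
    print(f"{key}: first stage = {a}, vocoder = {b}")
```

A mismatch limited to the log base can be fixed with a constant factor, while differences in frequency range, normalization, or gain generally point toward retraining, as described above.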

tulasiram58827 commented 3 years ago

Thanks, @jik876. I will try to retrain the model.

Do you want me to open a PR regarding this?

Also, I created an end-to-end Colab notebook to run the HiFi-GAN model. If you want me to add it to your repo, I will go ahead and create a PR.

faranaziz commented 3 years ago

How was this resolved? I have the exact same problem.

mepc36 commented 2 years ago

I have the same problem. How did you solve the loud buzzing noise, @faranaziz or @tulasiram58827?