Closed · tulasiram58827 closed this issue 3 years ago
Have you made sure the training mel spectrogram matches HiFi-GAN's? In the frequency plot, its energy looks too large, so I think you are using the wrong mel. You can synthesize audio directly from the mel extracted for Tacotron2 training to check your extracted mel.
I am using the mel spectrogram extracted from Tacotron2 to synthesize audio with HiFi-GAN. If you look closely at the provided notebook, the same mel works fine with a parallel wavegan model that is also fine-tuned on LJSpeech. Do you want me to add any pre-processing to the mel spectrogram before feeding it to the HiFi-GAN model?
Here is the core code of the HiFi-GAN mel spectrogram function. Make sure the mels used by Tacotron2, PWGAN, and HiFi-GAN are the same if you plan to use a pre-trained HiFi-GAN model.
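For reference, the key step in HiFi-GAN's `meldataset.py` is that the linear mel magnitudes are compressed with a *natural* log after clamping. Below is a minimal NumPy sketch of that compression step (the real repo uses `torch.log` / `torch.clamp`; the function name and clip value here mirror the repo but this is a simplified illustration, not the full STFT + mel-filterbank pipeline):

```python
import numpy as np

def dynamic_range_compression(mel, clip_val=1e-5):
    """HiFi-GAN-style log compression: natural log of the clamped
    linear-scale mel magnitudes (NumPy sketch of the torch version)."""
    return np.log(np.clip(mel, clip_val, None))

# A toy linear-scale mel frame; real values come from an STFT + mel filterbank.
linear_mel = np.array([0.0, 1e-6, 1.0, 10.0])
log_mel = dynamic_range_compression(linear_mel)
# Values below clip_val are clamped before the log, so nothing blows up to -inf.
```

If a first-stage model produces mels with a different log base or extra scaling, this is exactly where the mismatch shows up.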
I understand the code that generates the mel spectrogram. But I am using a pre-trained Tacotron2 model from the Mozilla TTS repo, and I cannot find the training pipeline that shows how they generated the mel spectrogram.
Can you suggest any other alternative to make this work, like any pre-processing step before feeding into the HiFi-GAN model? I have worked with Tacotron2, FastSpeech2, PWGAN, MelGAN, and MB-MelGAN, and all of these work as expected.
I think the best idea is to retrain either HiFi-GAN or Tacotron2. In my opinion, retraining Tacotron2 from a pre-trained model with HiFi-GAN features may cost less time and fewer GPU resources.
@jik876 @Edresson any other suggestions?
@tulasiram58827
edit:
Nevermind, you could just retrain the hifi-gan model using the MozillaTTS spectrograms, which would take way less effort 😄
To start with, these spectrogram functions are massively different.
https://github.com/jik876/hifi-gan/blob/master/meldataset.py#L28 https://github.com/mozilla/TTS/blob/master/TTS/utils/audio.py#L192
You can try multiplying the Tacotron outputs by 2.3026
if you aren't using normalization or spec_gain on the spectrograms. That will convert log10() magnitudes to log() magnitudes. If it sounds too loud or too quiet, you can add or subtract a constant for testing.
This audio function in MozillaTTS looks way too complicated for a simple adapting/converting function. At minimum, an example config file would be nice.
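To illustrate why a plain log-base conversion may not be enough: MozillaTTS typically applies a log10-based dB scaling *and* a min-max normalization on top. The sketch below is my own simplified reading of that style of pipeline — the parameter names and defaults (`spec_gain`, `min_level_db`, `max_norm`) are typical MozillaTTS config values, not something confirmed in this thread:

```python
import numpy as np

def mozilla_style_compress(linear_mel, spec_gain=20.0, min_level=1e-5):
    # log10-based dB scaling, instead of HiFi-GAN's plain natural log
    return spec_gain * np.log10(np.maximum(min_level, linear_mel))

def mozilla_style_normalize(mel_db, min_level_db=-100.0, max_norm=4.0):
    # squash dB values into a symmetric [-max_norm, max_norm] range
    norm = (mel_db - min_level_db) / -min_level_db  # -> roughly [0, 1]
    return np.clip(2 * max_norm * norm - max_norm, -max_norm, max_norm)

mel = mozilla_style_normalize(mozilla_style_compress(np.array([1e-6, 0.1, 1.0])))
```

If normalization like this is enabled, simply multiplying by 2.3026 cannot undo it, which would explain why the suggestion above does not work.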
Thanks, @CookiePPP your suggestion didn't work out. I will wait for a few days if anyone else has any other suggestions. Otherwise I will retrain
If you use a mel spectrogram extracted with HiFi-GAN's own function (from wavs not in the training dataset) to synthesize audio with HiFi-GAN, does the synthesized wav sound good?
Yes, the audio sounds fine, and one can understand it well with prior knowledge of the text used, but I think the pitch is too high. You can go ahead and generate the audio file with this notebook.
Text used : "I am an avenger"
Publicly available speech synthesis implementations use several different methods to generate mel spectrograms. The implementation we provide is compatible with the NVIDIA Tacotron2 and Glow-TTS authors' implementations. Therefore, depending on how the first-stage model you want to use generates its mel spectrogram, the pre-trained model we provide may not be compatible. If the difference is only the log scale, it can easily be converted by multiplication or division, but if other complex pre-processing is involved, I think it would be better to retrain the model. Additionally, I think the simpler way is best if we can get the same performance.
Thanks, @jik876 I will try to retrain the model.
Do you want me to open a PR regarding this?
Also, I created an End-to-End Colab Notebook to run the HiFi-GAN model. If you want me to add this to your repo, I will go ahead and create a PR.
How was this resolved? I have the exact same problem.
I have the same problem. How did you solve the loud buzzing noise, @faranaziz or @tulasiram58827?
Hi @jik876 @Edresson
I have been trying to integrate Tacotron2 and HiFi-GAN to create a fully end-to-end TTS. But when I feed the Tacotron2 output to your fine-tuned HiFi-GAN model, the output audio is just a buzzing sound. To make sure the Tacotron2 output is correct, I fed it to a parallel wavegan model, and it works as expected. So I believe there is some incompatibility in feeding Tacotron2 output into HiFi-GAN. To reproduce this, I created a Colab notebook; you can reproduce the output with the above-mentioned notebook.
Also, I created an End-to-End Colab Notebook to run the HiFi-GAN model. If you want me to add this to your repo, I will go ahead and create a PR.
Also, I plan to convert this HiFi-GAN model to TFLite format to help mobile developers. We (@sayakpaul and I) have already converted a few models to TFLite format. You can find more details about our TFLite repo here.