Open 2Bye opened 3 years ago
I do not think you need/should fine-tune HiFi-GAN when using Glow-TTS. Glow-TTS doesn't have an oversmoothing problem that fine-tuning could resolve.
Do you have any audio samples before fine-tuning? I'm pretty sure if there is an audio quality issue then it'd stem from Glow-TTS or your speaker being out of the training set that HiFi-GAN was trained on.
I used the same training data that I trained Glow-TTS on and generated the npy files from it.
I will attach a training wav file and its npy: wavs.zip
@4nton-P
Hello. To get mel-spectrograms for fine-tuning, you need to make some changes to the code. If you set the 'gen' argument to True, the length of the generated mel-spectrogram may not match the length of the ground truth audio. In the branch of the forward operation where 'gen' is False, there is a part that generates the mean and variance using the output of the encoder and the output of the decoder. If you use these to sample z from a Gaussian and feed it to the decoder with 'reverse=True', you will get the desired result. See lines 313 and 299 in models.py. Also, 'noise_scale' can affect the quality. You will get good results with the default settings, but experimenting with different 'noise_scale' values is worth trying.
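The sampling step described above can be sketched as follows. This is a minimal illustration in plain Python, not the Glow-TTS code: `expand_by_alignment`, `sample_latent`, their arguments, and the default `noise_scale=1.0` are all hypothetical, and each token carries one scalar mean and log standard deviation instead of a full channel vector. It assumes an alignment has already assigned every ground truth frame to a token, so the per-token durations sum to the ground truth mel length.

```python
import math
import random


def expand_by_alignment(values, durations):
    """Repeat each per-token value for the number of frames aligned to that token."""
    frames = []
    for value, n_frames in zip(values, durations):
        frames.extend([value] * n_frames)
    return frames


def sample_latent(token_means, token_log_stds, durations, noise_scale=1.0, seed=0):
    """Sample one latent value per frame: z = mean + exp(log_std) * eps * noise_scale.

    All names here are hypothetical; this only mirrors the idea in the comment above.
    """
    rng = random.Random(seed)
    frame_means = expand_by_alignment(token_means, durations)
    frame_log_stds = expand_by_alignment(token_log_stds, durations)
    return [
        mean + math.exp(log_std) * rng.gauss(0.0, 1.0) * noise_scale
        for mean, log_std in zip(frame_means, frame_log_stds)
    ]


# Three tokens aligned to 2, 3 and 1 frames give a latent of 6 frames,
# the same length as the ground truth mel those durations came from.
z = sample_latent([0.5, -1.0, 2.0], [0.0, 0.1, -0.2], [2, 3, 1], noise_scale=0.667)
print(len(z))
```

Because the frame count comes from the alignment rather than from predicted durations, the latent (and whatever the decoder produces from it in reverse mode) stays the same length as the ground truth.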
@jik876 Thanks, I will try to do this. I have already experimented with tuning "noise_scale". If everything works out, I will close the discussion.
Hello @jik876! I took the code from this issue to generate the .npy files and am also trying to fine-tune HiFi-GAN with Glow-TTS. I set the parameter 'reverse=True' in models.py, generated the .npy files, and got the same error (RuntimeError: stack expects each tensor to be equal size, but got [8192] at entry 0 and [3037] at entry 1). What could have gone wrong?
@romadomaa Please understand that we are a bit busy with other work. The above explanation is to match the length with the ground truth using MAS. Posting your modified code will be helpful to find a solution.
@4nton-P @jik876 Can you share the code changes to fine tune hifigan w.r.t. GlowTTS predicted Mel?
I am taking mels from FastSpeech2 and feeding them to HiFi-GAN to generate audio, but I am getting noise in the audio file. I made the shapes compatible, but there still seem to be problems internally. Please share any ideas that I can try.
Hello! I'm trying to fine-tune HiFi-GAN with npy files from Glow-TTS. I generate the npy files with this code:
Next, I make a metafile: wavs/x.wav | ft_dataset/x.npy
And I get the following error: RuntimeError: stack expects each tensor to be equal size, but got [8192] at entry 0 and [6623] at entry 6
HiFi-GAN does generate wavs from these npy files in inference mode with Glow-TTS.
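A stack error like this one would be expected if some mel files cover more frames than their wav can supply, so that an audio segment cut to match the mel comes out shorter than the others in the batch. The sketch below is one way to look for such pairs before training. It is only a check under assumptions: `find_length_mismatches` is a hypothetical helper, and `hop_size=256` is assumed here and should be replaced with the value from the training config.

```python
def find_length_mismatches(pairs, hop_size=256):
    """Return names of pairs whose mel needs more audio samples than the wav has.

    pairs: iterable of (name, num_mel_frames, num_audio_samples).
    hop_size: samples per mel frame (assumed value; take it from your config).
    """
    mismatched = []
    for name, num_mel_frames, num_audio_samples in pairs:
        samples_needed = num_mel_frames * hop_size
        if samples_needed > num_audio_samples:
            mismatched.append(name)
    return mismatched


# Hypothetical lengths; in practice the frame count could come from
# numpy.load(path).shape[-1] and the sample count from the loaded wav.
pairs = [
    ("x1", 100, 25600),  # 100 * 256 == 25600, fits
    ("x2", 120, 25600),  # 120 * 256 == 30720, longer than the wav
]
print(find_length_mismatches(pairs))
```

Any name this returns points at a mel that is longer than its audio, which is the situation the reply about matching the ground truth length is meant to avoid.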