jik876 / hifi-gan

HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis
MIT License

Fine-tuning HiFi-GAN with Glow-TTS npy #30

Open 2Bye opened 3 years ago

2Bye commented 3 years ago

Hello! I'm trying to fine-tune HiFi-GAN with Glow-TTS npy files. I generate the npy files with this code:

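import numpy as np
import torch

# hps, model, text_to_sequence, cmu_dict, commons and symbols are assumed to be
# defined already by the Glow-TTS setup code.
def TTS(tst_stn, path):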
    if getattr(hps.data, "add_blank", False):
        text_norm = text_to_sequence(tst_stn.strip(), ['english_cleaners'], cmu_dict)
        text_norm = commons.intersperse(text_norm, len(symbols))
    else: 
        tst_stn = " " + tst_stn.strip() + " "
        text_norm = text_to_sequence(tst_stn.strip(), ['english_cleaners'], cmu_dict)
    sequence = np.array(text_norm)[None, :]
    x_tst = torch.autograd.Variable(torch.from_numpy(sequence)).cuda().long()
    x_tst_lengths = torch.tensor([x_tst.shape[1]]).cuda()

    with torch.no_grad():
        noise_scale = 0.667
        length_scale = 1.0
        (y_gen_tst, *_), *_, (attn_gen, *_) = model(x_tst, x_tst_lengths, gen=True, noise_scale=noise_scale, length_scale=length_scale)

    np.save("hf/ft_dataset/" + path.split('/')[1]  + '.npy', y_gen_tst.cpu().detach().numpy())

Next, I make a metafile: wavs/x.wav | ft_dataset/x.npy

And I get the following error: RuntimeError: stack expects each tensor to be equal size, but got [8192] at entry 0 and [6623] at entry 6

HiFi-GAN does generate wavs from these npy files in inference mode with Glow-TTS.
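A quick sanity check that may help with an error like this is to compare the number of frames in each saved mel with the number of frames implied by the matching wav. The sketch below is only an illustration: the hop size of 256 is an assumption (use the value from your HiFi-GAN config), the function names are hypothetical, and the exact frame count depends on how the STFT pads the signal, so a difference of a frame or so can be normal.

```python
import math

HOP_SIZE = 256  # assumption: hop size from the HiFi-GAN config in use


def expected_frames(num_samples, hop_size=HOP_SIZE):
    """Number of mel frames needed to cover num_samples audio samples."""
    return math.ceil(num_samples / hop_size)


def length_mismatch(num_frames, num_samples, hop_size=HOP_SIZE):
    """Signed difference between the mel's frame count and the count implied by the audio."""
    return num_frames - expected_frames(num_samples, hop_size)


# Hypothetical numbers: a 44100-sample clip and a generated mel with 180 frames.
num_samples = 44100
print(expected_frames(num_samples))       # frames implied by the audio
print(length_mismatch(180, num_samples))  # nonzero means the npy and the wav disagree
```

In practice `num_frames` would come from the last axis of the loaded npy array and `num_samples` from the loaded wav. A large mismatch points to a mel whose length was chosen by the TTS model rather than by the recording.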

CookiePPP commented 3 years ago

I do not think you need/should fine-tune HiFi-GAN when using Glow-TTS. Glow-TTS doesn't have an oversmoothing problem that fine-tuning could resolve.

Do you have any audio samples before fine-tuning? I'm pretty sure if there is an audio quality issue then it'd stem from Glow-TTS or your speaker being out of the training set that HiFi-GAN was trained on.

2Bye commented 3 years ago

I used the same training data that I trained Glow-TTS on and generated the npy files from it.

I am attaching a training wav file and its npy: wavs.zip

jik876 commented 3 years ago

@4nton-P

Hello. To get mel-spectrograms for fine-tuning, you need to make some changes to the code. If you set the 'gen' argument to True, the length of the generated mel-spectrogram may not match the length of the ground truth audio. In the branch of the forward operation where 'gen' is False, there is a part that generates the mean and variance using the output of the encoder and the output of the decoder. If you use these to sample z from a Gaussian and feed it to the decoder with 'reverse=True', you will get the desired result. See lines 313 and 299 in models.py. Note that 'noise_scale' can affect the quality. You will get good results with the default settings, but it is worth experimenting with different 'noise_scale' values.
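The idea described above can be sketched without the model itself. This is a simplified, single-channel illustration and not the Glow-TTS code: the per-token statistics, the durations, and both function names are hypothetical. It only shows why sampling z from statistics expanded along the ground-truth alignment yields a latent with exactly as many frames as the real mel.

```python
import math
import random


def expand_by_durations(token_stats, durations):
    """Repeat each token's (mean, log_std) once per frame the alignment assigns to it."""
    frame_stats = []
    for stats, dur in zip(token_stats, durations):
        frame_stats.extend([stats] * dur)
    return frame_stats


def sample_latent(frame_stats, noise_scale=0.667, seed=0):
    """Draw z = mean + exp(log_std) * eps * noise_scale for every frame."""
    rng = random.Random(seed)
    return [m + math.exp(logs) * rng.gauss(0.0, 1.0) * noise_scale
            for m, logs in frame_stats]


# Hypothetical (mean, log_std) pairs for three text tokens.
token_stats = [(0.1, -1.0), (0.5, -0.5), (-0.3, -2.0)]
# Hypothetical per-token durations taken from the alignment with the real mel.
durations = [2, 4, 1]

frame_stats = expand_by_durations(token_stats, durations)
z = sample_latent(frame_stats)

# z has one value per ground-truth frame. In the real model, z would then be
# passed through the decoder with reverse=True to obtain the mel-spectrogram.
print(len(z), sum(durations))
```

With 'gen' set to True the durations come from the duration predictor instead, which is why the resulting mel can be longer or shorter than the recording.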

2Bye commented 3 years ago

@jik876 Thanks, I will try this. I have already experimented with tuning "noise_scale". If everything works out, I will close the discussion.

romadomaa commented 3 years ago

Hello @jik876! I took the code from this issue to generate the .npy files, and I am also trying to fine-tune HiFi-GAN with Glow-TTS. I set the parameter 'reverse=True' in models.py, generated the .npy files, and got the same error (RuntimeError: stack expects each tensor to be equal size, but got [8192] at entry 0 and [3037] at entry 1). What could have gone wrong?

jik876 commented 3 years ago

@romadomaa Please understand that we are a bit busy with other work. The purpose of the explanation above is to match the length to the ground truth using MAS. Posting your modified code would help in finding a solution.

debasish-mihup commented 3 years ago

@4nton-P @jik876 Can you share the code changes needed to fine-tune HiFi-GAN on mels predicted by Glow-TTS?

Rashi2011 commented 3 years ago

I am taking mels from FastSpeech2 and feeding them to HiFi-GAN to generate audio, but I get noise in the audio file. I made the shapes compatible, but there are still problems internally. Please share any ideas that I could try.