jik876 / hifi-gan

HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis
MIT License

How to use HiFi-GAN with mel-spectrogram from Tacotron #115

Open Adibian opened 2 years ago

Adibian commented 2 years ago

Hi, thanks for this great implementation. I trained Tacotron with WaveRNN as the vocoder, and the results were good. Now I want to use HiFi-GAN as the vocoder, so I cloned this project and ran it on the mel-spectrogram produced by Tacotron, but the result was very noisy! You can listen to the result here. The code and some printed output follow:

x = synthesizer.synthesize_spectrograms(texts, embeds)
print(x)
with torch.no_grad():
    x = torch.from_numpy(x).to(device)
    x = x.unsqueeze(0)
    print(x.shape)
    y_g_hat = generator(x)
    audio = y_g_hat.squeeze()
    audio = audio.cpu().numpy()

[[-3.7809916 -3.7548614 -3.810056 ... -3.900182 -3.9460294 -3.812808 ]
 [-3.9151711 -3.9173453 -4.018536 ... -3.9720445 -4.00277 -3.8934054]
 [-3.8845775 -3.906875 -4.005348 ... -3.9689758 -3.9909463 -3.8506253]
 ...
 [-3.4353547 -3.374465 -3.3435407 ... -3.312714 -3.3319786 -3.4587195]
 [-3.4256952 -3.3695805 -3.339043 ... -3.3475647 -3.3527846 -3.4885204]
 [-3.49198 -3.4372733 -3.4019573 ... -3.3571553 -3.3713949 -3.488055 ]]

torch.Size([1, 80, 325])

I then took the wav that WaveRNN produced for this mel-spectrogram, recomputed the mel-spectrogram from it, and ran HiFi-GAN on that; the result was great. The code and some printed output follow:

x = torch.FloatTensor(wav).to(device)
x = get_mel(x.unsqueeze(0))
print(x)
print(x.shape)
with torch.no_grad():
    y_g_hat = generator(x)
    print(y_g_hat.shape)
    audio = y_g_hat.squeeze()
    audio = audio.cpu().numpy()

tensor([[[-11.5129, -8.3294, -6.7415, ..., -11.5129, -11.5129, -11.5129],
         [-11.5129, -9.4963, -6.9599, ..., -11.5129, -11.5129, -11.5129],
         [-11.5129, -9.9807, -7.2935, ..., -11.5129, -11.5129, -11.5129],
         ...,
         [-11.5129, -9.8441, -8.7276, ..., -11.5129, -11.5129, -11.5129],
         [-11.5129, -10.4379, -9.1857, ..., -11.5129, -11.5129, -11.5129],
         [-11.5129, -10.4412, -9.2296, ..., -11.5129, -11.5129, -11.5129]]], device='cuda:0')
torch.Size([1, 80, 327])

So HiFi-GAN itself works well, but I think something needs to change (hop_size, perhaps?) before Tacotron's mel-spectrogram can be used directly. What should I change, and how?
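One thing the two printouts already hint at is a difference in value scale rather than in shape. The silent frames of the working mel sit at -11.5129, which matches the natural log of 1e-5, while the Tacotron output only spans roughly -4.0 to -3.3. The check below is a minimal sketch, assuming the working mel is a natural-log magnitude clamped at 1e-5 (which, as far as I can tell, is what get_mel in this repository does); the variable names are illustrative only.

```python
import math

# Floor value seen in the mel computed by get_mel (copied from the printout above).
observed_floor = -11.5129

# Assumption: the mel is a natural-log magnitude clamped at 1e-5 before the log.
clip_val = 1e-5
expected_floor = math.log(clip_val)

print(round(expected_floor, 4))

# True if the observed floor is consistent with ln(1e-5) up to print precision.
floor_matches = abs(expected_floor - observed_floor) < 1e-3
print(floor_matches)
```

If the check holds, the generator expects inputs on that log scale, so a mel on a different scale would need converting before it is passed in.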

v-nhandt21 commented 2 years ago

I guess the mel-spectrogram generated by Tacotron is normalized. If so, you should de-normalize it using the mean and std before passing it to HiFi-GAN.
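The de-normalization suggested here could look roughly like the sketch below. It assumes the Tacotron output was z-score normalized per mel bin, and that the mean and std used in training were saved somewhere; `denormalize_mel`, `mean`, and `std` are hypothetical names, and the statistics are computed from toy data purely for illustration. If the Tacotron setup uses a different normalization (for example a dB scale clipped to a fixed range), the inverse would differ.

```python
import numpy as np

def denormalize_mel(mel_norm, mean, std):
    """Undo z-score normalization: mel = mel_norm * std + mean (per mel bin)."""
    mel_norm = np.asarray(mel_norm, dtype=np.float32)
    # Reshape statistics to (n_mels, 1) so they broadcast across time frames.
    mean = np.asarray(mean, dtype=np.float32).reshape(-1, 1)
    std = np.asarray(std, dtype=np.float32).reshape(-1, 1)
    return mel_norm * std + mean

# Toy data standing in for a real mel and real training statistics (hypothetical values).
n_mels, n_frames = 80, 325
rng = np.random.default_rng(0)
mel = rng.normal(-6.0, 2.0, size=(n_mels, n_frames)).astype(np.float32)
mean = mel.mean(axis=1)
std = mel.std(axis=1)
mel_norm = (mel - mean[:, None]) / std[:, None]

# Round trip: normalizing and then de-normalizing should recover the original mel.
restored = denormalize_mel(mel_norm, mean, std)
print(restored.shape)
print(np.allclose(restored, mel, atol=1e-4))
```

The restored array would then go through the same torch.from_numpy(...).unsqueeze(0) steps as in the question before being passed to the generator.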