ming024 / FastSpeech2

An implementation of Microsoft's "FastSpeech 2: Fast and High-Quality End-to-End Text to Speech"
MIT License

Support for HiFi-GAN #23

Open loretoparisi opened 3 years ago

loretoparisi commented 3 years ago

HiFi-GAN achieves state-of-the-art results in waveform generation from mel spectrograms.


Is it possible to add support for the HiFi-GAN model after mel generation, in order to create the wave file?

    # FastSpeech 2 forward pass: mel before/after the postnet plus variance predictions
    mel, mel_postnet, log_duration_output, f0_output, energy_output, _, _, mel_len = model(text, src_len)

    # [B, T, n_mels] -> [B, n_mels, T], the layout the vocoder expects
    mel_torch = mel.transpose(1, 2)
    mel_postnet_torch = mel_postnet.transpose(1, 2)
    mel = mel[0].cpu().transpose(0, 1)
    mel_postnet = mel_postnet[0].cpu().transpose(0, 1)
    f0_output = f0_output[0].cpu().numpy()
    energy_output = energy_output[0].cpu().numpy()

    if not os.path.exists(hp.test_path):
        os.makedirs(hp.test_path)

    if melgan is not None:
        with torch.no_grad():
            # vocode the mel spectrogram into a waveform (could HiFi-GAN be used here instead?)
            wav = melgan.inference(mel_torch).cpu().numpy()
            wav = wav.astype('int16')
            # ipd.display(ipd.Audio(wav, rate=hp.sampling_rate))
            # save the audio file
            write(os.path.join(GENERATED_SPEECH_DIR, prefix + '.wav'), hp.sampling_rate, wav)

Or would some additional adaptation be needed?

For end-to-end inference with HiFi-GAN, the generation code would look like this:

def inference(a):
    # build the HiFi-GAN generator and load the pretrained weights
    generator = Generator(h).to(device)

    state_dict_g = load_checkpoint(a.checkpoint_file, device)
    generator.load_state_dict(state_dict_g['generator'])
    generator.eval()
    generator.remove_weight_norm()
    with torch.no_grad():
        # run the mel spectrogram through the generator and save 16-bit PCM audio
        x = torch.FloatTensor(mel_torch).to(device)
        y_g_hat = generator(x)
        audio = y_g_hat.squeeze()
        audio = audio * MAX_WAV_VALUE
        audio = audio.cpu().numpy().astype('int16')
        write(os.path.join(GENERATED_SPEECH_DIR, prefix + '.wav'), hp.sampling_rate, audio)

where mel_torch is our mel spectrogram.
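
For reference, a more self-contained sketch of that glue could look like the code below. It assumes the Generator and AttrDict classes and the checkpoint layout of the official HiFi-GAN repository (jik876/hifi-gan); the config and checkpoint paths in the usage comment are placeholders, and mel_postnet_torch is the [batch, n_mels, frames] tensor from the snippet above.

import json
import torch
from scipy.io.wavfile import write

from env import AttrDict        # from the official HiFi-GAN repository
from models import Generator    # from the official HiFi-GAN repository

MAX_WAV_VALUE = 32768.0
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')


def load_hifigan(config_path, checkpoint_path):
    # build the generator from its config.json and load the pretrained weights
    with open(config_path) as f:
        h = AttrDict(json.load(f))
    generator = Generator(h).to(device)
    state_dict_g = torch.load(checkpoint_path, map_location=device)
    generator.load_state_dict(state_dict_g['generator'])
    generator.eval()
    generator.remove_weight_norm()
    return generator, h


def mel_to_wav(generator, mel, sampling_rate, out_path):
    # vocode a [batch, n_mels, frames] mel spectrogram and save it as 16-bit PCM
    with torch.no_grad():
        audio = generator(mel.to(device)).squeeze()
        audio = (audio * MAX_WAV_VALUE).cpu().numpy().astype('int16')
    write(out_path, sampling_rate, audio)


# hypothetical usage with the FastSpeech 2 output from the snippet above
# (paths are placeholders, not files guaranteed to ship with this repo):
# generator, h = load_hifigan('hifigan/config.json', 'hifigan/generator_universal.pth.tar')
# mel_to_wav(generator, mel_postnet_torch, h.sampling_rate, 'output.wav')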

ming024 commented 3 years ago

Thanks for your suggestion. It is supported now and indeed the audio quality is much better!

loretoparisi commented 3 years ago

@ming024 Super, let me try it out. How can I choose it for the English voice? Thanks!

chrr commented 3 years ago

Hi, thanks for your efforts in putting this amazing repo together! With your latest changes, I get

FileNotFoundError: [Errno 2] No such file or directory: 'hifigan/config.json'

when running synthesize.py. Would you mind adding the hifigan config as well?

ming024 commented 3 years ago

@loretoparisi In my experience vocoders are generally independent of, or only weakly dependent on, the language, so feel free to try it.

@chrr I somehow forgot to upload the hifigan/ directory. It should be fixed now.

zaidalyafeai commented 3 years ago

Hey @ming024, I am working on Arabic, which has a different script from English; will that affect the results? Also, should I use the universal HiFi-GAN model?

ming024 commented 3 years ago

@zaidalyafeai I believe the universal HiFi-GAN yields the best results for unknown speakers. I also think the pretrained vocoders should not suffer a great performance drop on other languages, as long as the same preprocessing hyperparameters are used.

zaidalyafeai commented 3 years ago

Thanks @ming024, I tested both vocoders and the universal one is indeed much better. Which preprocessing hyperparameters affect the vocoders the most?

ming024 commented 3 years ago

@zaidalyafeai The preprocessing parameters should match those of the pretrained vocoders, or there may be strange results.
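
To make that concrete, the STFT/mel settings used to extract the training mels must be identical to the ones the vocoder was trained with. Below is an illustrative checker; the FastSpeech 2 key names assume the yaml-style preprocess config of this repo, the vocoder key names follow HiFi-GAN's config.json, the example paths in the usage comment are placeholders, and the mapping should be adjusted if your versions differ.

import json
import yaml

# (FastSpeech 2 preprocess section, key) -> HiFi-GAN config.json key
PAIRS = [
    (('audio', 'sampling_rate'),  'sampling_rate'),
    (('stft',  'filter_length'),  'n_fft'),
    (('stft',  'hop_length'),     'hop_size'),
    (('stft',  'win_length'),     'win_size'),
    (('mel',   'n_mel_channels'), 'num_mels'),
    (('mel',   'mel_fmin'),       'fmin'),
    (('mel',   'mel_fmax'),       'fmax'),
]


def check_configs(preprocess_yaml, hifigan_json):
    # print every mel/STFT setting and flag the ones that disagree
    with open(preprocess_yaml) as f:
        fs2 = yaml.safe_load(f)['preprocessing']
    with open(hifigan_json) as f:
        voc = json.load(f)
    for (section, key), voc_key in PAIRS:
        a, b = fs2[section][key], voc[voc_key]
        flag = 'OK      ' if a == b else 'MISMATCH'
        print(f'{flag} {key}: FastSpeech2={a}  vocoder({voc_key})={b}')


# example (placeholder paths):
# check_configs('config/LJSpeech/preprocess.yaml', 'hifigan/config.json')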

malradhi commented 2 years ago

@zaidalyafeai @ming024 Did you use the existing pre-trained universal vocoder, or did you train it from scratch when you used it with, for example, Arabic data? I am now trying to add a pretrained VITS vocoder to FastSpeech 2 (using the same preprocessing hyperparameters), but I only get noisy voice output. Thanks in advance for your answer!

JohnHerry commented 1 year ago

> @zaidalyafeai @ming024 Did you use the existing pre-trained universal vocoder, or did you train it from scratch when you used it with, for example, Arabic data? I am now trying to add a pretrained VITS vocoder to FastSpeech 2 (using the same preprocessing hyperparameters), but I only get noisy voice output. Thanks in advance for your answer!

No, you cannot do that. The standard HiFi-GAN vocoder is trained to map mel spectrograms to waveforms, so it can be used as a vocoder for FastSpeech 2. The VITS decoder (which has nearly the same structure as HiFi-GAN) is instead trained to generate waveforms from the VITS latent variable "z", not from mel spectrograms, so the two are completely different.
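
To illustrate just that interface difference, the sketch below contrasts the tensors the two decoders consume; the shapes are schematic only, using the usual 80 mel bins and a 192-channel VITS latent as example sizes, and hifigan_generator / vits_decoder are hypothetical handles to the respective pretrained models.

import torch

B, T = 1, 200                       # batch size, number of frames (illustrative)

# HiFi-GAN vocoder: input is a mel spectrogram, which FastSpeech 2 can predict.
mel = torch.randn(B, 80, T)         # [batch, n_mel_channels, frames]
# wav = hifigan_generator(mel)      # -> [batch, 1, frames * hop_size]

# VITS decoder: input is the latent variable z sampled inside VITS,
# not a mel spectrogram, so a FastSpeech 2 mel cannot be fed to it directly.
z = torch.randn(B, 192, T)          # [batch, inter_channels, frames]
# wav = vits_decoder(z)             # -> [batch, 1, frames * hop_size]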