seungwonpark / melgan

MelGAN vocoder (compatible with NVIDIA/tacotron2)
http://swpark.me/melgan/
BSD 3-Clause "New" or "Revised" License

How to run inference with MelGAN given a Tacotron mel spectrogram output? #46

Open OswaldoBornemann opened 4 years ago

OswaldoBornemann commented 4 years ago

When I trained MelGAN on mel spectrograms computed from the original wavs, the results sounded fine.

But when I feed a Tacotron mel spectrogram output into the trained MelGAN model, the audio is just buzzing noise. Would you mind sharing some advice? Thanks a lot. @seungwonpark

CookiePPP commented 4 years ago

upload sound samples?

OswaldoBornemann commented 4 years ago

@CookiePPP Please turn the volume down to the lowest setting... I don't want to hurt your ears...

bad result.wav.zip

CookiePPP commented 4 years ago

Do you have the code you used to feed the tacotron outputs into melgan uploaded somewhere? That's definitely bugged out.

OswaldoBornemann commented 4 years ago

@CookiePPP The process is roughly as follows:

First I get the mel spectrogram output from Tacotron, like this:

# mel_sent shape is (spec_length, 80)
mel_sent = tacotron_out(model, sentence, CONFIG, use_cuda, ap, use_gl=use_gl, figures=True)

Then I unsqueeze and transpose the mel result to feed it into MelGAN:

import torch
# imports assumed to match this repo's inference script; adjust the module
# paths if melgan is vendored under ./melgan
from model.generator import Generator
from utils.hparams import load_hparam_str

checkpoint_path = "./melgan/chkpt/id_test1/id_test1_aca5990_0700.pt"
config = "./melgan/config/id_test1.yaml"  # not needed here: hparams are restored from the checkpoint

checkpoint = torch.load(checkpoint_path)
hp = load_hparam_str(checkpoint['hp_str'])

melgan_model = Generator(hp.audio.n_mel_channels).cuda()
melgan_model.load_state_dict(checkpoint['model_g'])
melgan_model.eval()

with torch.no_grad():
    # (spec_length, 80) -> (1, 80, spec_length), the layout the Generator expects
    mel = torch.from_numpy(mel_sent).unsqueeze(0).transpose(2, 1)
    mel = mel.cuda()

    # was model.inference(mel); 'model' here is the Tacotron model, not MelGAN
    audio = melgan_model.inference(mel)
    audio = audio.cpu().numpy()

CookiePPP commented 4 years ago

mel_sent = tacotron_out(model, sentence, CONFIG, use_cuda, ap, use_gl=use_gl, figures=True)

Where does this line come from? This repo is designed to interface with NVIDIA/tacotron2. NVIDIA uses its own spectrogram conversion that, I believe, outputs values between -12 and 2.
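
A quick sanity check is to compare the value range of the Mozilla TTS spectrogram against the roughly -12 to 2 range an NVIDIA-style checkpoint expects. A minimal sketch, assuming mel_sent is the Mozilla TTS output from above:

# mel_sent comes from the tacotron_out(...) call above, shape (spec_length, 80)
print("Mozilla TTS mel range:", mel_sent.min(), mel_sent.max())
# If this prints something far from roughly (-12, 2), e.g. (0, 1) or (-4, 4),
# the normalization does not match what the MelGAN checkpoint was trained on,
# and the vocoder will output noise.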

OswaldoBornemann commented 4 years ago

@CookiePPP I see. I use Mozilla TTS instead.

OswaldoBornemann commented 4 years ago

@CookiePPP I would also like to know whether we could use Tacotron GTA (ground-truth aligned) output to train MelGAN.

CookiePPP commented 4 years ago

@tsungruihon You should be able to scale the output and get an audible result. I don't know what range Mozilla TTS uses, but try to transform the Mozilla output to match the NVIDIA one, e.g.

mel_sent = tacotron_out(model, sentence, CONFIG, use_cuda, ap, use_gl=use_gl, figures=True)
mel_sent = (mel_sent * 0.5) + 2

and replace 0.5 and +2 with whatever values move the spectrogram into the -12 to 2 range.
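
If the Mozilla TTS output is normalized to a known range, a generic linear remap does this; a rough sketch (not code from either repo), where the source range is something you would read off your own mel_sent:

def rescale_mel(mel, src_min, src_max, dst_min=-12.0, dst_max=2.0):
    # Linearly map a mel spectrogram from [src_min, src_max] to [dst_min, dst_max].
    # src_min / src_max come from inspecting your Mozilla TTS output
    # (e.g. mel_sent.min() / mel_sent.max()); the target range approximates
    # what an NVIDIA/tacotron2-style MelGAN checkpoint expects.
    scale = (dst_max - dst_min) / (src_max - src_min)
    return (mel - src_min) * scale + dst_min

# example: Mozilla output normalized to [0, 1]
# mel_sent = rescale_mel(mel_sent, 0.0, 1.0)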

@CookiePPP I would also like to know whether we could use Tacotron GTA (ground-truth aligned) output to train MelGAN.

Not sure; I'm busy today, so I can't really help you there.

OswaldoBornemann commented 4 years ago

@CookiePPP Really appreciate it. Thanks a lot.

mennatallah644 commented 3 years ago

I'm facing the same problem. Did you find a solution? @tsungruihon

OswaldoBornemann commented 3 years ago

Please visit https://github.com/mozilla/TTS