Multi-speaker TTS with ESPnet mel-spectrograms

Hello!

I have been following the system described in this paper by Y. Jia et al.: Link. So far, I have trained the synthesizer module using the ESPnet Tacotron 2 multi-speaker TTS scripts provided here: Link. Training finished, and the model produces intelligible, albeit robotic, speech with Griffin-Lim.

Now, to improve the synthesized output, I decided to train a WaveNet vocoder on the synthesized mel-spectrograms (the mel-spectrograms predicted for the training set), as described in the paper. I trained the model for 1000k steps and checked the output, which was garbled speech. I then extended training (without changing the hparams) to 1600k steps, but there was still no improvement. Sample synthesized audio files (and the hparams file) can be found here: Link.

Any help or insight on how to proceed would be very much appreciated. Thanks!

r9y9 / wavenet_vocoder

Multi-speaker TTS with ESPnet mel-spectrograms #209