kan-bayashi / ParallelWaveGAN

Unofficial Parallel WaveGAN (+ MelGAN & Multi-band MelGAN & HiFi-GAN & StyleMelGAN) with PyTorch
https://kan-bayashi.github.io/ParallelWaveGAN/
MIT License

Questions on E2E-TTS demo #136

Closed · dawdleryang closed this issue 4 years ago

dawdleryang commented 4 years ago

Dear Kan-Bayashi,

I have two questions about your E2E-TTS Mandarin demo on Colab. Could you help?

In the "Download pretrained models" cell of the notebook it says: "You can select Transformer or FastSpeech."

1) For the Mandarin demo you only provide Transformer and FastSpeech options, not Tacotron2. Is there a recipe for training Tacotron2 with ESPnet and combining it with your ParallelWaveGAN?
2) How did you train your Transformer and FastSpeech models? Is there a recipe I can follow to reproduce them?

In other words, my main concern is how to combine Tacotron2/Transformer/FastSpeech with ParallelWaveGAN, i.e. how to make the mel-spectrogram features common to both models.

Thank you very much.

kan-bayashi commented 4 years ago
  1. Of course. You can do it by just changing the training config; see the config files here: https://github.com/espnet/espnet/tree/master/egs/csmsc/tts1/conf/tuning

  2. Yes, you can train them by just changing the training config, but you first need to train a teacher model using the Transformer or Tacotron 2 config. https://github.com/espnet/espnet/blob/master/egs/csmsc/tts1/conf/tuning/train_fastspeech.v3.single.yaml

By using the recipes and changing the config files, you can build an arbitrary network. What you need to be careful about is the feature settings (mel range, window size, shift size, etc.): you must use the same feature settings for both the Text2Mel and Mel2Wav models.
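As a rough illustration of that point, here is a minimal sketch that loads a Text2Mel config and a Mel2Wav config and compares the feature-extraction entries. The key names and paths are assumptions, not an official tool from either project; ESPnet configs typically use `fs`/`n_shift` while ParallelWaveGAN configs use `sampling_rate`/`hop_size`, so adjust the pairs to your actual files.

```python
# Sketch: compare feature settings of a Text2Mel (ESPnet-style) config and a
# Mel2Wav (ParallelWaveGAN-style) config. Key names and paths are assumptions.
import yaml

def load_config(path):
    with open(path) as f:
        return yaml.safe_load(f)

text2mel = load_config("conf/tuning/train_fastspeech.v3.single.yaml")
mel2wav = load_config("conf/parallel_wavegan.v1.yaml")

# (Text2Mel key, Mel2Wav key) pairs that must describe the same feature extraction.
paired_keys = [
    ("fs", "sampling_rate"),
    ("n_mels", "num_mels"),
    ("fmin", "fmin"),
    ("fmax", "fmax"),
    ("n_shift", "hop_size"),
    ("win_length", "win_length"),
]

for t_key, v_key in paired_keys:
    t_val, v_val = text2mel.get(t_key), mel2wav.get(v_key)
    status = "OK" if t_val == v_val else "MISMATCH"
    print(f"{status}: {t_key}={t_val!r} vs {v_key}={v_val!r}")
```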

dawdleryang commented 4 years ago

Got it, much appreciated.

LuoDQ commented 4 years ago

In the E2E-TTS Colab demo, we can load different pretrained text-to-mel models and vocoders, and the text is pronounced correctly. So what mel data did you use to train the vocoders in the Colab: only the ground-truth mel-spectrograms, or mels generated by some specific text-to-mel model?

Besides, it seems the vocoders work with mel-spectrograms generated by different text-to-mel models, even though a vocoder would have been trained on mels from one specific text-to-mel model. I'm curious about that. @kan-bayashi

kan-bayashi commented 4 years ago

@LuoDQ I always use ground-truth mel-spectrograms to train the vocoder, to keep things simple. Of course, we can use mel-spectrograms generated with teacher forcing to improve the quality, but in most cases the ground-truth mel-spectrograms are enough.

LuoDQ commented 4 years ago

@kan-bayashi I noticed that in the E2E-TTS demo there is no normalization of the mels output by the text-to-mel model. Is that correct?

kan-bayashi commented 4 years ago

Yes. The outputs of the Text2Mel model are normalized mel-spectrograms and the inputs of the vocoder are normalized mel-spectrograms, and we use the same statistics for both the Text2Mel and vocoder models.
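To make that concrete, here is a minimal sketch of what the shared normalization looks like, assuming the statistics are available as per-dimension mean and standard-deviation arrays. The file names and layout below are placeholders; the actual stats artifact produced by the recipes may be HDF5 or .npy.

```python
# Sketch of the shared normalization. File names and the (mean, scale) layout are
# assumptions; use whatever stats artifact your recipe actually produced.
import numpy as np

mean = np.load("stats_mean.npy")    # shape: (n_mels,)
scale = np.load("stats_scale.npy")  # shape: (n_mels,)

def normalize(mel):
    """mel: (T, n_mels) raw log-mel spectrogram -> normalized features."""
    return (mel - mean) / scale

def denormalize(mel_norm):
    """Inverse transform, e.g. for inspecting the Text2Mel outputs."""
    return mel_norm * scale + mean

# The Text2Mel model is trained to predict normalize(mel), so its outputs are already
# in the normalized domain and can be fed directly to the vocoder, as long as both
# models were trained with the same mean/scale.
```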

ahmed-fau commented 4 years ago

@kan-bayashi For the pretrained transformer.v3 and PWGAN vocoder in the LJSpeech case, f_min and f_max are 80 and 7600, respectively. Is there a pretrained WaveNet model (e.g. r9y9's) with the same f_min and f_max, so that it can be run as well?

I think the pretrained r9y9 WaveNet vocoder uses f_min=125, which differs from the setting in this demo. I don't know the main reason for the mel hyper-parameter change, but it would be really great if the WaveNet vocoder could also be supported in this nice demo.

I hope there is already something like this in the ESPnet examples.

kan-bayashi commented 4 years ago

You can get it from https://github.com/espnet/espnet#tts-results

ahmed-fau commented 4 years ago

> Yes. The outputs of the Text2Mel model are normalized mel-spectrograms and the inputs of the vocoder are normalized mel-spectrograms, and we use the same statistics for both the Text2Mel and vocoder models.

@kan-bayashi Does this also apply to the pretrained WaveNet vocoder you provided in this link? If so, how can I run a copy-synthesis with this trained vocoder? Do you save the normalization parameters (mean & variance) somewhere?

kan-bayashi commented 4 years ago

> Does this also apply to the pretrained WaveNet vocoder you provided in this link?

Yes, the same manner was used.

> Do you save the normalization parameters (mean & variance) somewhere?

Sorry, the link does not contain the stats parameters, but you can use PWG's since the training data is the same.
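For reference, a rough copy-synthesis sketch under these assumptions: the feature extraction below (log base, FFT/hop sizes, fmin/fmax) must match the recipe that produced the vocoder's training data, the stats file names and checkpoint name are placeholders, and the `load_model`/`inference` calls are from recent versions of this repository (older checkpoints may require constructing the generator and loading its state dict manually).

```python
# Rough copy-synthesis sketch. Feature-extraction parameters, stats file names, and
# the checkpoint path are assumptions; they must match the recipe used for training.
import librosa
import numpy as np
import torch

from parallel_wavegan.utils import load_model

# 1. Extract a log-mel spectrogram from a ground-truth waveform.
wav, sr = librosa.load("sample.wav", sr=22050)
mel = librosa.feature.melspectrogram(
    y=wav, sr=sr, n_fft=1024, hop_length=256, win_length=1024,
    n_mels=80, fmin=80, fmax=7600, power=1.0,
)
log_mel = np.log10(np.maximum(1e-10, mel)).T  # (T, n_mels)

# 2. Normalize with the statistics from the PWG recipe (assumed .npy files).
mean = np.load("stats_mean.npy")
scale = np.load("stats_scale.npy")
mel_norm = (log_mel - mean) / scale

# 3. Run the vocoder on the normalized features.
vocoder = load_model("checkpoint-400000steps.pkl")  # assumed checkpoint name
vocoder.remove_weight_norm()
vocoder.eval()
with torch.no_grad():
    waveform = vocoder.inference(torch.from_numpy(mel_norm).float()).view(-1)
```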

ahmed-fau commented 4 years ago

Hi @kan-bayashi ,

I have tried to use the pretrained transformer.v3 model of ESPnet to create mel features corresponding to an English paragraph, following the same steps illustrated in the Colab notebook. However, the mel gets corrupted after the first sentence of the paragraph, so the provided vocoder (e.g. PWGAN) cannot synthesize the speech waveform. Any idea how to solve this?

kan-bayashi commented 4 years ago

In general, Transformer is less stable when generating long inputs. Please consider splitting the text (just split your input at the periods). Tacotron 2 + attention constraint, or non-AR models (FastSpeech or FastSpeech 2), may also solve this problem.
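A minimal sketch of that workaround, where `text2mel` and `vocoder` are placeholders for your loaded models (the actual inference calls depend on the ESPnet and vocoder versions you use):

```python
# Sketch: synthesize a paragraph sentence by sentence so each Transformer input stays
# short, then concatenate the chunks with a short pause. `text2mel` and `vocoder`
# are placeholder callables standing in for your loaded models.
import numpy as np

def synthesize_paragraph(paragraph, text2mel, vocoder, pause_sec=0.3, sr=22050):
    """Split at periods, synthesize each sentence, and concatenate the waveforms."""
    sentences = [s.strip() for s in paragraph.split(".") if s.strip()]
    pause = np.zeros(int(pause_sec * sr), dtype=np.float32)
    chunks = []
    for sentence in sentences:
        mel = text2mel(sentence + ".")   # hypothetical text-to-mel call
        wav = vocoder(mel)               # hypothetical mel-to-wav call
        chunks.extend([np.asarray(wav, dtype=np.float32), pause])
    return np.concatenate(chunks)
```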