Using the mel spectrogram as the input of the WaveNet vocoder seems to fail #128

Closed mazzzystar closed 5 years ago

mazzzystar commented 5 years ago

Thanks for your work! I fed the TTS mel spectrogram output directly into r9y9's wavenet_vocoder pretrained model in order to get better quality, but it turns out that the WaveNet vocoder works fine on the ground-truth mel spectrogram while performing badly on the TTS mel spectrogram output.

I noticed you've also tried this and ran into a similar situation, and you think it's because the quality of the TTS mel spectrogram is not as high as required. Could you explain that in more detail?

So, to sum up, the most promising vocoder to combine with TTS currently is:

Is that right? I'm still confused about why I can get relatively good synthesized audio from the TTS mel/linear spectrogram with Griffin-Lim, but using the mel spectrogram as the input of the WaveNet vocoder gives worse results. Below are the comparison samples. gl_wv.zip
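
For reference, the Griffin-Lim path I compare against looks roughly like the sketch below. It assumes a denormalized magnitude mel spectrogram and a recent librosa; the parameter values are only placeholders and must match the TTS config.

    import librosa
    import soundfile as sf

    # Assumed audio parameters; they must match the TTS config exactly.
    SR, N_FFT, HOP = 22050, 1024, 256

    def mel_to_wav_griffin_lim(mel, n_iter=60):
        """Invert a (n_mels, T) magnitude mel spectrogram to a waveform with Griffin-Lim."""
        # Approximate the linear STFT magnitude from the mel spectrogram,
        # then iteratively reconstruct the phase.
        stft_mag = librosa.feature.inverse.mel_to_stft(mel, sr=SR, n_fft=N_FFT, power=1.0)
        return librosa.griffinlim(stft_mag, n_iter=n_iter, hop_length=HOP)

    # Round trip on a dummy tone; replace `mel` with the denormalized TTS mel output.
    wav = librosa.tone(440.0, sr=SR, duration=1.0)
    mel = librosa.feature.melspectrogram(y=wav, sr=SR, n_fft=N_FFT, hop_length=HOP,
                                         n_mels=80, power=1.0)
    sf.write("gl_roundtrip.wav", mel_to_wav_griffin_lim(mel), SR)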

erogol commented 5 years ago

@mazzzystar I've never tried WaveNet. My work has been around WaveRNN mostly.

You are right that spectrograms created by the current TTS master might not meet the needs of neural vocoders. The last shared model might have a better chance, though.

If you wait a bit more, I plan to share the model described in #26 soon. It works at least with WaveRNN; that much I can fairly say.

OswaldoBornemann commented 5 years ago

@erogol can't wait to see your achievement.

mazzzystar commented 5 years ago

@erogol I reviewed the code differences between the TTS:master and TTS:dev-taco2 branches, and I wonder whether the reason the TTS mel spectrogram fails on the vocoder is that we optimize both the linear_loss and the mel_loss, as in the code below:

https://github.com/mozilla/TTS/blob/5acc9db4ac95bb014fa04fdeb473c6d8ad09fb23/train.py#L136-L141

If we optimized only the mel_loss, could TTS generate a higher-quality mel spectrogram?
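
In other words, the training step does roughly the following (an illustrative sketch only, not the exact TTS code; the tensor shapes are placeholders):

    import torch
    from torch import nn

    l1 = nn.L1Loss()

    # Dummy shapes: batch=2, 100 frames, 80 mel bins / 513 linear bins.
    mel_output = torch.randn(2, 100, 80, requires_grad=True)
    mel_input = torch.randn(2, 100, 80)
    linear_output = torch.randn(2, 100, 513, requires_grad=True)
    linear_input = torch.randn(2, 100, 513)

    # Both terms are optimized jointly: the decoder is pulled toward the ground-truth
    # mel spectrogram while the post-CBHG branch is pulled toward the linear one.
    mel_loss = l1(mel_output, mel_input)
    linear_loss = l1(linear_output, linear_input)
    loss = mel_loss + linear_loss
    loss.backward()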

mazzzystar commented 5 years ago

There is also another question about the dev-taco2 branch. https://github.com/mozilla/TTS/blob/cf11b6c23c4ac0aa5e5b394656b3e214311c4d8b/train.py#L126-L132 Here you compute the L1 loss of both decoder_output and postnet_output against mel_input at the same time. Can you explain why? As I understand it:

(1) decoder_output, stop_tokens, alignments = self.decoder(encoder_outputs, mel_specs, mask)
(2) postnet_output = self.postnet(decoder_output)
(3) postnet_output = decoder_output + postnet_output

So if we try to minimize L1loss(decoder_output, mel_input) and L1loss(postnet_output, mel_input) at the same time, then the code in line (2) should output something as close to 0 as possible.
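
To make the question concrete, the structure I mean is roughly this (an illustrative sketch with placeholder tensors, not the actual TTS modules):

    import torch
    from torch import nn

    l1 = nn.L1Loss()

    # Placeholder tensors: batch=2, 100 frames, 80 mel bins.
    decoder_output = torch.randn(2, 100, 80, requires_grad=True)    # line (1)
    postnet_residual = torch.randn(2, 100, 80, requires_grad=True)  # stands in for self.postnet(decoder_output), line (2)
    mel_input = torch.randn(2, 100, 80)

    postnet_output = decoder_output + postnet_residual  # line (3)

    # Both the coarse decoder output and the refined postnet output are compared
    # against the same ground truth, which is what my question is about.
    loss = l1(decoder_output, mel_input) + l1(postnet_output, mel_input)
    loss.backward()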

erogol commented 5 years ago

@mazzzystar I tried once to optimize Tacotron for only the mel spectrogram and couldn't get good results. But maybe there is room to investigate further.

@mazzzystar Why do you think line (2) should output 0? My feeling is that the postnet tries to learn fine-grained information that is missing right after the decoder. And if you compare these two outputs, you will also see that there is a significant loss difference between postnet_output and decoder_output at inference time.

mazzzystar commented 5 years ago

The reason I think that is that you calculate the loss on two different outputs against the same ground truth. It would seem more reasonable to me to compare only the postnet output (code line (3)) with the true mel spectrogram, and backpropagate that loss to update the whole encoder, decoder and postnet.

That is to say, if you already know that the output from the decoder is imperfect and will be refined by the postnet, why do you want L1loss(mel_input, decoder_output) to be as close to 0 as possible?

erogol commented 5 years ago

@mazzzystar For Tacotron the reason was that the linear spectrogram has too much redundancy, which makes it harder for the decoder to learn. Therefore it uses mel spectrograms as the decoder output. Then the postnet only needs to learn to project the mel spectrogram to the linear spectrogram, which is possibly an easier task. Also, you can replace the postnet at some point with a better alternative while keeping the rest the same.

For Tacotron2 the idea is similar but not the same. With the decoder we learn a rough spectrogram representation, which also enables the decoder to learn the alignment. Then we train the postnet to learn only the fine details. If we used a single loss function on the ultimate network output, we couldn't force the network to have this kind of modularization. To be more concrete, here are the Tacotron2 outputs for the decoder and the postnet. It is visually clear what I mean.

[Final output image]

[Postnet output image]

mazzzystar commented 5 years ago

Thanks for the clarification, I get some of your points now. So you mean that in Tacotron1 the Decoder tries to output the mel spectrogram, and the PostCBHG only tries to project the mel spectrogram to the linear spectrogram, so we need to make sure that both <mel_output, mel_input> and <linear_output, linear_input> match.

While in Tacotron2, the Decoder is already good at producing the main part of the mel spectrogram (not the linear spectrogram), so the Postnet here only adds some "texture" to get a better mel spectrogram result, right?

If I want to feed the mel output to the vocoder (e.g., WaveNet), it's better to use the final output rather than the Decoder output, right?

erogol commented 5 years ago

@mazzzystar Yep, all of that is true.

For Tacotron2, yes, you should use the final network output, but I haven't tried the linear spectrograms of Tacotron1, so you might give that a try.

mazzzystar commented 5 years ago

@erogol Which kind of vocoder did you use to reach the conclusion that spectrograms created by the current TTS master might not meet the needs of neural vocoders? I recently tried the TTS and Tacotron2 branches with r9y9's wavenet_vocoder, and suddenly realized that I had set different preprocessing parameters for TTS and wavenet_vocoder. Have you experimented with exactly the same parameters for the two models?
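
As a first step I will simply diff the preprocessing parameters of the two configs, along these lines (the file names and keys below are hypothetical and need to be adapted to the real TTS config and wavenet_vocoder hparams):

    import json

    # Hypothetical paths; the real TTS config.json and wavenet_vocoder hparams
    # are structured differently and may not even be JSON.
    tts_cfg = json.load(open("tts_config.json"))
    voc_cfg = json.load(open("vocoder_config.json"))

    # Preprocessing parameters that must agree before the vocoder can consume
    # spectrograms produced by TTS.
    keys = ["sample_rate", "num_mels", "fft_size", "hop_length", "win_length",
            "mel_fmin", "mel_fmax", "min_level_db", "ref_level_db"]
    for k in keys:
        if tts_cfg.get(k) != voc_cfg.get(k):
            print(f"mismatch on {k}: TTS={tts_cfg.get(k)}, vocoder={voc_cfg.get(k)}")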

erogol commented 5 years ago

@mazzzystar Yes, exactly the same parameters. My vocoder fork is here: https://github.com/erogol/WaveRNN. You can pass the config used by TTS to WaveRNN and it works.

mazzzystar commented 5 years ago

I will try to match the parameters of TTS/Tacotron2 with r9y9's wavenet_vocoder and report my results with the WaveNet vocoder.

OswaldoBornemann commented 5 years ago

@erogol Could WaveRNN do a great job on 48 kHz wavs?

erogol commented 5 years ago

@tsungruihon 16-22 kHz would be enough. I've never tried 48 kHz.

m-hamza-mughal commented 4 years ago

@mazzzystar Can you share your findings with TTS/Tacotron2 and the WaveNet vocoder?