@mazzzystar I've never tried WaveNet. My work has been around WaveRNN mostly.
You are right that spectrograms created by the current TTS master might not meet the needs of neural vocoders. Maybe the last shared model might have a chance.
If you wait a bit more, I plan to share the model described in #26 soon. It works at least with WaveRNN; that's all I can fairly say.
@erogol can't wait to see your achievement.
@erogol
I reviewed the code difference between the TTS `master` and `dev-taco2` branches, and wonder whether the reason TTS mel-spectrograms fail on the vocoder is that we optimize both the `linear_loss` and the `mel_loss`, as in the code below?
https://github.com/mozilla/TTS/blob/5acc9db4ac95bb014fa04fdeb473c6d8ad09fb23/train.py#L136-L141
If we only optimize the `mel_loss`, could it be possible for TTS to generate high-quality mel-spectrograms?
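To make the question concrete, here is a rough sketch of the two objectives I mean (toy tensors and made-up shapes, not the actual `train.py` code):

```python
import torch
import torch.nn.functional as F

# Toy illustration only: master currently optimizes a mel loss *and* a linear loss
# together; would a mel-only objective produce cleaner mels for a neural vocoder?
mel_input = torch.randn(4, 80, 200)        # (batch, n_mels, frames), made up
linear_input = torch.randn(4, 513, 200)    # (batch, n_fft // 2 + 1, frames), made up
mel_output = torch.randn_like(mel_input).requires_grad_()
linear_output = torch.randn_like(linear_input).requires_grad_()

loss_both = F.l1_loss(mel_output, mel_input) + F.l1_loss(linear_output, linear_input)  # current
loss_mel_only = F.l1_loss(mel_output, mel_input)                                       # proposed
```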
And there is another question about the `dev-taco2` branch.
https://github.com/mozilla/TTS/blob/cf11b6c23c4ac0aa5e5b394656b3e214311c4d8b/train.py#L126-L132
Here you compute the L1 loss for both `decoder_output` and `postnet_output` against `mel_input` at the same time. Can you explain why? The way I understand it:
```python
# (1)
decoder_output, stop_tokens, alignments = self.decoder(encoder_outputs, mel_specs, mask)
# (2)
postnet_output = self.postnet(decoder_output)
# (3)
postnet_output = decoder_output + postnet_output
```
So if we try to minimize `L1loss(decoder_output, mel_input)` and `L1loss(postnet_output, mel_input)` at the same time, then the output of the code in line (2) should be pushed as close as possible to 0.
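A toy sketch of the reasoning (made-up tensors, not the actual TTS code): since `postnet_output = decoder_output + residual`, driving both L1 terms toward zero also drives the residual produced in line (2) toward zero.

```python
import torch
import torch.nn.functional as F

# Toy tensors, not the actual TTS code.
mel_input = torch.randn(4, 80, 200)                              # ground-truth mel
decoder_output = mel_input + 0.1 * torch.randn_like(mel_input)   # imperfect decoder mel
residual = torch.zeros_like(mel_input).requires_grad_()          # stands in for postnet(decoder_output)
postnet_output = decoder_output + residual                       # line (3)

# Minimizing both terms pushes decoder_output toward mel_input, which in turn
# leaves (almost) nothing for the postnet residual to add.
loss = F.l1_loss(decoder_output, mel_input) + F.l1_loss(postnet_output, mel_input)
```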
@mazzzystar I've tried once to optimize Tacotron for only the mel-spectrogram and couldn't get good results. But maybe there is room to investigate further.
@mazzzystar Why do you think line (2) should output 0? My feeling is that the postnet tries to learn fine-grained information that is missing right after the decoder. And if you compare these two outputs, you also see that there is an important loss difference between `postnet_output` and `decoder_output` at inference time.
The reason I think that is, you calculate the loss on two different outputs against the same ground truth. To me it would be more reasonable to compare only the `postnet` output (code line (3)) with the true mel-spectrogram, and backpropagate that loss to update the whole `encoder`, `decoder` and `postnet`.
That is to say, if you already know that the output from the `decoder` is imperfect and will be refined by the `postnet`, then why do you want `L1loss(mel_input, decoder_output)` to be as close as possible to 0?
@mazzzystar for Tacotron the reason was that the linear output has too much redundancy, which makes it harder for the decoder to learn. Therefore it uses mel-spectrograms as the decoder output. Then the postnet only needs to learn to project mel-specs to linear, which is possibly an easier task. Also, you can swap the postnet for a better alternative at some point while keeping the rest the same.
For Tacotron2 the idea is similar but not the same. With the decoder we learn a rough spectrogram representation, which also lets the decoder learn the alignment. Then we train the postnet to learn only the fine details. If we used a single loss function on the ultimate network output, we couldn't force the network to have this kind of modularization. To be more concrete, here are the Tacotron2 outputs for the decoder and the postnet. It is visually clear what I mean.
(image: final output)
(image: postnet output)
Thanks for the clarification, I now get some of your points.
So you mean that in Tacotron1, the `Decoder` tries to output the mel-spectrogram, and the `PostCBHG` only tries to project the mel-spectrogram to a linear spectrogram, so we need to make sure both <mel_output, mel_input> and <linear_output, linear_input> match.
While in Tacotron2, the `Decoder` is already good at getting the main part of the mel-spectrogram (not the linear spectrogram), so the `Postnet` here only adds some "texture" to get a better mel-spectrogram result, right?
If I want to feed the mel output to the vocoder (e.g. WaveNet), it's better to use the final output rather than the `Decoder` output, right?
@mazzzystar yep all is true.
For Tacotron2, yes, you should use the final network output. But I've not tried the linear specs of Tacotron1, so you might give it a try.
@erogol
Which vocoder did you use to reach the conclusion that "spectrograms created by the current TTS master might not meet the needs of neural vocoders"? I recently tried the TTS and Tacotron2 branches with r9y9's wavenet_vocoder, and then realized that I had set different preprocessing parameters for TTS and wavenet_vocoder. Have you experimented with exactly the same parameters for both models?
@mazzzystar Yes, exactly the same parameters. My vocoder fork is here: https://github.com/erogol/WaveRNN. You can pass the config used by TTS to WaveRNN and it works.
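Something like this is enough to check that the two configs agree on the audio parameters (key names and file paths below are only examples, not necessarily the exact keys either repo uses):

```python
import json

# Audio/preprocessing keys worth comparing; adjust to whatever the two configs actually use.
AUDIO_KEYS = ["sample_rate", "num_mels", "fft_size", "hop_length", "win_length",
              "preemphasis", "min_level_db", "ref_level_db"]

def diff_audio_params(tts_cfg: dict, vocoder_cfg: dict) -> dict:
    """Return {key: (tts_value, vocoder_value)} for every key that disagrees."""
    return {k: (tts_cfg.get(k), vocoder_cfg.get(k))
            for k in AUDIO_KEYS
            if tts_cfg.get(k) != vocoder_cfg.get(k)}

if __name__ == "__main__":
    with open("tts_config.json") as f:        # example path
        tts_cfg = json.load(f)
    with open("vocoder_config.json") as f:    # example path
        vocoder_cfg = json.load(f)
    print(diff_audio_params(tts_cfg, vocoder_cfg))
```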
I will try to match the parameters of TTS/Tacotron2 with r9y9's wavenet_vocoder and report my results with the WaveNet vocoder.
@erogol Could WaveRNN do a good job on 48 kHz wavs?
@tsungruihon 16-22 kHz would be enough. Never tried 48K
@mazzzystar can you share your findings with TTS/Tacotron2 and the WaveNet vocoder?
Thanks for your work! I used the `tts` mel-spectrogram output directly as the input to r9y9's pretrained wavenet_vocoder model in order to get better quality, but it turns out that the WaveNet vocoder works fine on ground-truth mel-spectrograms yet performs badly on the `tts` mel-spectrogram output.
I noticed you've also tried this and met a similar situation, and you think it's because the quality of the `tts` mel-spectrogram is not as high as required; could you please explain that in more detail?
So as a conclusion, currently the most promising vocoder to combine with `tts` is:
Is that right? I'm still confused why I can get relatively good synthesized audio from the `tts` mel/linear spectrogram with GL, but using the mel-spectrogram as the input to the WaveNet vocoder gives worse results. Below are the comparison samples. gl_wv.zip
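One thing I still want to rule out (just a debugging sketch, the file paths are examples): whether the `tts` mel output is in the same value range / normalization as the ground-truth mels the pretrained wavenet_vocoder expects (e.g. dB-scaled vs. normalized to [0, 1]).

```python
import numpy as np

def describe_mel(name: str, mel: np.ndarray) -> None:
    # Print basic statistics of a mel-spectrogram to spot range/normalization mismatches.
    print(f"{name}: shape={mel.shape}, min={mel.min():.3f}, "
          f"max={mel.max():.3f}, mean={mel.mean():.3f}")

tts_mel = np.load("tts_mel.npy")              # example path: mel predicted by TTS
gt_mel = np.load("ground_truth_mel.npy")      # example path: mel from the vocoder's own preprocessing
describe_mel("TTS mel", tts_mel)
describe_mel("ground-truth mel", gt_mel)
```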