mozilla / TTS

:robot: :speech_balloon: Deep learning for Text to Speech (Discussion forum: https://discourse.mozilla.org/c/tts)
Mozilla Public License 2.0
9.26k stars · 1.24k forks

Why are Tacotron2 mels sometimes negative? #339

Closed vcjob closed 4 years ago

vcjob commented 4 years ago

Hello everyone!

I noticed that the mel spectrogram generated by the Tacotron model (obtained with

waveform, alignment, decoder_outputs, postnet_output, stop_tokens = synthesis(
        model, text, C, use_cuda, ap, False, C.enable_eos_bos_chars)
vocoder_input = torch.FloatTensor(postnet_output.T).unsqueeze(0)

where `vocoder_input` is, I suppose, the mel spectrogram) sometimes (in fact, always) has negative values (very small ones, like -0.037 or so). If we compute the ground-truth spectrogram with

ap = AudioProcessor(**CONFIG.audio)
mel = ap.melspectrogram(wav).astype(np.float32)

we always get positive values, though. So if we compare the ground-truth mels with the ones generated by Tacotron2, the ground-truth values are always a bit bigger, e.g. 0.550 compared to 0.450, and so on. So here is my question: why is that? How can I get around this? I want to train MelGAN and generate the mels NOT with Tacotron2 but with AudioProcessor. Is this the right way to go? I am worried about those negative values, since they are not going to be present in the training set for MelGAN.
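A quick way to make this comparison concrete is to print the min/max of both arrays instead of eyeballing them. A minimal sketch (the arrays here are illustrative stand-ins for `postnet_output` and the AudioProcessor mel, not taken from a real run):

```python
import numpy as np

# Stand-in arrays for the two spectrograms discussed above; the shapes and
# values are illustrative only.
tts_mel = np.array([[-0.037, 0.45, 0.20]], dtype=np.float32)  # Tacotron2 output
gt_mel = np.array([[0.010, 0.55, 0.30]], dtype=np.float32)    # AudioProcessor output

def value_range(mel):
    """Return (min, max) of a spectrogram so ranges can be compared at a glance."""
    return float(mel.min()), float(mel.max())

print("TTS mel range:", value_range(tts_mel))  # min dips below zero
print("GT  mel range:", value_range(gt_mel))   # min stays positive
```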

Thank you!

erogol commented 4 years ago

How do you compare the values? Do you check the values on TensorBoard? Also, please format your question and put your code in a code segment like below.

print("This is python and I am not lazy")
vcjob commented 4 years ago

> How do you compare the values? Do you check the values on TensorBoard? Also, please format your question and put your code in a code segment like below.
>
> print("This is python and I am not lazy")

Actually, I just glance at the ground truth and at the one generated by Tacotron2. I saw that they follow the same pattern (see picture below). They don't differ by a constant value, but then Tacotron2 doesn't have to be 100% faithful to the ground truth. However, it has values below 0, while the ground truth never does. Also, the Tacotron2 output is a bit longer, which is also fine.

For Tacotron2 I provided exactly the same text that is pronounced in the ground-truth file. (image attached)

erogol commented 4 years ago

Interesting find. Could you explain to me how you computed sample.npy? I understand that it is an output of Tacotron2, but the output of which function exactly, so that I can debug the problem better?

vcjob commented 4 years ago

> Interesting find. Could you explain to me how you computed sample.npy? I understand that it is an output of Tacotron2, but the output of which function exactly, so that I can debug the problem better?

Sure. What I did was just save the vocoder input from synthesize.py as a numpy array (see picture; I decided to attach a picture instead of the code since it seems clearer to me). (image attached)

erogol commented 4 years ago

It has negatives since the mel spec is normalized into a range. The TTS model does not use raw specs: it first converts them to decibels and then normalizes them into a certain range (the default config is -4 to 4).
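The normalization erogol describes can be sketched as follows. This is a simplified approximation, not the repo's exact code: the real constants live in the `audio` section of the config, and the names `MIN_LEVEL_DB` / `MAX_NORM` here are illustrative. The point is that after mapping the dB spectrogram symmetrically onto [-4, 4], quiet frames land below zero, which is exactly where the negative values come from.

```python
import numpy as np

MIN_LEVEL_DB = -100.0  # assumed dB floor used when converting amplitude to dB
MAX_NORM = 4.0         # symmetric target range: values end up in [-4, 4]

def amp_to_db(x):
    # Clamp to avoid log(0); 1e-5 corresponds to the -100 dB floor above.
    return 20.0 * np.log10(np.maximum(1e-5, x))

def normalize(spec_db):
    # Shift the dB spectrogram to [0, 1] relative to the dB floor...
    s = (spec_db - MIN_LEVEL_DB) / -MIN_LEVEL_DB
    # ...then map [0, 1] onto [-MAX_NORM, MAX_NORM]; quiet frames become negative.
    return 2.0 * MAX_NORM * s - MAX_NORM

quiet = normalize(amp_to_db(np.array([1e-5])))  # silence maps to about -4
loud = normalize(amp_to_db(np.array([1.0])))    # full scale maps to about +4
print(quiet, loud)
```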

vcjob commented 4 years ago

> It has negatives since the mel spec is normalized into a range. The TTS model does not use raw specs: it first converts them to decibels and then normalizes them into a certain range (the default config is -4 to 4).

I see. So, to train WaveRNN or another vocoder, it's better to use mels from the trained TTS model (`vocoder_input` in the code above)? But in that case, which audio should I use as the ground truth? The one generated by TTS? It has low quality, which is why I'm trying to use WaveRNN. The one from the TTS dataset? But as I already mentioned, if we compare the mel spectrograms, the one generated by TTS is longer (so it is presumably going to generate slightly longer audio).

Immortalin commented 4 years ago

What about MelGAN? Is the default mel suitable for use with Lyrebird's MelGAN vocoder?

erogol commented 4 years ago

@vcjob I'd suggest using TTS outputs to train WaveRNN.
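One simple way to handle the length mismatch vcjob raises when pairing TTS-generated mels with ground-truth audio is to trim both to a consistent number of frames before training the vocoder. This is a generic sketch under stated assumptions, not code from this repo: `pair_for_vocoder` is a hypothetical helper, and `hop_length` must match the audio config actually in use.

```python
import numpy as np

def pair_for_vocoder(tts_mel, gt_wav, hop_length=256):
    """Trim a (n_mels, T) TTS spectrogram and a ground-truth waveform to a
    consistent frame count so they can serve as a (conditioning, target)
    pair for vocoder training. Assumes one mel frame per hop_length samples."""
    n_frames = min(tts_mel.shape[1], len(gt_wav) // hop_length)
    return tts_mel[:, :n_frames], gt_wav[: n_frames * hop_length]

# Toy example: the TTS mel is longer (105 frames) than the audio (100 frames' worth).
mel = np.zeros((80, 105), dtype=np.float32)
wav = np.zeros(100 * 256, dtype=np.float32)
mel_t, wav_t = pair_for_vocoder(mel, wav)
print(mel_t.shape, wav_t.shape)  # (80, 100) (25600,)
```

Trimming from the end assumes the extra TTS frames are trailing (e.g. padding or a late stop token); if the two sequences drift in the middle, a proper alignment would be needed instead.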

Immortalin commented 4 years ago

@erogol any suggestions for MelGan?

erogol commented 4 years ago

@Immortalin I haven't tried it yet.