Closed vcjob closed 4 years ago
How do you compare the values? Do you check values on tensorboard? Also please format your question and put your code in code segment like below.
print("This is python and I am not lazy")
Actually, I just glanced at the ground truth and at the one generated by Tacotron2, and saw that they follow the same pattern (see picture below). They don't differ by a constant offset, and Tacotron2 doesn't have to be 100% faithful to the ground truth. But it has values below 0, while the ground truth never has those. The Tacotron2 output is also a bit longer, which is fine.
For Tacotron2 I provided exactly the same text that is pronounced in the ground-truth file.
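For reference, the eyeball comparison above can also be done programmatically. This is a minimal sketch, assuming both spectrograms are saved as `(n_mels, n_frames)` numpy arrays; `compare_mels` is a hypothetical helper, not part of the TTS codebase:

```python
import numpy as np

def compare_mels(gt_mel, gen_mel):
    """Return shape, value-range, and error stats for two mel spectrograms.

    Both inputs are expected as (n_mels, n_frames) arrays; the generated
    mel may have a few extra frames, so the error is computed over the
    overlapping frames only.
    """
    n = min(gt_mel.shape[-1], gen_mel.shape[-1])
    return {
        "gt_shape": gt_mel.shape,
        "gen_shape": gen_mel.shape,
        "gt_range": (float(gt_mel.min()), float(gt_mel.max())),
        "gen_range": (float(gen_mel.min()), float(gen_mel.max())),
        "mae": float(np.abs(gt_mel[..., :n] - gen_mel[..., :n]).mean()),
    }

# Usage (file names are placeholders for your own saved spectrograms):
# stats = compare_mels(np.load("ground_truth.npy"), np.load("sample.npy"))
# print(stats)
```

A negative lower bound in `gen_range` with a non-negative `gt_range` would confirm the observation above.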
Interesting find. Could you explain to me how you computed sample.npy? I understand that it is an output of Tacotron2, but the output of which function exactly, so that I can debug the problem better?
Sure. I just saved the vocoder input from synthesize.py as a numpy array (see picture; I decided to attach a picture instead of the code since it seems clearer to me).
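For anyone who prefers code to the screenshot, this is a sketch of the idea: dump the tensor that synthesize.py is about to feed to the vocoder. `save_vocoder_input` is a hypothetical helper, and it assumes `vocoder_input` is either a torch tensor or a numpy array:

```python
import numpy as np

def save_vocoder_input(vocoder_input, path="sample.npy"):
    """Dump the mel spectrogram fed to the vocoder for offline inspection.

    Accepts a torch tensor (detached and moved to CPU first) or any
    array-like, and writes it to `path` as a .npy file.
    """
    if hasattr(vocoder_input, "detach"):  # looks like a torch tensor
        vocoder_input = vocoder_input.detach().cpu().numpy()
    np.save(path, np.asarray(vocoder_input))

# Usage inside synthesize.py, right before the vocoder call (assumed name):
# save_vocoder_input(vocoder_input, "sample.npy")
```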
It has negatives since the mel spec is normalized into a range, so the TTS model does not use raw specs. It first converts them to decibels and then normalizes them into a certain range (the default config is -4 to 4).
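As a rough illustration of that pipeline, here is a minimal sketch of an amplitude-to-dB conversion followed by a symmetric normalization into [-4, 4]. The -100 dB floor and the exact clipping are assumptions for illustration; the real AudioProcessor has more knobs (e.g. a reference level offset), so treat this as the shape of the transform, not its implementation:

```python
import numpy as np

MIN_LEVEL_DB = -100.0  # assumed dB floor, common in TTS configs
MAX_NORM = 4.0         # the [-4, 4] range mentioned above

def normalize_mel(mel_amp):
    """Map raw mel amplitudes to a symmetric [-MAX_NORM, MAX_NORM] range.

    Steps: amplitude -> decibels, shift/scale by the dB floor into [0, 1],
    then stretch into the symmetric range the model actually trains on.
    """
    mel_db = 20.0 * np.log10(np.maximum(1e-5, mel_amp))         # amp -> dB
    s = np.clip((mel_db - MIN_LEVEL_DB) / -MIN_LEVEL_DB, 0, 1)  # -> [0, 1]
    return 2 * MAX_NORM * s - MAX_NORM                          # -> [-4, 4]
```

This is why model outputs live in a signed range even though raw mel magnitudes are non-negative.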
I see. So, to train WaveRNN or another vocoder, it's better to use mels from the trained TTS model (vocoder_input in the code above)? But in that case, what audio should I use as the ground truth? The one generated by TTS? It has low quality, which is why I'm trying to use WaveRNN in the first place. The one from the TTS dataset? But as I already mentioned, if we compare the mel spectrograms, the one generated by TTS is longer (so it would be expected to produce slightly longer audio).
What about MelGAN? Is the default mel suitable for use with Lyrebird's MelGAN vocoder?
@vcjob I'd suggest to use TTS outputs to train WaveRNN
@erogol any suggestions for MelGan?
@Immortalin haven't tried yet
Hello everyone!
I noticed that the mel spectrogram generated by the Tacotron model (obtained with vocoder_input, which is, as I suppose, the mel spectrogram) sometimes (actually, always) has negative values (very small ones, like -0.037 or so). If we compute the ground-truth spectrogram with the AudioProcessor, we always get positive values, though. So if we compare ground-truth mels with the ones generated by Tacotron2, the ground truth's values are always a bit bigger, like 0.550 compared to 0.450, and so on. So here's my question: why is that? Can I get around this? I want to train MelGAN and generate the mels NOT with Tacotron2 but with the AudioProcessor. Is that the right way to go? I am worried about those negative values, as they are not going to be present in the training set for MelGAN.
Thank you!