ming024 / FastSpeech2

An implementation of Microsoft's "FastSpeech 2: Fast and High-Quality End-to-End Text to Speech"
MIT License

Some confusion in your visualization TensorBoard #4

Closed: v-nhandt21 closed this issue 4 years ago

v-nhandt21 commented 4 years ago

When I train the FastPitch model from NVIDIA's source code, I get images similar to yours. My training set has 11239 samples and my validation set has 1000, and I have noticed that the training and validation curves separate more and more over time. That does not seem normal. Do your models actually output intelligible speech? I am quite confused. Thank you for sharing the code <3

ming024 commented 4 years ago

I am not sure what you mean by "Are your models really output a speech or not?" The output of my FastSpeech2 implementation is a mel-spectrogram, which can be converted to waveform files by vocoders such as WaveGlow and MelGAN.
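
For reference, here is a minimal sketch of that mel-to-waveform step using NVIDIA's pretrained WaveGlow from torch.hub. The mel tensor below is a random placeholder, the 80-bin shape is an assumption, and the exact normalization this repo uses may differ, so treat it as an illustration rather than a drop-in snippet:

```python
import torch

# NVIDIA's published torch.hub entry point for a pretrained WaveGlow vocoder.
waveglow = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub',
                          'nvidia_waveglow', model_math='fp32')
waveglow = waveglow.remove_weightnorm(waveglow)
waveglow = waveglow.to('cuda').eval()  # NVIDIA's example assumes a GPU

# Placeholder mel-spectrogram; in practice this is the (1, n_mels, frames)
# output of FastSpeech2.
mel = torch.randn(1, 80, 200, device='cuda')

with torch.no_grad():
    audio = waveglow.infer(mel)  # (1, num_samples) waveform tensor
```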

If you are asking why there is a large gap between the training and validation mel_loss and mel_postnet_loss curves, it is because in evaluate.py the model synthesizes mel-spectrograms without ground-truth F0 and energy labels.

https://github.com/ming024/FastSpeech2/blob/172d2ea9a03b8dc6751388bacc26c15c801cbb4d/evaluate.py#L63-L64
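
To make the teacher-forcing distinction concrete, here is a minimal sketch of the idea (the class and argument names are illustrative, not this repo's exact API): during training the ground-truth pitch and energy targets condition the decoder input, while during evaluation the model must rely on its own predictions, so their errors propagate into the mel losses:

```python
import torch
import torch.nn as nn

class VarianceAdaptorSketch(nn.Module):
    """Illustrative only: conditions the hidden sequence on pitch/energy,
    using ground-truth values when provided (training) and the model's own
    predictions otherwise (evaluation/inference)."""

    def __init__(self, hidden: int = 256):
        super().__init__()
        self.pitch_predictor = nn.Linear(hidden, 1)
        self.energy_predictor = nn.Linear(hidden, 1)
        self.pitch_embedding = nn.Linear(1, hidden)
        self.energy_embedding = nn.Linear(1, hidden)

    def forward(self, x, p_target=None, e_target=None):
        p_pred = self.pitch_predictor(x)   # (B, T, 1)
        e_pred = self.energy_predictor(x)  # (B, T, 1)
        # Teacher forcing: use ground-truth F0/energy when labels are given.
        p = p_target if p_target is not None else p_pred
        e = e_target if e_target is not None else e_pred
        x = x + self.pitch_embedding(p) + self.energy_embedding(e)
        return x, p_pred, e_pred
```

With p_target and e_target set to None, as in the evaluation path linked above, any error in the predicted F0 and energy feeds into the synthesized mel-spectrogram, which is why the validation mel losses sit above the training ones even when the model is learning well.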

ming024 closed this issue 4 years ago