I have a question about the quality of Fastspeech2 output compared to Glow TTS. I am currently using Glow TTS generated mels with a HifiGan vocoder, and the quality is good, but there is room for improvement in prosody. Tacotron2 does better in this regard, but it has a high inference time and performs poorly as input sentence length increases. Fastspeech2's inference is faster than Glow TTS's, but since the TTS model's share of total inference time is small compared to the vocoder's, the speed gain matters little to me. I am more interested in whether Fastspeech2 would improve quality in terms of intonation, pauses, and stress in the output sentences. Has anyone here trained both Glow TTS and Fastspeech2 and compared them?