Hi, sorry if I have misunderstood something. But PFlow-TTS and E2-TTS both seem to share the same structure: a text encoder, a duration predictor, and conditional flow matching. Why would E2-TTS be better? In my previous experiments, all models with a DP [duration predictor based on TextEncoder outputs] were poor at prosody and naturalness.