Open hertz-pj opened 2 years ago
Hi @hertz-pj, good point. I would say it depends on the purpose. For example, you'd choose FastSpeech2 if you need fast and safe performance. Go with DiffSpeech if you want randomness and non-metallic speech in the output. If you care about both speed and randomness, PortaSpeech may suit you.
@hertz-pj This is old, but I'm putting this out there in case someone is searching for a comparison.
If you want to compare inference only, you can simply download the pretrained models and run inference (even better if they are hosted on HuggingFace, where you can try them directly).
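For the speed side of that comparison, here is a minimal sketch of how one might time the models against each other. It is not taken from any of these repos: `real_time_factor` and `rank_by_speed` are hypothetical helper names, and each `synthesize` callable is an assumed wrapper you would write yourself around whichever pretrained model you downloaded (text in, sequence of waveform samples out).

```python
import time
from typing import Callable, Dict, List, Sequence, Tuple

# Assumption: each model is wrapped in a function that takes a text string
# and returns the generated waveform as a sequence of samples.
Synthesizer = Callable[[str], Sequence[float]]


def real_time_factor(
    synthesize: Synthesizer,
    texts: Sequence[str],
    sample_rate: int,
    timer: Callable[[], float] = time.perf_counter,
) -> float:
    """Seconds of compute per second of generated audio (lower is faster)."""
    total_time = 0.0
    total_audio = 0.0
    for text in texts:
        start = timer()
        wav = synthesize(text)
        total_time += timer() - start
        total_audio += len(wav) / sample_rate
    if total_audio == 0:
        raise ValueError("no audio was produced")
    return total_time / total_audio


def rank_by_speed(
    models: Dict[str, Synthesizer],
    texts: Sequence[str],
    sample_rate: int = 22050,
    timer: Callable[[], float] = time.perf_counter,
) -> List[Tuple[str, float]]:
    """Return (model name, real-time factor) pairs, fastest first."""
    scores = {
        name: real_time_factor(fn, texts, sample_rate, timer)
        for name, fn in models.items()
    }
    return sorted(scores.items(), key=lambda item: item[1])
```

You would call `rank_by_speed` with a dict mapping names such as "FastSpeech2" or "PortaSpeech" to your own wrappers, using the same sentences for every model. This only measures speed; quality and prosody still need listening tests.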
For training, I haven't trained DiffSpeech, but FastSpeech2 trains 5-10x faster for comparable audio quality. FS2 takes under 2 hours on a single RTX 3090 to produce fully intelligible speech. However, PortaSpeech has more prosody variation.
In your experience, how do these models rank in terms of output quality?