Open swapsmagic opened 2 months ago
That's right, the disadvantage is too big.
Please see my attempt at alleviating this issue in this PR: https://github.com/huggingface/parler-tts/pull/110
The idea is to "prefix" the TTS with some audio so that it mimics that audio's prosody as it generates more audio.
(edit) I've also added a notebook in the PR to demonstrate how you could do this.
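Separately from the PR, here is a minimal baseline sketch using only the standard Parler-TTS inference calls: reuse one fixed description for every text and reset the random seed before each generation. This is not the audio-prefix approach from the PR, and it may only reduce variation between outputs rather than guarantee the same voice. The function name `synthesize_all`, the default seed, and the checkpoint name are assumptions chosen for illustration.

```python
def synthesize_all(description, texts, seed=0,
                   model_name="parler-tts/parler-tts-mini-v1"):
    """Generate one clip per text, reusing the same description and seed.

    Hypothetical helper for illustration; the checkpoint name is an assumption.
    """
    # Heavy imports are kept local so the module can be imported without them.
    import torch
    from transformers import AutoTokenizer, set_seed
    from parler_tts import ParlerTTSForConditionalGeneration

    device = "cuda:0" if torch.cuda.is_available() else "cpu"
    model = ParlerTTSForConditionalGeneration.from_pretrained(model_name).to(device)
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    # Tokenize the voice description once and reuse it for every text.
    input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)

    clips = []
    for text in texts:
        # Reset the sampling RNG so each generation starts from the same state.
        set_seed(seed)
        prompt_input_ids = tokenizer(text, return_tensors="pt").input_ids.to(device)
        audio = model.generate(input_ids=input_ids,
                               prompt_input_ids=prompt_input_ids)
        clips.append(audio.cpu().numpy().squeeze())

    return clips, model.config.sampling_rate
```

Calling `synthesize_all(description, ["First sentence.", "Second sentence."])` returns a list of waveforms plus the sampling rate, which can then be written out with a library such as `soundfile`.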
Tried this description:
Laura speaks slightly faster than normal with slightly expressive monotone voice with a hint of excitement.
with different text, and the voice is drastically different. How can this be fixed? Is there a specific technique that helps keep the voice consistent?