huggingface / parler-tts

Inference and training library for high-quality TTS models.
Apache License 2.0

Speaker voice is not consistent across different generations #112

Open swapsmagic opened 2 months ago

swapsmagic commented 2 months ago

I tried the description "Laura speaks slightly faster than normal with slightly expressive monotone voice with a hint of excitement." with different text, and the voice is drastically different between generations. How can this be fixed? Is there a specific technique that helps keep the voice consistent?

jdola commented 2 months ago

That's right, this is a major drawback.

Guppy16 commented 2 months ago

Please see my attempt at alleviating this issue in this PR: https://github.com/huggingface/parler-tts/pull/110

The idea is to "prefix" the TTS with some audio so that it mimics the prosody of that audio as it generates more audio.

(edit) I've also added a notebook in the PR to demonstrate how you could do this.
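
For readers who want the shape of the prefixing idea without opening the PR, here is a rough, model-free sketch. It is not the code from the PR: `generate_with_audio_prefix`, `generate_fn`, and `toy_generate` are hypothetical names made up for illustration, and the audio is represented as a flat list of integer tokens, whereas a real codec would produce several codebooks per frame. The assumption is that the generator accepts a description, a text prompt, and some already-known audio tokens, and returns those tokens followed by its continuation.

```python
from typing import Callable, List, Sequence


def generate_with_audio_prefix(
    generate_fn: Callable[[str, str, Sequence[int]], List[int]],
    description: str,
    prefix_text: str,
    prefix_tokens: Sequence[int],
    new_text: str,
) -> List[int]:
    """Continue generation from a fixed audio prefix; return only the new tokens.

    generate_fn is a hypothetical callable (description, text, prefix_tokens)
    that returns the prefix tokens followed by newly generated tokens.
    """
    # The text prompt has to cover the prefix audio as well as the new sentence,
    # so the audio the decoder starts from matches the start of its transcript.
    full_text = f"{prefix_text.strip()} {new_text.strip()}"

    output = list(generate_fn(description, full_text, prefix_tokens))

    # Sanity check on the assumed contract: the prefix must come back unchanged.
    n_prefix = len(prefix_tokens)
    if output[:n_prefix] != list(prefix_tokens):
        raise ValueError("generate_fn did not preserve the audio prefix")

    # Drop the prefix so only the audio for new_text is kept.
    return output[n_prefix:]


def toy_generate(description: str, text: str, prefix_tokens: Sequence[int]) -> List[int]:
    """Stand-in for a real model: echoes the prefix, then emits one fake token per word."""
    return list(prefix_tokens) + [1000 + i for i in range(len(text.split()))]


new_tokens = generate_with_audio_prefix(
    toy_generate,
    description="Laura speaks slightly faster than normal.",
    prefix_text="Hello there.",
    prefix_tokens=[1, 2, 3],
    new_text="How are you?",
)
print(new_tokens)
```

Reusing the same prefix audio and prefix transcript for every new sentence is what would tie the generations to one voice under this scheme. How well that works with the actual model is something the notebook in the PR is better placed to show.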