rhasspy / piper

A fast, local neural text to speech system
https://rhasspy.github.io/piper-samples/
MIT License

Increasing the realism of Piper's TTS #204

Open talkbotintegrations opened 1 year ago

talkbotintegrations commented 1 year ago

Hi! This is a fantastic project and opens many new doors. Would you say it's possible to incorporate different emotions in speech synthesis by swapping between different emotion models of a trained voice? In a personal project of my own, I was thinking of pairing this with an LLM to create stop points in the text, indicating when to switch the emotional model to match the intention of the text.

If there's a more built-in way to achieve this that you might have planned, I'd love to hear it! Piper is amazing, but it currently lacks a way to insert pauses or delays, and to change the emotion of what's being said.

Lastly, is there any foreseeable way to incorporate non-verbal utterances in the synthesized audio? I'm hoping for something similar to Suno's Bark, but with the speedy inference of Piper.
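
To make the emotion-switching and pause idea concrete, here is a minimal sketch of the text-splitting step only. Everything in it is an assumption for illustration: the `[emotion:NAME]` and `[pause:SECONDS]` tags are a hypothetical markup that an LLM could be prompted to emit, not something Piper understands, and `parse_markup`, `Segment`, and `silence_pcm` are made-up names. The silence helper assumes 16-bit mono PCM at 22050 Hz; check the sample rate in each voice's config before relying on that value.

```python
import re
from dataclasses import dataclass

# Hypothetical markup (not part of Piper):
#   [emotion:NAME]   switch to the voice model trained for that emotion
#   [pause:SECONDS]  insert that many seconds of silence
TAG = re.compile(r"\[(emotion|pause):([^\]]+)\]")


@dataclass
class Segment:
    kind: str               # "speech" or "pause"
    emotion: str = "neutral"
    text: str = ""
    seconds: float = 0.0


def parse_markup(marked_text, default_emotion="neutral"):
    """Split tagged text into speech and pause segments, in order."""
    segments = []
    emotion = default_emotion
    pos = 0
    for match in TAG.finditer(marked_text):
        # Text before this tag is spoken with the emotion currently active.
        chunk = marked_text[pos:match.start()].strip()
        if chunk:
            segments.append(Segment("speech", emotion, chunk))
        if match.group(1) == "emotion":
            emotion = match.group(2).strip()
        else:
            segments.append(
                Segment("pause", emotion, seconds=float(match.group(2)))
            )
        pos = match.end()
    # Whatever follows the last tag.
    tail = marked_text[pos:].strip()
    if tail:
        segments.append(Segment("speech", emotion, tail))
    return segments


def silence_pcm(seconds, sample_rate=22050):
    """Return 16-bit mono PCM silence (assumed format, see note above)."""
    return b"\x00\x00" * int(seconds * sample_rate)


if __name__ == "__main__":
    demo = "Hello there. [emotion:happy] Great news! [pause:0.5] Bye."
    for seg in parse_markup(demo):
        print(seg)
```

Each "speech" segment would then be synthesized with whichever model corresponds to its `emotion`, each "pause" segment would become `silence_pcm(seg.seconds)`, and the resulting audio chunks would be concatenated. How audible the seams between models are is an open question this sketch does not address.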

flatsiedatsie commented 9 months ago

Interesting idea!