Hi! This is a fantastic project and opens many new doors. Would you say it's possible to incorporate different emotions in speech synthesis by swapping between different emotion models of a trained voice? In a personal project of my own, I was thinking of pairing this with an LLM to create stop points in the text, indicating when to switch the emotional model to match the intention of the text.
If there's a more built-in way to achieve this that you have planned, I'd love to hear it! Piper is amazing, but it currently lacks the ability to pause/delay or to change emotion mid-utterance.
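For what it's worth, here's roughly how I was picturing the LLM side working: a minimal Python sketch, assuming per-emotion `.onnx` checkpoints of the same trained voice and the `piper` CLI on PATH. The `[emotion]` tag convention and the model filenames are my own invention (something I'd prompt the LLM to emit), not anything Piper provides, and I'm faking the pause/delay by writing silence between segments:

```python
# Hypothetical sketch: swap Piper models per "emotion" segment of LLM-tagged text.
# Assumes per-emotion .onnx checkpoints of one voice, all at the same sample rate,
# and that the `piper` CLI is on PATH. Tag format and filenames are my convention.
import re
import subprocess
import wave

# Map each emotion tag to a (hypothetical) per-emotion model of one voice.
MODELS = {
    "neutral": "en_US-myvoice-neutral.onnx",
    "happy": "en_US-myvoice-happy.onnx",
    "sad": "en_US-myvoice-sad.onnx",
}

def parse_segments(text):
    """Split LLM-annotated text into (emotion, text) pairs."""
    # Matches "[emotion] text until the next tag or end of string".
    return [
        (m.group(1), m.group(2).strip())
        for m in re.finditer(r"\[(\w+)\]\s*([^\[]+)", text)
    ]

def synthesize(text, model, out_path):
    """Run the piper CLI for one segment with the chosen model (text via stdin)."""
    subprocess.run(
        ["piper", "--model", model, "--output_file", out_path],
        input=text.encode("utf-8"),
        check=True,
    )

def concatenate(wav_paths, out_path, pause_s=0.4):
    """Join per-segment WAVs, inserting pause_s seconds of silence between them."""
    with wave.open(wav_paths[0], "rb") as first:
        params = first.getparams()
    n_silence_frames = int(params.framerate * pause_s)
    silence = b"\x00" * (n_silence_frames * params.sampwidth * params.nchannels)
    with wave.open(out_path, "wb") as out:
        out.setparams(params)
        for i, path in enumerate(wav_paths):
            if i:
                out.writeframes(silence)  # the "pause/delay" between segments
            with wave.open(path, "rb") as w:
                out.writeframes(w.readframes(w.getnframes()))

annotated = "[neutral] The results came back. [happy] We did it! [sad] But at what cost."
paths = []
for i, (emotion, sentence) in enumerate(parse_segments(annotated)):
    path = f"segment_{i}.wav"
    synthesize(sentence, MODELS.get(emotion, MODELS["neutral"]), path)
    paths.append(path)
concatenate(paths, "final.wav")
```

The obvious downside is a cold model swap per segment and possible prosody discontinuities at the seams, which is why a built-in mechanism would be so much nicer.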
Lastly, is there any foreseeable way to incorporate non-verbal utterances (laughs, sighs, and the like) in the synthesized audio? I'm hoping for something similar to Suno's Bark, but with Piper's speedy inference.