bi1101 commented 2 months ago

Hi,

I noticed that the repository uses a really smart approach to account for text length by segmenting the text into smaller chunks and generating audio for each segment separately.

But there's a small issue. The part where the stitching happens is very noticeable. When the audio chunks are stitched together, there are no gaps between them, which makes the resulting speech sound a bit rushed and unnatural.

Could you add a short pause between chunks in a future version? Making its length configurable via .env would also be nice.

matatonic commented 2 months ago

Are you experiencing this with the piper or xtts models, or both? Any more details about the voice settings you're using would also help me test.

bi1101 commented 2 months ago

I'm experiencing this with xtts (piper sounds rushed by default in my experience, so it's not an issue there). Every two sentences or so, the audio rushes into the next sentence without a pause. The pace remains similar across different voices. This is the audio file: response.webm

And the request body from the curl call:

{
    "model": "tts-1-hd",
    "input": "Once upon a time in a small village nestled between rolling hills, there lived a young girl named Lila. She was known for her curiosity and love for the stars. Every night, she would climb up the hill behind her house to gaze at the twinkling lights in the sky. Her favorite star was the brightest one, which she named Lumina. One evening, as she was lying on the grass, Lila noticed Lumina flickering strangely. Concerned, she whispered, 'What's wrong, Lumina?' To her amazement, the star responded with a soft glow and a gentle voice, 'I am losing my light, Lila. I need your help.' Determined to save her beloved star, Lila asked, 'What can I do?' Lumina explained that a dark shadow from a distant galaxy was slowly dimming her light. To restore it, Lila would need to gather the light of the purest hearts in her village and send it to Lumina. The next day, Lila set out on a quest. She visited her neighbors, friends, and even the animals in the village, asking for their purest wishes and hopes. She collected them in a small glass jar that sparkled with a soft, golden light. With the jar full, Lila returned to the hill at sunset. Holding the jar high, she whispered a wish for Lumina to shine brightly again. The jar burst open, releasing a beam of golden light that shot up into the sky, enveloping Lumina. The star began to glow with renewed brilliance, brighter than ever before. Lumina twinkled joyfully and thanked Lila for her kindness and bravery. From that night on, Lumina became a guiding star for travelers and a symbol of hope for the village. Lila continued to visit Lumina every night, knowing that the light of a pure heart could brighten even the darkest of skies.",
    "voice": "alloy",
    "response_format": "mp3",
    "speed": "1"
}

matatonic commented 2 months ago

Pretty subtle, but yeah, I hear it. Have you considered using speed: 0.9?

There is no built-in option to add silence in xtts, so the change would mean fiddling with the wav output in the stream. I probably won't do this myself unless it gets more support, but I'm open to a PR for it.
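For anyone picking this up, here is a rough sketch of what "fiddling with the wav output" could look like. It is not the project's code. It assumes the chunks are raw signed 16-bit mono PCM at 24 kHz (adjust if your output differs), and the `CHUNK_GAP_SECONDS` variable name and the helper names are made up for illustration.

```python
import os

# Hypothetical setting name; nothing in this thread defines one.
GAP_ENV_VAR = "CHUNK_GAP_SECONDS"


def silence_bytes(seconds, sample_rate=24000, channels=1, sample_width=2):
    """Return `seconds` of silence as raw signed PCM (all-zero samples)."""
    frames = int(round(seconds * sample_rate))
    return b"\x00" * (frames * channels * sample_width)


def join_with_gaps(chunks, gap_seconds=None, sample_rate=24000):
    """Yield PCM chunks, inserting a silent gap between consecutive chunks."""
    if gap_seconds is None:
        # Assumed default of 0.25 s; read from the environment (.env) if set.
        gap_seconds = float(os.environ.get(GAP_ENV_VAR, "0.25"))
    gap = silence_bytes(gap_seconds, sample_rate)
    first = True
    for chunk in chunks:
        if not first and gap:
            yield gap  # pause before every chunk except the first
        first = False
        yield chunk
```

Because it is a generator, this could wrap a streaming chunk source without buffering the whole response. No gap is added before the first chunk or after the last one.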

bi1101 commented 2 months ago

#55 This should mitigate the issue

matatonic commented 2 months ago

Funny, I figured this out just yesterday too (silence was 0s).

matatonic / openedai-speech

Add Small Gaps Between Audio Chunks to Avoid Rushed Speech #53
