rhasspy / piper

A fast, local neural text to speech system
https://rhasspy.github.io/piper-samples/

Duration and buffer control for inference endpoints #556

Open Vibrat opened 4 months ago

Vibrat commented 4 months ago

Context: I'm calling piper from a streaming application where each chunk of WAV audio is sent to clients as soon as it is ready, to reduce download time. My understanding is that `PiperVoice.synthesize_stream_raw` gives me the raw audio chunks I want, without the WAV header fields (nframes, framerate, ...). This is great, but to stream partial results from piper to clients I also need a way to control the speed/duration of the generated audio, just so I can derive the nframes.

Problem: The header sent at the beginning contains an nframes of 0, simply because we don't know in advance how long the generated audio will be. Is there any way to control either the speed per word or the duration of the audio? That would make it possible to derive the nframes up front.

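For context, here is a minimal sketch of what a `create_wav_header` helper like the one used below could look like (the issue's actual helper isn't shown; 16-bit mono PCM at 22.05 kHz is an assumption, matching common piper voices). Python's `wave` module writes 0 into the frame-count/data-size fields when no frames have been written, which is exactly the zero nframes described above:

```python
import io
import wave

def create_wav_header(sample_rate: int = 22050, sample_width: int = 2, channels: int = 1) -> bytes:
    """Build a 44-byte PCM WAV header with no audio data behind it.

    Sketch only: sample_rate/sample_width/channels are assumptions
    (16-bit mono at a common piper output rate of 22.05 kHz).
    """
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width)
        wav.setframerate(sample_rate)
        # No writeframes() call: on close, wave writes a header
        # with nframes = 0, which is the problem described above.
    return buf.getvalue()
```

A common streaming workaround, independent of piper, is to write a placeholder data-chunk size (0 or 0xFFFFFFFF) and rely on players ignoring it for streamed WAV, but that doesn't help if the client genuinely needs the frame count up front.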

My streaming flow from server to client works as follows:

- First, tell the client about the file metadata:

```python
header_buf = create_wav_header()
sio.emit("piper_assets", ({"tsid": body.tsid, "eventID": event_id}, header_buf), to=sid)
```


- Then generate each chunk of speech from `piper` and deliver it to the client as raw bytes, in order:

```python
# start sending chunks of data;
# tts(chunks) just calls PiperVoice.synthesize_stream_raw internally
for audio in tts(chunks):
    sio.emit("piper_assets", ({"tsid": body.tsid, "eventID": event_id}, audio), to=sid)
```