Context: I'm calling `piper` in a streaming application where each chunk of WAV audio is sent to clients as soon as it is ready, to reduce download time. My understanding is that `PiperVoice.synthesize_stream_raw` gives me the audio chunks as raw bytes, without the WAV header fields (nframes, framerate, ...). This is great, but to stream partial results from `piper` to clients I also need a way to control the speed / duration of the generated audio, just so I can derive `nframes`.
Problem: The header sent at the beginning contains an `nframes` of `0`, simply because we don't yet know how long the generated audio will be. Is there any way to control either the speed per word or the duration of the audio? That would let me derive `nframes` in advance.
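For reference, the arithmetic is straightforward once a duration can be fixed: `nframes` is just duration times sample rate, and the header byte sizes follow from it. A minimal sketch, assuming 16-bit mono PCM; `duration_s` and `sample_rate` here are hypothetical inputs, not Piper API values:

```python
def derive_nframes(duration_s: float, sample_rate: int) -> int:
    """PCM frame count for a clip of the given duration."""
    return int(duration_s * sample_rate)

def wav_byte_sizes(nframes: int, sampwidth: int = 2, nchannels: int = 1):
    """Byte sizes that go into the header: the 'data' chunk and the RIFF chunk."""
    data_bytes = nframes * sampwidth * nchannels
    riff_bytes = 36 + data_bytes  # a canonical PCM WAV header is 44 bytes
    return data_bytes, riff_bytes
```

For example, a 2.5 s clip at 22050 Hz gives 55125 frames and a 110250-byte data chunk.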
My streaming flow from server to client works as follows:
- Upon a client request, create a buffer holding the WAV header and send it to the client.

```python
import io
import wave

def create_wav_header():
    """Create header bytes data for a WAV file (no audio frames yet)."""
    # `sample_rate` is assumed to be defined elsewhere (the voice's sample rate)
    header_buf = io.BytesIO()
    with wave.open(header_buf, "wb") as wave_file:
        # pylint: disable=no-member
        wave_file.setframerate(sample_rate)
        wave_file.setsampwidth(2)   # 16-bit
        wave_file.setnchannels(1)   # mono
        wave_file.writeframes(b'')  # writes the header with nframes = 0
    return header_buf.getvalue()
```
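One possible workaround (a sketch, not Piper-specific, and not guaranteed by the WAV spec) is to hand-build the header with a sentinel size of `0xFFFFFFFF` instead of a real `nframes`; many players treat this as "read until the stream ends":

```python
import struct

def create_streaming_wav_header(sample_rate: int, sampwidth: int = 2,
                                nchannels: int = 1) -> bytes:
    """Hand-built 44-byte PCM WAV header with sentinel (unknown) sizes."""
    byte_rate = sample_rate * nchannels * sampwidth
    block_align = nchannels * sampwidth
    return b"".join([
        b"RIFF",
        struct.pack("<I", 0xFFFFFFFF),  # RIFF chunk size: unknown
        b"WAVE",
        b"fmt ",
        struct.pack("<IHHIIHH", 16, 1, nchannels, sample_rate,
                    byte_rate, block_align, sampwidth * 8),
        b"data",
        struct.pack("<I", 0xFFFFFFFF),  # data chunk size: unknown
    ])
```

Whether this works depends on the client's decoder, so it needs testing against your actual playback path.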
- Then generate each chunk of speech from `piper` and deliver it to the client as raw bytes, in order.

```python
# start sending chunks of data
# tts(chunks) here just calls PiperVoice.synthesize_stream_raw inside
for audio in tts(chunks):
    sio.emit("piper_assets", ({"tsid": body.tsid, "eventID": event_id}, audio), to=sid)
```
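If the duration can't be fixed up front, another option is simply to count the frames actually generated while streaming and report the real `nframes` afterwards (e.g. in a final metadata event). A sketch, assuming raw 16-bit mono chunks as `synthesize_stream_raw` yields; the helper name is hypothetical:

```python
def total_nframes(chunks, sampwidth: int = 2, nchannels: int = 1) -> int:
    """Total PCM frame count across raw byte chunks (16-bit mono by default)."""
    return sum(len(chunk) // (sampwidth * nchannels) for chunk in chunks)
```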
Before any audio chunks, the header itself is emitted to tell the client about the file metadata:

```python
header_buf = create_wav_header()
sio.emit("piper_assets", ({"tsid": body.tsid, "eventID": event_id}, header_buf), to=sid)