rhasspy / piper

A fast, local neural text to speech system
https://rhasspy.github.io/piper-samples/

ONNX streaming support #255

Closed: mush42 closed this 1 month ago

mush42 commented 7 months ago

Link to issue number:

Issue #25

Summary of the issue:

Piper uses sentence-level streaming.

For short sentences, the latency of Piper's output is relatively low thanks to its good real-time factor (RTF). For longer sentences, however, latency is prohibitively high, which hinders real-time applications such as screen readers. For example, at an RTF of 0.3, a 10-second sentence takes roughly 3 seconds of synthesis before any audio can be played.

Description of how this pull request fixes the issue:

This PR implements streaming output by splitting the VITS model into two parts: an encoder and a decoder.

First, the encoder output is generated for the whole utterance at once; that output is then split into chunks of the given chunk size (in frames) and fed to the decoder chunk by chunk.

To maintain speech quality, each chunk is padded with a few frames from the previous and next chunks, and the audio samples corresponding to the padding are trimmed from the final output.
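A minimal sketch of the chunking scheme (not the actual piper_train code; the hop length, chunk size, and decoder stub below are made-up placeholders):

```python
import numpy as np

# Illustrative constants; real values depend on the exported model.
SAMPLES_PER_FRAME = 256  # audio samples per frame (hop length)
CHUNK_SIZE = 45          # decoder chunk size, in frames
PAD_FRAMES = 5           # context frames borrowed from neighbouring chunks

def decode_chunk(frames: np.ndarray) -> np.ndarray:
    """Stand-in for the ONNX decoder: (n_frames, dim) -> 1-D audio.
    A trivial upsampler, just enough to exercise the chunking logic."""
    return np.repeat(frames[:, 0], SAMPLES_PER_FRAME)

def stream_decode(encoder_out: np.ndarray):
    """Yield audio chunk by chunk: pad each chunk with context frames,
    decode it, then trim the samples that belong to the padding."""
    n_frames = encoder_out.shape[0]
    for start in range(0, n_frames, CHUNK_SIZE):
        end = min(start + CHUNK_SIZE, n_frames)
        pad_left = min(PAD_FRAMES, start)
        pad_right = min(PAD_FRAMES, n_frames - end)
        audio = decode_chunk(encoder_out[start - pad_left : end + pad_right])
        lo = pad_left * SAMPLES_PER_FRAME
        hi = audio.shape[0] - pad_right * SAMPLES_PER_FRAME
        yield audio[lo:hi]

if __name__ == "__main__":
    fake = np.random.randn(100, 192).astype(np.float32)
    audio = np.concatenate(list(stream_decode(fake)))
    assert audio.shape[0] == 100 * SAMPLES_PER_FRAME
```

With the stub decoder, concatenating the yielded chunks reproduces the full-utterance output exactly; with the real decoder, the padding frames give each chunk enough acoustic context to avoid artifacts at chunk boundaries.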

To export a checkpoint, use the command:

python3 -m piper_train.export_onnx_streaming CHECKPOINT_PATH ONNX_OUTPUT_DIR
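As a sanity check after export, the two models can be loaded with onnxruntime to inspect their input and output signatures (the file names below are placeholders; use whatever files the export script writes to ONNX_OUTPUT_DIR):

```python
import onnxruntime as ort

# Placeholder paths; substitute the files produced by the export step.
enc = ort.InferenceSession("onnx_output_dir/encoder.onnx")
dec = ort.InferenceSession("onnx_output_dir/decoder.onnx")

for name, sess in (("encoder", enc), ("decoder", dec)):
    print(name, "inputs: ", [(i.name, i.shape) for i in sess.get_inputs()])
    print(name, "outputs:", [(o.name, o.shape) for o in sess.get_outputs()])
```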

For inference, use the command:

cat input.json | python3 -m piper_train.infer_onnx_streaming --encoder ENCODER_ONNX_PATH --decoder DECODER_ONNX_PATH

The inference command writes wave bytes to stdout, so you can redirect the output to any wave-playing program.
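For example, assuming the stream is 16-bit mono PCM at 22,050 Hz (typical for medium-quality Piper voices; adjust to your model, and drop the raw-format flags if the stream carries WAV headers), the output can be piped straight to aplay:

cat input.json | python3 -m piper_train.infer_onnx_streaming --encoder ENCODER_ONNX_PATH --decoder DECODER_ONNX_PATH | aplay -r 22050 -f S16_LE -c 1 -t raw -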

Testing performed:

Tested export and inference using the hfc-male checkpoint.

Known issues with pull request:

The encoder still contains many components that could be moved into the decoder to further reduce latency, but moving them impacts naturalness. There is a trade-off between encoder inference speed (latency) and the naturalness of the generated speech.

For instance, the flow component can be placed in either the encoder or the decoder. Kept in the encoder, it adds significant latency; moved to the decoder, chunking its input may impact speech quality (not yet verified).

We need to empirically determine which components can be made streamable and which should generate their output all at once.

mush42 commented 6 months ago

@synesthesiam There is a working implementation of this in the piper-rs repo.

Do you feel positive about merging this?

Best, Musharraf

mush42 commented 6 months ago

@synesthesiam I think this is ready for merging.

marty1885 commented 6 months ago

Just dropping by to say I love this! I've written my own C++ inference server, and this is a major issue I've run into.

marty1885 commented 6 months ago

@mush42 How do I get input.json? I've been trying to generate phoneme IDs manually, but I get no output (a zero-length stream). Can you provide an example?

eeejay commented 3 months ago

I don't fully understand everything in this pull request, but I have a feeling this approach could be used to implement word tracking, since sub-sentence phonemes can be synthesized in chunks. It would be cool if the streaming API were available through a PiperVoice.

mush42 commented 2 months ago

@eeejay Phoneme duration is a better option for word tracking.