rhasspy / piper

A fast, local neural text to speech system
https://rhasspy.github.io/piper-samples/
MIT License

No reading tracking with Piper speech synthesis #361

Open patrick-emmabuntus opened 8 months ago

patrick-emmabuntus commented 8 months ago

Hello,

I used Calibre 6.13 with ebook-speaker on Debian 12.

The goal is to allow blind people to listen to the content of ebooks.

In order to have better-quality reading, I want to replace eSpeak-ng with Piper. Playback with Piper works well, but unlike eSpeak-ng it does not report the playback tracking data back to Calibre (see screenshot below).

In the eSpeak-ng synthesizer engine, the "EspeakIndexing" option is set to 1, which activates word tracking.

This function is very important because it lets Calibre follow the voice reading, so that when an ebook is reopened the reader can return to where they left off.

Do you know if such a function is available in Piper?

And if so, how to activate it?

Thank you in advance for your advice.

[Screenshot: Calibre_espeak_ng_scroll_speech]

SeymourNickelson commented 7 months ago

This would be an awesome feature to have. It should be possible to synthesize each word independently and fire a callback just before the audio for each word is played, but I would assume the voice wouldn't sound as realistic, because you'd be feeding the model one word at a time.

I wonder if there is another way to highlight words as they are played without impacting the quality of the output.
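The per-word idea above can be sketched independently of Piper itself. In this hypothetical helper, `synthesize` stands in for whatever actually produces audio for a chunk of text (e.g. a subprocess call to the `piper` CLI), and `on_word` is the callback that would drive highlighting; both names are assumptions for illustration, not part of any real API.

```python
from typing import Callable, List, Tuple

def speak_by_word(
    text: str,
    synthesize: Callable[[str], bytes],
    on_word: Callable[[int, str], None],
) -> List[Tuple[str, bytes]]:
    """Synthesize each word separately, firing a callback before each one.

    `synthesize` is a stand-in for the real TTS call (e.g. piping the word
    to the piper CLI); `on_word` would tell the reader UI which word is
    about to be spoken, so it can highlight or record the position.
    """
    results = []
    for index, word in enumerate(text.split()):
        on_word(index, word)                  # notify before playback
        results.append((word, synthesize(word)))
    return results
```

As the comment notes, the trade-off is quality: the model never sees cross-word context, so prosody suffers compared to synthesizing a whole sentence.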

patrick-emmabuntus commented 7 months ago

Thank you @SeymourNickelson for your advice.

Indeed, if the words are read one by one, this will degrade the synthesized speech.

On the other hand, the speech synthesis should keep reading the words normally while a reading position is sent to Calibre, so there may be a small gap between the word being spoken and the word displayed in Calibre. The goal is to let a blind reader return to the position where they stopped during the previous session, rather than having to reread the entire chapter.
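One way to report an approximate position without synthesizing word by word is to estimate when each word starts inside the generated audio and map playback time back to a word index. This is a minimal sketch assuming per-word durations are available somehow (e.g. estimated from phoneme timing); the function names are hypothetical.

```python
from bisect import bisect_right
from typing import List

def word_start_times(durations: List[float]) -> List[float]:
    """Cumulative start time (seconds) of each word, from per-word durations."""
    starts, t = [], 0.0
    for d in durations:
        starts.append(t)
        t += d
    return starts

def word_at(playback_time: float, starts: List[float]) -> int:
    """Index of the word being played at `playback_time` seconds."""
    return max(0, bisect_right(starts, playback_time) - 1)
```

A reader application could poll the audio player's position, call `word_at`, and hand the corresponding text offset to Calibre, accepting the small gap described above.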

contentnation commented 7 months ago

I had time to look into the way Calibre works. Sadly, I have bad news. Short version: Calibre uses Speech Dispatcher for generating the audio. You can add custom text-to-speech tools (like Piper), but for the highlighting feature you need to add direct support for Piper in Speech Dispatcher to provide the "magic", plus some work on the Piper side for the other part of it.

For those who want to continue developing this, a few notes (or TODOs): Speech Dispatcher needs marker functionality similar to what src/modules/espeak.c implements. As soon as such a marker is received, wait for the audio data and tell upstream about the marker. On the Piper side, the markers need to be used to split the input; when a marker is reached, Piper should send the audio timestamp generated so far together with the audio data up to that point. The current generic output module always filters those markers out before the text is sent to Piper (or any other external TTS).
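The marker-splitting step described above can be sketched in isolation. Speech Dispatcher's index marks look like SSML `<mark name="..."/>` tags; the sketch below (names and regex are my own, not from either codebase) splits input into (mark name, text segment) pairs, which is the shape a Piper-aware module would need before synthesizing each segment and reporting the mark with the audio timestamp reached so far.

```python
import re
from typing import List, Optional, Tuple

# SSML-style self-closing index mark, e.g. <mark name="w17"/>
MARK_RE = re.compile(r'<mark\s+name="([^"]*)"\s*/>')

def split_at_marks(ssml_text: str) -> List[Tuple[Optional[str], str]]:
    """Split input text at index marks.

    Returns (mark_name, text) pairs: each segment is the text that follows
    the named mark; the first segment has mark_name None. A TTS module
    could synthesize segment by segment and report each mark upstream
    once the audio for the preceding segment has been produced.
    """
    segments = []
    pos = 0
    current: Optional[str] = None
    for m in MARK_RE.finditer(ssml_text):
        segments.append((current, ssml_text[pos:m.start()]))
        current = m.group(1)
        pos = m.end()
    segments.append((current, ssml_text[pos:]))
    return segments
```

This is exactly the information the generic output path currently discards, which is why the comment says the markers never reach Piper today.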

patrick-emmabuntus commented 7 months ago

Thank you very much @contentnation for your advice.