johnfactotum / foliate

Read e-books in style
https://johnfactotum.github.io/foliate/
GNU General Public License v3.0
5.25k stars, 254 forks

Add Spiel speech API #1335

Open eeejay opened 3 weeks ago

eeejay commented 3 weeks ago

Spiel is a modern speech synthesis API for the desktop that will hopefully support many kinds of providers and voices. It has GI bindings, so adding it to foliate shouldn't be hard.

I started a port, but ran into some trouble with how foliate creates SSML mark elements to report speech progress. Spiel has speech boundary events, including SSML marks, but not all providers support them. Unfortunately, I think changes will be needed in foliate-js as well to make this work. Specifically, pre-segmenting the text into marks gets in the way here.

johnfactotum commented 3 weeks ago

If you mean supporting providers that don't support mark events but do support word boundary events, that is indeed not something currently supported by foliate-js. I do plan on adding this, since SSML support seems to be problematic in browsers. (Ideally I'd like to switch to the Web Speech API once it's supported by WebKitGTK.)

One slight snag is that the marks are currently also used to implement "speak from here" and pausing, mainly to ensure that the speech always begins from word boundaries. Maybe it could do without this (or decouple this from the speech text), in which case it shouldn't be too hard to do without marks. Just need to maintain a text walker instance (see https://github.com/johnfactotum/foliate-js/blob/main/text-walker.js) to convert string offsets to ranges.
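As a rough illustration of the offset-to-range conversion mentioned above, here is a minimal sketch. It is not the API of foliate-js's `text-walker.js`; `offsetsToBoundaries` is a hypothetical helper, and it assumes the spoken text is exactly the concatenation of the text nodes passed in, in document order.

```javascript
// Hypothetical helper, NOT the foliate-js text walker API.
// `nodes` is an ordered list of text-node-like objects with a string `nodeValue`.
// `start` and `end` are offsets into the concatenation of all node values.
function offsetsToBoundaries(nodes, start, end) {
    let pos = 0 // offset of the current node within the concatenated text
    let startNode = null
    let startOffset = 0
    for (const node of nodes) {
        const next = pos + node.nodeValue.length
        // the start boundary goes in the first node that contains `start`
        if (startNode === null && start < next) {
            startNode = node
            startOffset = start - pos
        }
        // the end boundary goes in the first node that reaches `end`
        if (startNode !== null && end <= next) {
            return { startNode, startOffset, endNode: node, endOffset: end - pos }
        }
        pos = next
    }
    return null // offsets fall outside the collected text
}
```

In a web view, the nodes could be collected with `document.createTreeWalker(root, NodeFilter.SHOW_TEXT)`, and the result applied to a `Range` with `range.setStart(startNode, startOffset)` and `range.setEnd(endNode, endOffset)`.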

eeejay commented 3 weeks ago

Moving it into the web view makes sense. Happy to see that it is at least proposed in WebKitGTK.

Note: because of the way Spiel works, there is no need to track the speech progress for pausing purposes. Calling pause will simply pause the stream. So I think SSML marks would need to stay exclusively for Speech Dispatcher support. And yeah, we would need to de-serialize the string offset to a DOM range, which may or may not be trivial?

johnfactotum commented 3 weeks ago

> Calling pause will simply pause the stream.

Ideally I think it should be capable of rewinding to the start of the last word or sentence, rather than simply pausing the stream. The current behavior in Foliate isn't good either, because it simply restarts from a word boundary, which would result in incorrect pronunciation or intonation. The same goes for "speak from here" (i.e. starting speech from a user-selected position).

Avoiding partial words isn't really an important feature, though. Mostly it's just what you get "for free" since the text is already segmented. It's probably fine to just drop it.
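For reference, rewinding to the start of the current word or sentence without pre-segmented marks could be sketched with the standard `Intl.Segmenter` API. This is only an illustration, not what Foliate does; `segmentStart` is a hypothetical helper, the `'en'` locale is an assumption, and it presumes `Intl.Segmenter` is available in the JavaScript engine in use.

```javascript
// Hypothetical helper: given the plain speech text and the offset where
// speech was paused, find where the enclosing word or sentence begins.
function segmentStart(text, offset, granularity = 'word', locale = 'en') {
    const segmenter = new Intl.Segmenter(locale, { granularity })
    // containing() returns undefined when the offset is out of bounds
    const segment = segmenter.segment(text).containing(offset)
    return segment ? segment.index : null
}
```

Note that with `'word'` granularity, whitespace and punctuation are segments of their own, so an offset that lands on a space returns the position of that space rather than of the preceding word.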

> And yeah, we would need to de-serialize the string offset to a DOM range, which may or may not be trivial?

For plain text it should be okay. It might be non-trivial when using SSML, when the offsets are that of the SSML source string, because in general mapping source offsets to nodes is difficult if not impossible with the browser's DOM APIs. But since the XML here is controlled and rather simple, maybe one can just count the number of characters between < and > and adjust the offsets accordingly.
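The character-counting idea above could look roughly like the following sketch. `ssmlOffsetToTextOffset` is a hypothetical name, and the code assumes the SSML is generated by the app and simple: no `<` or `>` inside attribute values, no comments or CDATA sections, and no character entities (an entity such as `&amp;` would be counted as five characters here, so it would need extra handling).

```javascript
// Hypothetical helper: convert an offset into the SSML source string to an
// offset into the plain text, by skipping everything between '<' and '>'.
function ssmlOffsetToTextOffset(ssml, sourceOffset) {
    let textOffset = 0
    let inTag = false
    const end = Math.min(sourceOffset, ssml.length)
    for (let i = 0; i < end; i++) {
        const ch = ssml[i]
        if (ch === '<') inTag = true
        else if (ch === '>') inTag = false
        else if (!inTag) textOffset++ // count only characters outside tags
    }
    return textOffset
}
```

The resulting plain-text offset can then be mapped to a DOM range the same way as for plain text.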