w3c / at-driver

AT Driver defines a protocol for introspection and remote control of assistive technology software, using a bidirectional communication channel.
https://w3c.github.io/at-driver

Adding nuance to "Observe spoken text" #6

Open WestonThayer opened 2 years ago

WestonThayer commented 2 years ago

https://github.com/bocoup/aria-at-automation#observe-spoken-text

Scoping speech metadata sent to the TTS

While exploring NVDA's source and the TTS-engine side of the SAPI 5.4 API, I realized that screen readers send the TTS much more than basic strings of text to be spoken. In the case of SAPI, NVDA sends SSML, which ISpTTSEngine::Speak receives in SPVSTATE.

Metadata includes:

Should "observe spoken text" should include this level of detail?

The technical speech observation solution is scoped to a particular TTS API

Looking through NVDA's source, I realized it has many synthDrivers, currently for SAPI 4, SAPI 5, OneCore, and eSpeak. Our SAPI 5 driver only exercises NVDA's SAPI 5 code path.

Is it worth documenting this... tradeoff?

Pragmatically, I think the chance of finding a bug in a specific TTS driver is low, and a comprehensive solution probably isn't worth the effort. That said, the drivers do have some complexity: synthDrivers/oneCore.py maintains its own queue, and all three have different SSML-handling algorithms (looking at the commit history, eSpeak seems to allow malformed SSML while OneCore rejects it).
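
A rough sketch of the coverage picture (module names follow my reading of NVDA's synthDrivers directory; the flags and notes just paraphrase the observations above, not documented behaviour):

```python
# Sketch of the coverage trade-off: observing speech at the SAPI 5 boundary
# exercises only one of NVDA's synth driver code paths. The notes paraphrase
# the observations in this issue rather than documented behaviour.
SYNTH_DRIVER_COVERAGE = {
    "synthDrivers/sapi4.py": {"observed_by_sapi5_hook": False},
    "synthDrivers/sapi5.py": {"observed_by_sapi5_hook": True},
    "synthDrivers/oneCore.py": {
        "observed_by_sapi5_hook": False,
        "note": "maintains its own queue; appears to reject malformed SSML",
    },
    "synthDrivers/espeak.py": {
        "observed_by_sapi5_hook": False,
        "note": "appears to tolerate malformed SSML",
    },
}
```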

jscholes commented 2 years ago

@WestonThayer This is great information; thanks for carrying out the research and writing it up.

Keep in mind that a virtual system-level (i.e. SAPI5 on Windows) engine is only one of the paths that will be investigated going forward. It is likely that screen-reader-specific code will also be needed to implement parts of the automation driver protocol, and such in-process facilities may also involve capturing the speech before it even leaves the screen reader's boundaries, e.g. with a "tee"-like synth driver to allow speech to be captured while also speaking it out loud for developers and/or testers. That would make use of similar things to what you've outlined here, albeit SR-specific internal ones, e.g. NVDA's formatting/command fields.