w3c / at-driver

AT Driver defines a protocol for introspection and remote control of assistive technology software, using a bidirectional communication channel.
https://w3c.github.io/at-driver

Adding nuance to "Observe spoken text" #6

Open WestonThayer opened 2 years ago

WestonThayer commented 2 years ago

https://github.com/bocoup/aria-at-automation#observe-spoken-text

Scoping speech metadata sent to the TTS

While exploring NVDA's source and the TTS-engine side of the SAPI 5.4 API, I realized that screen readers send the TTS much more than basic strings of text to be spoken. In the case of SAPI, NVDA sends SSML, which ISpTTSEngine::Speak receives in SPVSTATE.

Metadata includes:

Should "observe spoken text" should include this level of detail?

The technical speech observation solution is scoped to a particular TTS API

Looking through NVDA's source, I realized it has many synthDrivers, currently for SAPI 4, SAPI 5, OneCore, and eSpeak. Our SAPI 5 driver only exercises NVDA's SAPI 5 code path.

Is it worth documenting this... tradeoff?

Pragmatically, I think the chance of finding a bug in a specific TTS driver is low, and a comprehensive solution probably isn't worth the effort. That said, the drivers do have some complexity: synthDrivers/oneCore.py maintains its own queue, and all three have different SSML-handling algorithms (looking at the commit history, eSpeak seems to allow malformed SSML while OneCore rejects it).
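
A rough sketch of the coverage picture (module names follow my reading of NVDA's synthDrivers directory; the flags and notes just paraphrase the observations above, not documented behaviour):

```python
# Sketch of the coverage trade-off: observing speech at the SAPI 5 boundary
# exercises only one of NVDA's synth driver code paths. The notes paraphrase
# the observations in this issue rather than documented behaviour.
SYNTH_DRIVER_COVERAGE = {
    "synthDrivers/sapi4.py": {"observed_by_sapi5_hook": False},
    "synthDrivers/sapi5.py": {"observed_by_sapi5_hook": True},
    "synthDrivers/oneCore.py": {
        "observed_by_sapi5_hook": False,
        "note": "maintains its own queue; appears to reject malformed SSML",
    },
    "synthDrivers/espeak.py": {
        "observed_by_sapi5_hook": False,
        "note": "appears to tolerate malformed SSML",
    },
}
```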

jscholes commented 2 years ago

@WestonThayer This is great information; thanks for carrying out the research and writing it up.

Keep in mind that a virtual system-level (i.e. SAPI5 on Windows) engine is only one of the paths that will be investigated going forward. It is likely that screen-reader-specific code will also be needed to implement parts of the automation driver protocol, and such in-process facilities may also involve capturing the speech before it even leaves the screen reader's boundaries, e.g. with a "tee"-like synth driver to allow speech to be captured while also speaking it out loud for developers and/or testers. That would make use of similar things to what you've outlined here, albeit SR-specific internal ones, e.g. NVDA's formatting/command fields.