Open WestonThayer opened 2 years ago
@WestonThayer This is great information; thanks for carrying out the research and writing it up.
Keep in mind that a virtual system-level engine (i.e. SAPI5 on Windows) is only one of the paths that will be investigated going forward. It is likely that screen-reader-specific code will also be needed to implement parts of the automation driver protocol. Such in-process facilities may also involve capturing the speech before it even leaves the screen reader's boundaries, e.g. with a "tee"-like synth driver that captures speech while also speaking it out loud for developers and/or testers. That approach would draw on the same kind of metadata you've outlined here, albeit in SR-specific internal forms, e.g. NVDA's formatting/command fields.
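To make the "tee" idea concrete, here is a minimal sketch in plain Python. It does not use NVDA's actual synth driver API: `Synth`, `RecordingSynth`, and `TeeSynth` are hypothetical names, and the assumption is only that a synth exposes a `speak` method taking a sequence of strings mixed with command objects.

```python
from dataclasses import dataclass, field
from typing import Any, List, Protocol


class Synth(Protocol):
    """Hypothetical minimal synth interface (not NVDA's real one)."""

    def speak(self, sequence: List[Any]) -> None: ...


@dataclass
class RecordingSynth:
    """Stand-in for a real synth; it only remembers what it was asked to speak."""

    spoken: List[List[Any]] = field(default_factory=list)

    def speak(self, sequence: List[Any]) -> None:
        self.spoken.append(list(sequence))


@dataclass
class TeeSynth:
    """Forwards every speech sequence to the wrapped synth and keeps a copy."""

    inner: Synth
    captured: List[List[Any]] = field(default_factory=list)

    def speak(self, sequence: List[Any]) -> None:
        # Copy first so later mutation by the caller does not alter the log.
        self.captured.append(list(sequence))
        # The wrapped synth still receives the sequence, so audio is unaffected.
        self.inner.speak(sequence)

    def captured_text(self) -> str:
        """Join only the plain-text items, skipping command objects."""
        return " ".join(
            item for seq in self.captured for item in seq if isinstance(item, str)
        )
```

An automation layer could then read `captured` (with commands) or `captured_text()` (words only) while the developer still hears the output.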
https://github.com/bocoup/aria-at-automation#observe-spoken-text
Scoping speech metadata sent to the TTS
While exploring NVDA's source and the TTS-engine side of the SAPI 5.4 API, I realized that screen readers send much more than basic speech strings to be spoken by the TTS. In the case of SAPI, NVDA sends SSML, which `ISpTTSEngine::Speak` receives in `SPVSTATE`. Metadata includes:
- `LangID` - the language associated with the whole or a section of an announcement, which the TTS can use to adjust vocalization. We could use this to test that NVDA correctly processes a multi-language web page
- `EmphAdj` - Not sure this is used, but presumably could ensure that `<em>` semantics are picked up and conveyed by the screen reader
- `PitchAdj` - Could test that NVDA is correctly increasing pitch for capital letters
- `SilenceMSecs` - Via the SSML `<silence>` tag, NVDA inserts this for `BreakCommand`s. Could be used to test appropriate cadence
- `SPVA_Pronounce` and `SPVA_SpellOut`. I think NVDA provides its own spelling functionality, but does appear to use `<pron>`
Should "observe spoken text" include this level of detail?
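One way to picture the two levels of detail is a sketch like the following. It is an assumption-laden model, not a proposed format: `Fragment`, `Action`, `text_only`, and `detailed` are hypothetical names, the fields simply mirror the SAPI metadata listed above, and the default `lang_id` of `0x0409` (en-US) is an arbitrary choice for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Action(Enum):
    # Values mirror names from SAPI's SPVACTIONS enumeration.
    SPEAK = "SPVA_Speak"
    SILENCE = "SPVA_Silence"
    PRONOUNCE = "SPVA_Pronounce"
    SPELL_OUT = "SPVA_SpellOut"


@dataclass(frozen=True)
class Fragment:
    """One captured text fragment plus the state the engine received with it."""

    text: str
    action: Action = Action.SPEAK
    lang_id: int = 0x0409  # Windows LANGID; 0x0409 is en-US (assumed default)
    emph_adj: int = 0
    pitch_adj: int = 0
    silence_msecs: int = 0


def text_only(fragments: List[Fragment]) -> str:
    """Coarse view: just the words, ignoring all metadata."""
    return " ".join(f.text for f in fragments if f.text)


def detailed(fragments: List[Fragment]) -> List[dict]:
    """Fine-grained view: report only fields that differ from the defaults."""
    default = Fragment(text="")
    out = []
    for f in fragments:
        entry = {"text": f.text}
        for name in ("action", "lang_id", "emph_adj", "pitch_adj", "silence_msecs"):
            if getattr(f, name) != getattr(default, name):
                entry[name] = getattr(f, name)
        out.append(entry)
    return out
```

A test that only cares about wording would assert on `text_only(...)`, while a test for, say, capital-letter pitch would inspect `pitch_adj` in the `detailed(...)` output.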
Technical speech observation solution scopes to a particular TTS API
I realized looking through NVDA's source that it has many synthDrivers, currently for SAPI 4, SAPI 5, OneCore, and eSpeak. Our SAPI 5 driver only tests NVDA's code path for SAPI 5.
Is it worth documenting this... tradeoff?
Pragmatically, I think the chance of finding a bug in a specific TTS driver is low, and finding a comprehensive solution probably isn't worth the effort. That said, the drivers do have some complexity. synthDrivers/oneCore.py maintains its own queue. All 3 have different SSML algorithms (looking at commit history, eSpeak seems to allow malformed SSML while OneCore rejects it).
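As a generic illustration of why strict and lenient markup handling can diverge, consider the sketch below. It does not reproduce what eSpeak, OneCore, or NVDA's drivers actually do; `strict_text` and `lenient_text` are hypothetical helpers built only on Python's standard library.

```python
import re
import xml.etree.ElementTree as ET


def strict_text(markup: str) -> str:
    """Reject malformed markup outright, as a strict consumer would."""
    root = ET.fromstring(markup)  # raises ET.ParseError on malformed input
    return "".join(root.itertext())


def lenient_text(markup: str) -> str:
    """Tolerate malformed markup by stripping anything that looks like a tag."""
    return re.sub(r"<[^>]*>", "", markup)
```

Given markup with an unclosed element, `lenient_text` still yields the words while `strict_text` raises, so the same upstream bug would surface on one code path and stay hidden on the other.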