w3c / at-driver

AT Driver defines a protocol for introspection and remote control of assistive technology software, using a bidirectional communication channel.
https://w3c.github.io/at-driver

The limitations of polling for vocalized text #32

Open jugglinmike opened 1 year ago

jugglinmike commented 1 year ago

Hi everyone! We've identified two ways to capture VoiceOver's utterances as textual data: by installing a "voice" which broadcasts the stream of data it receives from the operating system, or by repeatedly querying (a.k.a. "polling") a property where the operating system stores the most recently-vocalized text.

In this issue, I'd like to outline the deficiencies of the polling approach with the goal of ruling it out for future considerations. I'd love to hear from anyone who believes that polling could still be viable, either in this discussion thread or during the ARIA-AT Community Group meeting on 2022-10-24 at 15:00 ET (teleconference and IRC: irc://irc.w3.org:6667/#aria-at).


One method for programmatically observing text vocalized by Apple VoiceOver is querying the "last phrase" API (accessible through AppleScript and JXA). Using this technique to observe behavior during an active scripting session necessarily involves repeatedly reading the value as it changes over time, a practice commonly referred to as "polling."
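For concreteness, here is a minimal sketch of a single "last phrase" query, issued from Swift via NSAppleScript (the same query can be made directly from AppleScript or JXA). It assumes VoiceOver's AppleScript control has been enabled (in VoiceOver Utility), and error handling is elided.

```swift
import Foundation

// Minimal sketch: one query of VoiceOver's "last phrase" scripting property.
// Assumes "Allow VoiceOver to be controlled with AppleScript" is enabled.
let source = """
tell application "VoiceOver"
    return content of last phrase
end tell
"""
var error: NSDictionary?
if let phrase = NSAppleScript(source: source)?
    .executeAndReturnError(&error)
    .stringValue {
    print(phrase) // the text VoiceOver most recently vocalized
}
```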

Deficiencies

The practice of polling for vocalizations presents at least two risks of information loss.

Missing vocalizations due to rapid dispatch

In a solution built on polling, the value being observed could change multiple times between queries, so there is no guarantee that every event will be captured. Missing a vocalization could result in false negatives whenever the test runner fails to observe an expected speech event before a subsequent event is dispatched. It could also result in false positives when an unexpected speech event is "masked" by a subsequent speech event.

Polling at a higher frequency can mitigate this issue, but it cannot eliminate race conditions or missed events. Assumptions based on timing are fragile, particularly in the resource-constrained continuous integration environments where we expect this system to be deployed. Behavior that appears reliable to implementers could nonetheless produce flaky tests in practice, and diagnosing (or even reproducing) the resulting errors would be particularly challenging.
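To make the failure mode concrete, here is a sketch of such a polling loop in Swift. The helpers queryLastPhrase() and record(_:) are hypothetical stand-ins (e.g. the NSAppleScript query above and a test-runner transcript), not part of any real API; the comments mark where this section's timing problem, and the repetition problem described in the next section, arise.

```swift
import Foundation

// Hypothetical polling loop; queryLastPhrase() and record(_:) are
// illustrative stand-ins, not part of any real API.
func queryLastPhrase() -> String? {
    nil // e.g., the NSAppleScript query sketched earlier
}

func record(_ phrase: String) {
    print(phrase) // e.g., append to the test runner's transcript
}

var lastSeen: String?
while true {
    if let phrase = queryLastPhrase(), phrase != lastSeen {
        record(phrase)
        lastSeen = phrase
    }
    // 1. Rapid dispatch: an utterance spoken and then replaced within this
    //    50 ms window is never observed, however short the window is made.
    // 2. Repeated content: an utterance identical to lastSeen fails the
    //    inequality check above, so an intentional repeat is silently lost.
    Thread.sleep(forTimeInterval: 0.05)
}
```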

Missing vocalizations due to repeated content

In a solution built on polling, it's not possible to detect when a given utterance is repeated, because the speech data carries no timestamp or other identifying information. This could result in false negatives when the application under test is expected to present duplicate content. It could also cause false positives: if the screen reader inappropriately repeats an utterance (whether due to a content error or a bug in the AT itself), the repetition would be unobservable. (In the loop sketched above, this is the inequality check: consecutive identical utterances are silently dropped.)

For the ARIA-AT project, we might be able to avoid this problem through careful test design. ARIA-AT could even enforce such a policy with an automated process that alerts contributors when their submissions run afoul of it. However, this very condition (e.g. two list items with identical content) could conceivably constitute a meaningful regression test, so such a restriction could reduce ARIA-AT's utility.

For web developers relying on a standard protocol, an equivalent mitigation strategy (e.g. a warning about repeated content) would be easy to overlook. Platform designers refer to this kind of flaw as a "foot gun." Compliance would also be difficult, as there are many legitimate situations where repetition is desirable. Such a limitation could ultimately represent a hazard in the application of the technology.

Ruling out polling

We believe these deficiencies are fundamental to a polling approach, and that if polling were used as the foundation of an automation solution, they would threaten the viability of that system. In other words: the "last phrase" feature of the VoiceOver scripting API is insufficient for the ARIA-AT project and the developing standard.

This is why we have been developing a solution that integrates with the macOS text-to-speech engine. In addition to enjoying the robustness of a publisher/subscriber communication paradigm, we also appreciate how the in-band metadata available through this method (e.g. markers for changes to volume and pitch) opens the door for future enhancements to ARIA-AT.
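As a rough illustration of the contrast with polling, here is a minimal publisher/subscriber sketch in Swift. The names SpeechEvent and SpeechEventBus are ours, chosen for illustration, and do not reflect the prototype's actual API; the point is simply that the voice pushes every utterance, in order, so nothing is sampled away.

```swift
// Illustrative pub/sub sketch; SpeechEvent and SpeechEventBus are our own
// names, not the prototype's actual API.
struct SpeechEvent {
    let text: String
    let volume: Float? // in-band metadata forwarded from the engine
    let pitch: Float?
}

final class SpeechEventBus {
    private var subscribers: [(SpeechEvent) -> Void] = []

    func subscribe(_ handler: @escaping (SpeechEvent) -> Void) {
        subscribers.append(handler)
    }

    // Called by the voice once per utterance, in order, with no sampling:
    // repeats and rapid sequences all reach every subscriber.
    func publish(_ event: SpeechEvent) {
        subscribers.forEach { $0(event) }
    }
}

// A test runner simply subscribes and records everything it is handed.
let bus = SpeechEventBus()
bus.subscribe { print($0.text) }
```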

jscholes commented 1 year ago

@jugglinmike Great write-up!

We (PAC) are using the outlined polling approach in an internal tool for human testers (hopefully to be released soon), because the advantages drastically outweigh the cognitive load and physical/logistical demands of manually gathering complex output from VoiceOver. In such a scenario, it is the human tester's responsibility to error-check the output they are submitting, to ensure that no speech was omitted (including intended repetition).

However, I am aligned with the limitations as stated here, and agree completely that they rule it out as an approach for robust automation without manual oversight.

jugglinmike commented 1 year ago

Thanks to everyone who attended yesterday's community group meeting! The attendees generally agreed that a polling API built on the current capabilities of AppleScript's VoiceOver "last phrase" property is fundamentally insufficient for the purposes of ARIA-AT. However, some concern was expressed about the stability of an event-based approach built on the Speech Synthesis Manager, which we've been using in our macOS prototype[^1] but which was recently deprecated. The practical realities there may motivate a decision to accept the risks of polling, so we decided to investigate further before formally ruling it out.

macOS Ventura (released on 2022-10-24) introduces a new speech synthesis API called AVSpeechSynthesisProviderAudioUnit. That seems like the API most likely to be stable and well-supported in the future (@mcking65 mentioned that requiring "Ventura or later" would be acceptable), but we have no experience with the API or its capabilities yet[^2].
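For a sense of the shape of that API, a hypothetical Swift skeleton of a capture voice built on it might look like the following. The class, the overrides, and ssmlRepresentation come from Apple's documentation; the voice metadata and the broadcast(_:) forwarding are our assumptions, and actual audio rendering (the class is an AUAudioUnit subclass) is omitted.

```swift
import AVFAudio

// Hypothetical skeleton of a capture voice on the Ventura API. The class and
// override names come from Apple's documentation; the voice metadata and the
// broadcast(_:) forwarding are assumptions, and audio rendering is omitted.
class CapturingAudioUnit: AVSpeechSynthesisProviderAudioUnit {
    override var speechVoices: [AVSpeechSynthesisProviderVoice] {
        get {
            [AVSpeechSynthesisProviderVoice(
                name: "AT Driver Capture",          // illustrative
                identifier: "com.example.capture",  // illustrative
                primaryLanguages: ["en-US"],
                supportedLanguages: ["en-US"])]
        }
        set {}
    }

    override func synthesizeSpeechRequest(_ request: AVSpeechSynthesisProviderRequest) {
        // The engine pushes every utterance here with its full SSML text,
        // so each vocalization (repeats included) is observed exactly once.
        broadcast(request.ssmlRepresentation)
    }

    override func cancelSpeechRequest() {
        // A real synthesizer would stop rendering here.
    }

    private func broadcast(_ ssml: String) {
        // Assumption: forward to the AT Driver server, e.g. over a local socket.
    }
}
```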

We would like to gather some additional context around the relationship between the old Speech Synthesis Manager and the new AVSpeechSynthesisProviderAudioUnit. @jscholes suggested there might be distribution restrictions (or permissions restrictions?) on custom voices not using the new API, separate from any deprecation concerns. @cookiecrook, can you share any context here on the API changes and what the deprecation status means practically for distributing a tool used by the ARIA-AT project?

[^1]: The source code for our prototype is available here and follows sample code found in a Morse Synthesizer demo.

[^2]: The new API has partial sample code here.

lolaodelola commented 5 months ago

@jugglinmike is this still relevant, considering that we're now two macOS versions ahead of Ventura?

jugglinmike commented 5 months ago

@lolaodelola Yup, this is still relevant.

It's true that we still expect to use a "push-based" approach (as opposed to a polling approach) for our forthcoming macOS implementation. Even so, I'd like to leave open the possibility of polling (and likewise leave this issue open) as a fallback, flawed though it may be, until we've deployed a functional implementation.