w3c / aria-at-automation-driver

A WebSocket server which allows clients to observe the text enunciated by a screen reader and to simulate user input

Accessible developer experience for the prototype "automation voice" #1

Open s3ththompson opened 3 years ago

s3ththompson commented 3 years ago

The current approach of this AT automation experiment is to create a special "automation voice" that registers as a SAPI 5 voice. Rather than synthesize sound via a text-to-speech engine, the "automation voice" captures the textual content of each vocalization and sends it to a local harness/service, which records the output and uses it to assert whether the vocalization matches a particular string.
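For illustration only, here is a minimal sketch of the text-capture step, assuming the engine implements SAPI's ISpTTSEngine interface; SendToHarness is a hypothetical helper standing in for whatever transport the prototype actually uses.

```cpp
// Minimal sketch: extract the text SAPI hands to a TTS engine so it can be
// forwarded to a local harness instead of being synthesized as audio.
// SPVTEXTFRAG is the linked list of text fragments that SAPI passes to
// ISpTTSEngine::Speak.
#include <sapi.h>
#include <string>

std::wstring CaptureSpokenText(const SPVTEXTFRAG *pTextFragList)
{
    std::wstring captured;
    // Walk the fragment list and concatenate the raw text.
    for (const SPVTEXTFRAG *frag = pTextFragList; frag != nullptr; frag = frag->pNext) {
        captured.append(frag->pTextStart, frag->ulTextLen);
    }
    return captured;
}

// Inside the engine's ISpTTSEngine::Speak implementation, the prototype
// would then (roughly) do:
//
//     SendToHarness(CaptureSpokenText(pTextFragList));  // hypothetical helper
//     return S_OK;  // no audio is written to the output site
```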

The "automation voice" is unfinished in that it does not synthesize any sounds, thus by definition it does not yet provide an accessible developer experience. This issue raises a number of potential approaches for making the "automation voice" accessible.


from @jugglinmike

Potential approaches

Screen reader + screen reader

Run the screen reader under test alongside the user's screen reader of choice

Screen reader + screen reader in VM

Run the screen reader under test inside a virtual machine

Screen reader + plugin to retrieve speech data

Integrate with each screen reader's proprietary interface for discerning what it's vocalizing (this may not be available in every screen reader)

Automation voice + automated toggling

Automatically reconfigure the system screen reader prior to executing tests, and restore the original configuration at the tests' completion; demonstrated by this prototype

Automation voice + ability to vocalize

In theory, this prototype could use an open-source C++ library to enunciate words in addition to providing them as text data to the test runner

Automation voice + forward to built-in voice

In theory, this prototype could use the operating system's built-in voices to enunciate words in addition to providing them as text data to the test runner

Assistiv Labs

Fellow stakeholder Weston Thayer is building a service which maintains web browsers and screen readers internally and allows clients to visit their own web pages using them; assistivlabs.com

Feasibility

No one approach is known to be suitable for all of the screen readers we intend to support. The following matrix documents our current understanding of what's possible (signified by "yes"), what's not possible (signified by "no"), and what is currently unknown (signified by "?"). Subsequent annotations elaborate on these qualifications.

| Approach | NVDA | JAWS | Narrator | VoiceOver |
| --- | --- | --- | --- | --- |
| Screen reader + screen reader | no [1] | no [1] | no [1] | no [1] |
| Screen reader + screen reader in VM | no [2] | no [2] | no [2] | no [2] |
| Screen reader + plugin to retrieve speech data | yes [3] | ? [4] | ? [5] | yes [6] |
| Automation voice + automated toggling | ? [7] | ? [7] | ? [7] | ? [8] |
| Automation voice + ability to vocalize | ? [9] | ? [9] | ? [9] | ? [8] |
| Automation voice + forward to built-in voice | ? [10] | ? [10] | ? [10] | ? [8] |
| Assistiv Labs | ? [11] | ? [11] | ? [11] | ? [12] |
  1. rejected because interactions between screen readers are unpredictable and will potentially interfere with the automation
  2. rejected because the target platforms (Windows, macOS, and iOS) are proprietary, and this introduces legal and technical hurdles to providing virtual machines to all contributors
  3. demonstrated by Simon's prototype
  4. unknown if JAWS exposes speech data
  5. unknown if Narrator exposes speech data
  6. demonstrated by the "auto-vo" project
  7. unknown whether the risk of failing to recover the screen reader is too great
  8. unknown whether third-party voices can be built for macOS/iOS
  9. unknown if support for non-English languages is required
  10. unknown if Windows' built-in voices can be used in this way
  11. unknown if the service's support for outdated versions will satisfy the needs of an ARIA-AT contributor
  12. advertised as "Coming soon"
mfairchild365 commented 3 years ago

@jscholes This is for an issue that you originally surfaced for the automation prototype. I believe your concern was that overriding the configuration of the current screen reader to use a different voice could cause confusion and unintentional issues for end-users if, for example, the tests failed and our automation script was unable to return the screen reader to the previous settings for whatever reason. @jugglinmike put together some potential solutions to this and we discussed them on our call today. Minutes are located at: https://www.w3.org/2021/08/30-aria-at-minutes.html

Our proposal is to move forward with the "automation voice + automated toggling" technique for Windows (and potentially other operating systems). Additionally, we can potentially mitigate issues by warning users ahead of time that screen reader settings will be automatically configured, providing some sort of audible update of progress as tests are executing, providing a mechanism to abort the tests and return to previous settings, and instructing the user how to recover if there is a catastrophic failure and configuration can't be automatically restored.
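For illustration, a minimal sketch of the back-up-and-restore portion of that technique, assuming settings live in a single configuration file; the path is a placeholder, and the screen-reader-specific reconfiguration step is omitted.

```cpp
// Sketch of the "automated toggling" idea: back up the user's screen
// reader settings before reconfiguring them for a test run, then put
// them back afterwards. The file path is a placeholder, not the real
// configuration location of any particular screen reader.
#include <filesystem>
#include <iostream>

namespace fs = std::filesystem;

int main()
{
    const fs::path config = "screen-reader-settings.ini"; // placeholder
    const fs::path backup = "screen-reader-settings.ini.bak";

    try {
        // 1. Preserve the user's current settings.
        fs::copy_file(config, backup, fs::copy_options::overwrite_existing);

        // 2. Switch the screen reader to the automation voice and run the
        //    tests (omitted here).

        // 3. Restore the original settings once the run is over.
        fs::copy_file(backup, config, fs::copy_options::overwrite_existing);
        fs::remove(backup);
    } catch (const fs::filesystem_error &err) {
        // If automatic restoration fails, tell the user where the backup
        // lives so they can recover by hand.
        std::cerr << "Could not restore settings automatically: " << err.what()
                  << "\nA backup remains at " << backup << '\n';
        return 1;
    }
    return 0;
}
```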

We would like your feedback on this approach and the other approaches that are listed.

jugglinmike commented 3 years ago

Just to clarify @jscholes, as @mfairchild365 mentioned, we discussed the merits of "automation voice + automated toggling" from a technical perspective, but we wanted to get your input as well as that of other non-sighted users before determining what was acceptable from an accessible experience design perspective.

In particular, before we proceed, I'd like to clarify two things: whether you are aware of other "Potential approaches" we should consider, and whether you can share more about potential safety concerns with disabling and re-enabling a screen reader.

jscholes commented 3 years ago

I think it's important to determine, and distinguish between, our desired use cases here.

I initially raised accessibility/inclusivity concerns in relation to: a blind user, relying on a screen reader, wanting to contribute to the ARIA-AT project by helping out with development of the automation stack. That would still seem to be the primary thrust of this thread, given its title.

However, some parts of the behaviour described here could be implying additional scope: specifically, testers actually utilising one or more parts of the automation stack while running tests. In previous CG meetings we've discussed the idea of automatically gathering speech output from a screen reader, for example, instead of relying on testers manually gathering and pasting it into the results form.

I think this is an important distinction, because what is acceptable for one audience isn't necessarily feasible for the other:

  1. Rightly or wrongly, targeting developers allows us to infer a certain level of familiarity/comfort with technology, including the prospect of having to manually recover from a screen reader failure. It's fair, I feel, to ask devs to take on that small level of risk, particularly if we provide some sort of escape hatch, e.g. an explicitly separate program to restore previous settings. In the case of NVDA, the automation stack could even just run its own portable copy, so the user's installed one wouldn't need to be impacted.
  2. No matter the level of training we offer testers, though, the same just does not hold true. Not all screen reader users will feel comfortable getting themselves out of a no-speech situation, or even be able to, regardless of how hard they try. This is particularly the case if the screen reader being tested is still running silently and they want to restore another one.

This really leads me onto the two takeaways/questions that I want to end with:

  1. Even if our target audience is developers right now, we need to determine the likelihood of this audience expanding in the future, and when any such expansion is likely to occur. If we know, for instance, that early next year we'll be aiming to work some automation-related bits and pieces into the human tester UX, it is not worth going down one road when we know we'll need to backtrack.
  2. With my developer hat on, I don't know why this approach is considered to be easier than the "Automation voice + forward to built-in voice" one, and I'd love to discuss it in more detail. We're potentially talking about:

    • managing the running state of up to two screen readers (the user's preferred one plus the one under test);
    • some accessible, self-voicing test progress system (which presumably won't support braille);
    • the storage and restoration of a user's screen reader settings; and
    • the writing of recovery instructions, plus supporting users who need to use them.

    This is in comparison to some calls to a built-in SAPI5 voice, which many programs implement without fanfare. I'm sure I'm missing something, which is why I want us to have a discussion about it. The system will only be active for a short time, so it doesn't need to be perfect; it just needs to talk. Rate and such can be configured in the OS settings.

I don't want to unnecessarily block progress here, or suggest that certain things are simple if they are in fact anything but. But I do want us to talk about it, because right now the "Automation voice + forward to built-in voice" row of the table consists entirely of question marks. It seems like we can at least make progress on resolving that, even if we end up flipping some of them to "No". And given the sheer number of programs out there that already output to SAPI, I'd be surprised if we can't flip most of them (macOS aside) to "Yes". The comment for footnote #10 reads:

unknown if Windows' built-in voices can be used in this way

What do we need to do to clear up that unknown? I would be surprised if a SAPI5 engine cannot forward speech onto another one; this is very similar to how the SAPI5 version of ETI Eloquence from CodeFactory works. Granted, they're forwarding speech onto another DLL, not a secondary SAPI5 voice. But as long as we don't feed speech from our own engine back into itself, it should be fine.
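To make that concrete, here is a minimal sketch of the kind of forwarding under discussion, assuming COM is already initialized (as it would be inside a SAPI engine). The `Vendor=Microsoft` filter is purely illustrative; a real implementation would need a robust way to guarantee it never selects the automation voice itself.

```cpp
// Sketch: hand captured text to a built-in SAPI 5 voice so the user still
// hears speech while the automation voice reports the text to the harness.
// The attribute filter is illustrative only; the engine must never select
// itself, or speech would loop back into the capture path.
#include <sapi.h>
#include <sphelper.h>
#include <atlbase.h>
#include <string>

HRESULT SpeakViaBuiltInVoice(const std::wstring &text)
{
    CComPtr<ISpVoice> voice;
    HRESULT hr = voice.CoCreateInstance(CLSID_SpVoice);
    if (FAILED(hr)) return hr;

    // Pick an installed voice other than the automation voice.
    CComPtr<ISpObjectToken> token;
    if (SUCCEEDED(SpFindBestToken(SPCAT_VOICES, L"Vendor=Microsoft", NULL, &token))) {
        voice->SetVoice(token);
    }

    // Speak asynchronously so text capture isn't blocked on audio output.
    return voice->Speak(text.c_str(), SPF_ASYNC, NULL);
}
```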

jscholes commented 3 years ago

CC @sinabahram

mfairchild365 commented 3 years ago

Those are very valid points. @jugglinmike and @s3ththompson - what would it take to research the other options further, specifically forwarding to another voice?

jugglinmike commented 3 years ago

Thanks, @jscholes. You're right that there's a lot of uncertainty here, owing largely to my own lack of experience in the domain of Windows programming. The question marks are intended only to document the edges of our understanding, not to preclude any particular direction. Transparency in that regard is helpful because it's one of many factors which influence how we proceed (and indeed, who it is that does the proceeding).

Another factor is the usability implications of the alternatives. Your expertise is especially helpful in sussing that out, so thank you!

In the time since posting this issue, I've made some headway toward the alternative named "Automation voice + forward to built-in voice." The way I've integrated with Microsoft SAPI is primitive (you can see for yourself on the main branch of this repository), but at least the amount of uncertainty has shrunk. On the Bocoup side of things, we're refining the roadmap for this work, so I'm hoping to continue in this direction.

(edited to remove hard line breaks, sorry about that)

jscholes commented 3 years ago

@jugglinmike This all sounds great. Thank you for continuing to look into such alternatives. Looking forward to further developments/updates!