w3c / at-driver

AT Driver defines a protocol for introspection and remote control of assistive technology software, using a bidirectional communication channel.
https://w3c.github.io/at-driver

Capture spoken output: what to include #24

Open zcorpan opened 2 years ago

zcorpan commented 2 years ago

What information should the API expose for "spoken output"?

The text string seems obvious, but it is not all that a screen reader can send to the TTS.

For Microsoft Speech API, there's an XML format for changing volume, rate, timing, and so on for the text: https://docs.microsoft.com/en-us/previous-versions/windows/desktop/ee431815(v=vs.85)

And there are flags when creating a speak call: https://docs.microsoft.com/en-us/previous-versions/windows/desktop/ee431843(v=vs.85)

I'm not sure what VoiceOver does, but I found this documentation for AVFoundation which might be relevant: https://developer.apple.com/documentation/avfaudio/avspeechutterance

For ARIA-AT, currently we're only checking the text. But clearly the TTS APIs support more nuance than only text. The question is, what should we expose as "the captured output" in AT Driver?

I think a reasonable starting point would be the text only, but allow for vendor-specific extensions for more information.
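To make that concrete, here is a minimal sketch of what a text-only speech event with room for vendor extensions might look like. The event name, field names, and extension keys are all illustrative assumptions, not part of any spec.

```typescript
// Hypothetical shape of a captured speech event: text only by default, plus an
// open-ended bag for vendor-specific extras (SAPI XML attributes,
// AVSpeechUtterance parameters, and so on). All names here are illustrative.
interface SpeechCapturedEvent {
  method: "interaction.capturedOutput"; // hypothetical event name
  params: {
    data: string;                       // the spoken text
    vendor?: Record<string, unknown>;   // vendor-specific extension data
  };
}

// Example payload a client might receive over the protocol's WebSocket:
const example: SpeechCapturedEvent = {
  method: "interaction.capturedOutput",
  params: {
    data: "Save, button",
    vendor: { "vendorX:rate": 80 },     // hypothetical extension key
  },
};
```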

straker commented 1 year ago

I would like to propose that along with the screen reader text output, the API also provide information about the element, such as the accessible name, aria attributes, and properties (similar to the accessibility views of Chrome and Firefox).

The reason for this is that, without normalizing the text string, I feel it would be difficult to make any assertions about the data, because how screen readers report it can vary greatly.

For example, a button can be read as <button name>, button (VoiceOver / Safari, JAWS / Edge, JAWS / Chrome) or button, <button name> (NVDA / Firefox). There's also the additional help description text that may or may not be provided, such as press enter to activate (depending on screen reader verbosity settings), or the addition of state, such as using aria-disabled, which could output as dimmed (VoiceOver / Safari) or unavailable (JAWS / Edge, JAWS / Chrome, NVDA / Firefox).

From a testing point of view, it would be difficult to write an assertion to ensure the text is what you expect. A proper assertion would need to understand the nuances between all the different screen readers and choose the assertion that matches. There's also the problem that when a screen reader updates and the text changes, the assertion could break.
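As a sketch of the problem (against a hypothetical lastSpokenOutput() helper; none of these names come from the spec), a text-only assertion ends up enumerating per-screen-reader phrasings:

```typescript
// Hypothetical helper that resolves with the most recent spoken output;
// not part of any spec, purely for illustration.
declare function lastSpokenOutput(): Promise<string>;

// With text-only capture, the assertion has to enumerate the phrasing of
// every screen reader / browser pairing it might run against.
const expectedButtonAnnouncement: Record<string, RegExp> = {
  "voiceover-safari": /^Save, button/, // "<button name>, button"
  "jaws-chrome": /^Save, button/,
  "nvda-firefox": /^button, Save/,     // "button, <button name>"
};

async function assertButtonAnnounced(screenReader: string): Promise<void> {
  const spoken = await lastSpokenOutput();
  const pattern = expectedButtonAnnouncement[screenReader];
  if (!pattern.test(spoken)) {
    throw new Error(`Unexpected announcement for ${screenReader}: "${spoken}"`);
  }
}
```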

Lastly, based on today's discussion, how would we indicate to the user of the API when a screen reader output is given due to a delayed response, such as an update to an aria-live region? If I wanted to write a test that ensures an aria-live region is updated when clicking a button, how would I be notified when that happens?

jscholes commented 1 year ago

@straker Great feedback. The API having some means of providing access to what I'll call "developer info", i.e. the states, names, roles, etc. for an object, is an interesting idea that I think is worth exploring. However:

  1. For the purposes of the ARIA-AT project at least, access to that information cannot act as a replacement for parsed speech output. Specifically, this is because:
    • There may be cases where the screen reader is aware of one or more aspects of an object, but doesn't convey them to the user for some reason, and we need to catch that. If a user cannot perceive something important, then the fact that the screen reader is tracking it internally is of no consequence to them.
    • You're right that speech output patterns can and will change between screen reader versions, but this is actually a benefit for the project. Without such changes being flagged by the automated system, we will have no way of tracking screen reader behaviour and/or support that alters over time, and ARIA-AT results need to reflect such movement based on speech output parsing.
  2. I don't personally think that this API should offer access to underlying mark-up or attributes, including ARIA. My reasons being that:

    • the information is already available via browser developer tools; and
    • this automation spec relates to screen reader testing in general, not just for the web.

    To expand on this, the internal representation of an object within a screen reader is the most relevant to how that object's semantics are conveyed to users. Ensuring that the use of ARIA and other technologies is translated correctly across API boundaries is an important task, but belongs in the test suites for browsers, UI toolkits and other places where that translation happens.

To address your last question:

... how would we indicate to the user of the API when a screen reader output is given due to a delayed response, such as an update to an aria-live region? If I wanted to write a test that ensures an aria-live region is updated when clicking a button, how would I be notified when that happens?

My understanding is that you would trigger the live region feedback in some way, and then write code to expect speech output within a certain period of time. If the speech feedback didn't arrive within that timeout, you would consider the test to have failed. This is similar to how certain integration tests are written today against browsers.

Having said that, some means of tracking that in a more reliable way may be possible/desirable. As the update originates in the browser, I'm not sure how/what that might look like, but suggestions are welcome.
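For illustration, here is roughly what that expectation-plus-timeout pattern could look like against a hypothetical AT Driver client; the client shape and event name are assumptions, not part of the spec.

```typescript
// Hypothetical AT Driver client that surfaces captured speech as events.
// The interface and event name are assumptions for the sake of illustration.
interface ATDriverClient {
  on(event: "speech", listener: (text: string) => void): void;
  off(event: "speech", listener: (text: string) => void): void;
}

// Resolve when speech containing `expected` arrives, reject after a timeout.
function expectSpeech(
  client: ATDriverClient,
  expected: string,
  timeoutMs = 5000,
): Promise<string> {
  return new Promise((resolve, reject) => {
    const timer = setTimeout(() => {
      client.off("speech", onSpeech);
      reject(new Error(`No speech containing "${expected}" within ${timeoutMs}ms`));
    }, timeoutMs);

    function onSpeech(text: string) {
      if (text.includes(expected)) {
        clearTimeout(timer);
        client.off("speech", onSpeech);
        resolve(text);
      }
    }

    client.on("speech", onSpeech);
  });
}

// Usage: trigger the live region (e.g. activate the button through other
// automation), then wait for the announcement or fail the test:
//   await expectSpeech(client, "3 results found");
```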

zcorpan commented 1 year ago

Thanks @straker

the information is already available via browser developer tools; and

But you can't use that to infer anything about the spoken output from this API. For example, a test could repeatedly press the key for "next link" and verify that all links are announced and no other kinds of elements are announced. Without direct access to the role, you'd have to parse the text "link" from the spoken output and assume both that it refers to the role and that the word isn't used in the text content. I think the utility here isn't limited to testing browsers.
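A small sketch of the difference, using an invented event shape (the role field is hypothetical):

```typescript
// With text only, the test has to guess that the word "link" refers to the role.
function looksLikeLinkFromText(spoken: string): boolean {
  // Fragile: breaks if the link's own text contains the word "link",
  // or if a screen reader phrases the role differently.
  return /\blink\b/i.test(spoken);
}

// With a role field on the captured event (hypothetical), the check is direct:
function isLink(event: { data: string; role?: string }): boolean {
  return event.role === "link";
}
```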

Would it be possible for screen readers to have additional information in the "spoken output" events to also include things like role and accessible name?

My understanding is that you would trigger the live region feedback in some way, and then write code to expect speech output within a certain period of time.

Indeed.

straker commented 1 year ago

Would it be possible for screen readers to have additional information in the "spoken output" events to also include things like role and accessible name?

That would be most helpful. My ultimate goal for this type of API is to be able to automate the tests for https://a11ysupport.io/ to determine screen reader / browser support of ARIA, in a similar manner to the tests for caniuse.com.

My understanding is that you would trigger the live region feedback in some way, and then write code to expect speech output within a certain period of time

This implies that the output API is event based (and thus asynchronous), which would make writing tests difficult since you can't be sure the action you just took produced the text output you received. An event-based, asynchronous design makes sense for the API, so I wonder if we should add something to the output to help determine the action that triggered it? Essentially, some way to be sure that you're looking at the output generated by that action (be that a keypress, state change, or live event).
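A rough sketch of that idea, assuming commands return an id and captured-output events could echo it back; these field names are invented for illustration, and no screen reader currently promises this behaviour.

```typescript
// Hypothetical: each command returns an id, and captured-output events echo
// the id of the action that (the screen reader believes) triggered them.
interface CommandResult {
  commandId: string;
}

interface CapturedOutputEvent {
  data: string;              // spoken text
  sourceCommandId?: string;  // correlates back to the triggering command, if known
}

function isResponseTo(event: CapturedOutputEvent, result: CommandResult): boolean {
  return event.sourceCommandId === result.commandId;
}
```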

zcorpan commented 1 year ago

It's indeed async and output would be events in the protocol.

I don't know if screen readers are able to keep track of cause and effect between pressing a key (or other action) and utterances.

jscholes commented 1 year ago

@zcorpan

Would it be possible for screen readers to have additional information in the "spoken output" events to also include things like role and accessible name?

I want to reiterate a distinction here between ARIA, the browser's accessibility tree, and the internal representation of objects within a screen reader. The first two are available via browsers and other tools, the third is not, and I think that exposing internal screen reader representation of objects via this API is an interesting and potentially useful proposal. That representation would include name, role, states and similar.

However, I don't believe it relates to spoken output in any way, for the reasons already stated. Screen readers track a large amount of information about objects that they don't speak, or speak in a particular way, depending on settings, verbosity levels, internal use of that information, etc. Therefore, to explore this further, I suggest a new issue. Annotating speech output with programmatic information is a tough ask and is, at least for NVDA, not how the screen reader works; establishing those connections would require speech parsing in itself. Then, later on, we'd just have to repeat the exercise for braille.

I don't know if screen readers are able to keep track of cause and effect between pressing a key (or other action) and utterances.

I won't speak for all screen readers, but to my knowledge, NVDA does not do this. And I'm not sure how it would, because live notification feedback originates in a completely different application and its delivery is asynchronous. Meanwhile, although the screen reader handles the initial keyboard input, the same can't be said for the mouse and other means of triggering a live region, nor can the screen reader be sure that the speech following the trigger actually has anything to do with the live region itself. E.g. a Windows notification could come in before the feedback.

As such, converting these async delivery processes into a synchronised pattern is the job of the calling code, should it need to do so, not of the API spec. To see this sort of setup in action, you can try Playwright or similar, which relies heavily on an expectation-based pattern and timeouts.
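For reference, the Playwright pattern being referred to looks roughly like this; the page URL, selectors, and expected text are placeholders.

```typescript
import { test, expect } from "@playwright/test";

test("live region updates after clicking the button", async ({ page }) => {
  await page.goto("https://example.com/"); // placeholder URL

  await page.getByRole("button", { name: "Save" }).click();

  // Expectation-based pattern: Playwright retries the assertion until it
  // passes or the timeout elapses, at which point the test fails.
  await expect(page.locator("[role=status]")).toHaveText("Saved", {
    timeout: 5000,
  });
});
```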

straker commented 1 year ago

However, I don't believe it relates to spoken output in any way, for the reasons already stated. Screen readers track a large amount of information about objects that they don't speak, or speak in a particular way, depending on settings, verbosity levels, internal use of that information, etc. Therefore, to explore this further, I suggest a new issue. Annotating speech output with programmatic information is a tough ask and is, at least for NVDA, not how the screen reader works; establishing those connections would require speech parsing in itself. Then, later on, we'd just have to repeat the exercise for braille.

That's a fair point. For the purposes of the use cases I have in mind (writing asserts about what the screen reader supports, making sure patterns are accessible to a screen reader), parsing the text output is not necessary (or useful).

As such, converting these async delivery processes to a synchronised pattern is the job of the calling code should it need to, not the API spec. To see this sort of setup in action, you can try Playwright or similar, which heavily relies on an expectation-based pattern and timeouts.

Could you expound upon this further? My understanding of Playwright is that the API is based on promises that return when the event is completed. So selecting a button and pressing enter on it will each resolve a Promise when the action finishes. If the action returns data, the data is the result of the desired action (so the <button> element if trying to select a button). That same concept doesn't seem to apply to the screen reader output due to out-of-band events.

I guess I'm trying to envision how you see the API being used.

jugglinmike commented 1 year ago

@straker An operation like "select a button" is a good example of an operation with an unambiguous result--the button element. With a clear cause and effect, it's only natural that the protocol would strongly relate the request to the result. On the other hand, the "result" of operations that model user input is harder to pin down.

For example, the operation "press a key" might trigger a navigation, it might change text in an ARIA "live" region, it might do something else entirely, or it might do nothing at all. The effect (if there was one) wouldn't necessarily be synchronous from the browser's perspective since processing might involve non-blocking calls like setTimeout or fetch. Recognizing when the key press is "done" therefore requires domain knowledge--an understanding of the application under test.

That's probably why CDP (the protocol which powers Playwright) exposes a generic "dispatchMouseEvent" method with a more limited expectation: "Dispatches a mouse event to the page." Playwright's click API layers on some semantics which wait for navigation, but it doesn't make any guarantees about changes to the document itself.
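For reference, a rough sketch of dispatching a raw click through CDP (here via a Playwright CDP session, Chromium only; the URL and coordinates are placeholders). The protocol delivers the event and makes no guarantees about its effects:

```typescript
import { chromium } from "playwright";

(async () => {
  const browser = await chromium.launch();
  const context = await browser.newContext();
  const page = await context.newPage();
  await page.goto("https://example.com/"); // placeholder URL

  // Raw CDP: "Dispatches a mouse event to the page" -- and nothing more.
  const cdp = await context.newCDPSession(page);
  await cdp.send("Input.dispatchMouseEvent", {
    type: "mousePressed", x: 100, y: 100, button: "left", clickCount: 1,
  });
  await cdp.send("Input.dispatchMouseEvent", {
    type: "mouseReleased", x: 100, y: 100, button: "left", clickCount: 1,
  });

  // Whether that click navigated, updated a live region, or did nothing at
  // all is left for the calling code to determine with its own expectations.
  await browser.close();
})();
```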