zcorpan opened 2 years ago
I would like to propose that along with the screen reader text output, the API also provide information about the element, such as the accessible name, aria attributes, and properties (similar to the accessibility views of Chrome and Firefox).
The reason for this is that, without normalizing the text string, I feel it would be difficult to make any assertions about the data, because how screen readers report it varies greatly.
For example, a button can be read as "<button name>, button" (VoiceOver / Safari, JAWS / Edge, JAWS / Chrome) or "button, <button name>" (NVDA / Firefox). There's also additional help text that may or may not be provided, such as "press enter to activate" (depending on screen reader verbosity settings), and state information, such as aria-disabled, which could be output as "dimmed" (VoiceOver / Safari) or "unavailable" (JAWS / Edge, JAWS / Chrome, NVDA / Firefox).
From a testing point of view, it would be difficult to write an assertion to ensure the text is what you expect. A proper assertion would need to understand the nuances between all the different screen readers and choose the expected string that matches. There's also the problem that when a screen reader updates and its output text changes, the assertion could break.
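To make this concrete, here is a rough sketch of what such an assertion tends to look like once per-screen-reader differences are accounted for; the getSpokenOutput helper, the combo keys, and the expected strings are all invented for illustration, not part of any real API:

```ts
// Hypothetical helper that returns the last utterance captured from the
// screen reader under test; names and strings here are illustrative only.
declare function getSpokenOutput(): Promise<string>;

// Expected phrasing differs per screen reader / browser pairing.
const expectedSave: Record<string, string> = {
  "voiceover+safari": "Save, button",
  "jaws+chrome": "Save, button",
  "nvda+firefox": "button, Save",
};

async function assertSaveButtonAnnounced(combo: string): Promise<void> {
  const spoken = await getSpokenOutput();
  const expected = expectedSave[combo];
  // A plain equality check breaks whenever verbosity settings append hint
  // text such as "press enter to activate", so a looser check is needed.
  if (!spoken.startsWith(expected)) {
    throw new Error(`Expected "${expected}" but screen reader said "${spoken}"`);
  }
}
```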
Lastly, based on today's discussion, how would we indicate to the user of the API when a screen reader output is given due to a delayed response, such as an update to an aria-live region? If I wanted to write a test that ensures an aria-live region is updated when clicking a button, how would I be notified when that happens?
@straker Great feedback. The API having some means of providing access to what I'll call "developer info", i.e. the states, names, roles, etc. for an object, is an interesting idea that I think is worth exploring. However:
I don't personally think that this API should offer access to underlying mark-up or attributes, including ARIA. Among my reasons: the information is already available via browser developer tools, and what ultimately matters is the screen reader's internal representation of an object rather than the underlying mark-up.
To expand on this, the internal representation of an object within a screen reader is the most relevant to how that object's semantics are conveyed to users. Ensuring that the use of ARIA and other technologies is translated correctly across API boundaries is an important task, but belongs in the test suites for browsers, UI toolkits and other places where that translation happens.
To address your last question:
... how would we indicate to the user of the API when a screen reader output is given due to a delayed response, such as an update to an aria-live region? If I wanted to write a test that ensures an aria-live region is updated when clicking a button, how would I be notified when that happens?
My understanding is that you would trigger the live region feedback in some way, and then write code to expect speech output within a certain period of time. If the speech feedback didn't arrive within that timeout, you would consider the test to have failed. This is similar to how certain integration tests are written today against browsers.
Having said that, some means of tracking that in a more reliable way may be possible/desirable. As the update originates in the browser, I'm not sure how/what that might look like, but suggestions are welcome.
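For example, assuming an event-emitting client (the client object and its "speech" event below are hypothetical, not part of any published protocol), the expectation-plus-timeout pattern might look something like this:

```ts
import { EventEmitter } from "node:events";

// Hypothetical client that emits a "speech" event for each captured utterance.
declare const client: EventEmitter;

function expectSpeech(matcher: RegExp, timeoutMs = 5000): Promise<string> {
  return new Promise((resolve, reject) => {
    const timer = setTimeout(() => {
      client.off("speech", onSpeech);
      reject(new Error(`No matching speech within ${timeoutMs}ms`));
    }, timeoutMs);

    function onSpeech(text: string) {
      if (matcher.test(text)) {
        clearTimeout(timer);
        client.off("speech", onSpeech);
        resolve(text);
      }
    }

    client.on("speech", onSpeech);
  });
}

// Usage: trigger the live region, then wait for the announcement or fail.
// await clickButtonThatUpdatesLiveRegion(); // test-specific action
// await expectSpeech(/item added to cart/i);
```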
Thanks @straker
the information is already available via browser developer tools; and
But you can't use that to infer anything about the spoken output from this API. For example, a test could repeatedly press the key for "next link" and verify that all links are announced and no other kinds of elements are announced. Without direct access to the role, you'd have to parse the text "link" from the spoken output and assume that it refers to the role and that the word isn't used in the text content. I think the utility here isn't limited to testing browsers.
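As a rough sketch of that kind of test (the pressKey/nextUtterance helpers and the word-matching heuristic are assumptions for illustration, not a real API):

```ts
// Hypothetical helpers: pressKey sends a key to the screen reader,
// nextUtterance resolves with the following captured speech string.
declare function pressKey(key: string): Promise<void>;
declare function nextUtterance(): Promise<string>;

async function allTabStopsAreLinks(count: number): Promise<boolean> {
  for (let i = 0; i < count; i++) {
    await pressKey("k"); // "next link" quick navigation key in several screen readers
    const spoken = await nextUtterance();
    // Without a role field we can only scan the text for the word "link",
    // and hope it isn't part of the link's own text content.
    if (!/\blink\b/i.test(spoken)) {
      return false;
    }
  }
  return true;
}
```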
Would it be possible for screen readers to have additional information in the "spoken output" events to also include things like role and accessible name?
My understanding is that you would trigger the live region feedback in some way, and then write code to expect speech output within a certain period of time.
Indeed.
Would it be possible for screen readers to have additional information in the "spoken output" events to also include things like role and accessible name?
That would be most helpful. My ultimate goal for this type of API is to be able to automate the tests for https://a11ysupport.io/ to determine screen reader / browser support of ARIA in a similar manner as the tests for caniuse.com.
My understanding is that you would trigger the live region feedback in some way, and then write code to expect speech output within a certain period of time
This implies that the output API is event-based (and thus asynchronous). This would make writing tests difficult, since you can't be sure the action you just took produced the text output received. An event-based or asynchronous API makes sense here, so I wonder if we should add something to the output to help determine the action that triggered it? Essentially, some way to be sure that you're looking at the output that was generated by the action (be that a keypress, state change, or live region event).
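One purely hypothetical shape for that, just to illustrate the idea (nothing like this exists in the protocol), would be an output event carrying an identifier for the request believed to have triggered it:

```ts
// Purely illustrative event shape; not part of any published protocol.
interface SpokenOutputEvent {
  text: string;
  // Identifier of the command/request believed to have triggered this
  // utterance, if the implementation can establish that link at all.
  sourceRequestId?: string;
  timestamp: number;
}
```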
It's indeed async and output would be events in the protocol.
I don't know if screen readers are able to keep track of cause and effect between pressing a key (or other action) and utterances.
@zcorpan
Would it be possible for screen readers to have additional information in the "spoken output" events to also include things like role and accessible name?
I want to reiterate a distinction here between ARIA, the browser's accessibility tree, and the internal representation of objects within a screen reader. The first two are available via browsers and other tools, the third is not, and I think that exposing internal screen reader representation of objects via this API is an interesting and potentially useful proposal. That representation would include name, role, states and similar.
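As a sketch of what such a representation might contain (the shape and field names below are invented for illustration, not proposal text):

```ts
// Illustrative only: a possible response for a "describe current object"
// style query, distinct from any speech event.
interface ScreenReaderObjectInfo {
  name: string;     // e.g. "Save"
  role: string;     // e.g. "button", as the screen reader models it
  states: string[]; // e.g. ["focused", "unavailable"]
  description?: string;
}
```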
However, I don't believe it relates to spoken output in any way, for the reasons already stated. Screen readers track a large amount of information about objects that they don't speak, or speak in a particular way, depending on settings, verbosity levels, internal use of that information, etc. Therefore, to explore this further, I suggest a new issue. Annotating speech output with programmatic information is a tough ask and is, at least for NVDA, not how the screen reader works; establishing those connections would require speech parsing in itself. Then, later on, we'd just have to repeat the exercise for braille.
I don't know if screen readers are able to keep track of cause and effect between pressing a key (or other action) and utterances.
I won't speak for all screen readers, but to my knowledge, NVDA does not do this. And I'm not sure how it would, because live notification feedback originates in a completely different application and its delivery is asynchronous. Meanwhile, although the screen reader handles the initial keyboard input, the same can't be said for mouse and other means of triggering a live region, nor can the screen reader be sure that the speech following the trigger actually has anything to do with the live region itself. E.g. a Windows notification could come in before the feedback.
As such, converting these async delivery processes to a synchronised pattern is the job of the calling code should it need to, not the API spec. To see this sort of setup in action, you can try Playwright or similar, which heavily relies on an expectation-based pattern and timeouts.
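For comparison, this is roughly what that expectation-plus-timeout pattern looks like in Playwright today when asserting against the DOM (the page URL and strings are placeholders):

```ts
import { test, expect } from "@playwright/test";

test("status region updates after clicking Save", async ({ page }) => {
  await page.goto("https://example.com/form"); // placeholder URL
  await page.getByRole("button", { name: "Save" }).click();
  // The assertion retries until it passes or the timeout elapses.
  await expect(page.getByRole("status")).toHaveText("Saved", { timeout: 5000 });
});
```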
However, I don't believe it relates to spoken output in any way, for the reasons already stated. Screen readers track a large amount of information about objects that they don't speak, or speak in a particular way, depending on settings, verbosity levels, internal use of that information, etc. Therefore, to explore this further, I suggest a new issue. Annotating speech output with programmatic information is a tough ask and is, at least for NVDA, not how the screen reader works; establishing those connections would require speech parsing in itself. Then, later on, we'd just have to repeat the exercise for braille.
That's a fair point. For the purposes of the use cases I have in mind (writing asserts about what the screen reader supports, making sure patterns are accessible to a screen reader), parsing the text output is not necessary (or useful).
As such, converting these async delivery processes to a synchronised pattern is the job of the calling code should it need to, not the API spec. To see this sort of setup in action, you can try Playwright or similar, which heavily relies on an expectation-based pattern and timeouts.
Could you expound upon this further? My understanding of Playwright is that the API is based on promises that resolve when the action completes. So selecting a button and pressing enter on it will each resolve a Promise when the action finishes. If the action returns data, the data is the result of the desired action (so the <button> element if trying to select a button). That same concept doesn't seem to apply to the screen reader output, due to out-of-band events.
I guess I'm trying to envision how you see the API being used.
@straker An operation like "select a button" is a good example of an operation with an unambiguous result--the button element. With a clear cause and effect, it's only natural that the protocol would strongly relate the request with the result. On the other hand, the "result" of operations that model user input is harder to pin down.
For example, the operation "press a key" might trigger a navigation, it might change text in an ARIA "live" region, it might do something else entirely, or it might do nothing at all. The effect (if there was one) wouldn't necessarily be synchronous from the browser's perspective, since processing might involve non-blocking calls like setTimeout or fetch. Recognizing when the key press is "done" therefore requires domain knowledge--an understanding of the application under test.
That's probably why CDP (the protocol which powers Playwright) exposes a generic "dispatchMouseEvent" method with a more limited expectation: "Dispatches a mouse event to the page." Playwright's click API layers on some semantics which wait for navigation, but it doesn't make any guarantees about changes to the document itself.
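To illustrate how low-level that CDP method is, here is a minimal sketch using chrome-remote-interface (the coordinates and connection setup are arbitrary placeholders):

```ts
import CDP from "chrome-remote-interface";

// Dispatch a raw click via CDP: the protocol only promises to deliver the
// events, not to tell you what the page did in response.
async function rawClick(x: number, y: number): Promise<void> {
  const client = await CDP();
  try {
    await client.Input.dispatchMouseEvent({
      type: "mousePressed", x, y, button: "left", clickCount: 1,
    });
    await client.Input.dispatchMouseEvent({
      type: "mouseReleased", x, y, button: "left", clickCount: 1,
    });
  } finally {
    await client.close();
  }
}
```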
What information should the API expose for "spoken output"?
The text string seems obvious, but is not all that a screen reader can send to the TTS.
For Microsoft Speech API, there's an XML format for changing volume, rate, timing, and so on for the text: https://docs.microsoft.com/en-us/previous-versions/windows/desktop/ee431815(v=vs.85)
And there are flags when creating a speak call: https://docs.microsoft.com/en-us/previous-versions/windows/desktop/ee431843(v=vs.85)
I'm not sure what VoiceOver does, but I found this documentation for AVFoundation which might be relevant: https://developer.apple.com/documentation/avfaudio/avspeechutterance
For ARIA-AT, currently we're only checking the text. But clearly the TTS APIs support more nuance than only text. The question is, what should we expose as "the captured output" in AT Driver?
I think a reasonable starting point would be the text only, but allow for vendor-specific extensions for more information.
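As a sketch of that starting point (field names are placeholders, not proposed spec text):

```ts
// Illustrative shape for a captured "spoken output" event: plain text plus an
// open-ended bag for vendor-specific details (SAPI XML, utterance settings, ...).
interface CapturedOutput {
  text: string;
  vendor?: {
    [vendorName: string]: unknown;
  };
}
```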