w3c / csswg-drafts

CSS Working Group Editor Drafts
https://drafts.csswg.org/

[css-speech-1] speech media type mutually exclusive with most assistive technology, including screenreaders and screen magnifiers with speech #4868

Open cookiecrook opened 4 years ago

cookiecrook commented 4 years ago


The speech media type is mutually exclusive with the screen media type. Most assistive technology relies on screen media (hence the "screen" in "screenreader"), so it's inappropriate to republish this older document in today's context. If there is sufficient implementor interest, the concepts could be incubated, perhaps as media features.

cookiecrook commented 4 years ago

For example: several of the properties (e.g. pause-before) should not apply in contexts where an AT user—such as a blind screen reader user, or a speak-on-hover Zoom user—navigates directly to a particular element. The user expectation in those contexts is that speech is generated immediately. Otherwise, it will seem to the user that their AT or the page has hung.
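To make the failure mode concrete, here is a minimal sketch (the selector and pause value are made up for illustration):

```css
/* Illustrative only: a fixed pause before headings. When an AT user
   navigates directly to the <h2>, two seconds of silence precede any
   speech, which can make the AT or the page appear to have hung. */
h2 {
  pause-before: 2s;
}
```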

However, pause-before could potentially be applied via something like a linear-audio media feature. The contexts where this could be used include the original DAISY case (e.g. a generated audiobook from a web resource) and certain assistive technology modes (e.g. most screen readers have a "read me everything linearly" mode)…
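As a sketch of that idea (hypothetical only: linear-audio is not a defined media feature):

```css
/* Hypothetical media feature, not part of any spec: scope the pause
   to linear (audiobook-style) rendering so it never delays speech
   during direct navigation. */
@media (linear-audio) {
  h2 {
    pause-before: strong;
  }
}
```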

There are a few properties, such as speak/speak-as, that may be generally applicable to speech in all contexts, but platform API support for this concept is lacking, and it is potentially unimplementable (in the short term) by rendering engines. As an example, one way speak-as: literal-punctuation; was implemented in the past was to replace punctuation with words in the platform Accessibility APIs: e.g. the label "one, two, three" became "one comma two comma three." This was a workaround for a lack of platform API, and the implementation resulted in broken braille. The punctuation comma (⠂) was represented as the full expansion of the word 'comma' (⠉⠕⠍⠍⠁), thereby filling up braille displays where characters are at a premium. I am fairly certain these platform API implementability limitations still exist on several platforms.
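For reference, the css-speech-1 syntax under discussion looks like this:

```css
/* css-speech-1: request that punctuation be spoken literally
   ("comma", "period", ...) rather than rendered as natural pauses. */
code, samp {
  speak-as: literal-punctuation;
}
```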

cookiecrook commented 4 years ago

Sometimes I've heard people assume "screen" media would not apply to blind screen reader users, so I'd like to dispel that myth.

Screen readers (the software) rely on screen layout. Most screen reader users (the people) have some sight, and rely on visual screen layout and other display properties. According to WHO data, approximately four-fifths of the world's "blind" population are not completely blind. Many "low vision" or "legally blind" zoom users use speech as well (e.g. speak-on-hover). Even completely blind screen reader users use the screen layout properties; for example, VoiceOver on iOS retains useful spatial layout data on the touch screen. Even if users can't see the screen, they can retain and make use of a spatial orientation of its layout.

frivoal commented 4 years ago

Media types in general have proven to be a bad idea, due to their exclusive nature, and indeed media features have proven to be a much more robust approach. And you're absolutely right that the screen media type is the right one for screen readers (there is a note to that effect in the Media Queries Level 4 spec).

Reconsidering which parts of css-speech are meant to apply in which context makes a lot of sense to me. That said, I am not sure these contexts are adequately described as media features. Your example of linear-audio seems to show why. As you said, it'd match either in the DAISY use case, or when a screen reader is instructed to "read me everything". In the latter case, exposing that as a media feature would be problematic: arbitrary styling can be applied based on a media feature matching or not, and it seems very unlikely that offering the ability to restyle arbitrary parts of the web page while a screen reader is instructed to read everything is anything but a foot-gun. Nonetheless, regardless of whether it is exposed through media queries, whether we're rendering something as linear audio or not is an important distinction, so there should be a definition somewhere, and various aspects of the behavior likely depend on it.

cookiecrook commented 4 years ago

So we agree media types are bad, and media features may not be right here. How would you recommend making that contextual distinction?

frivoal commented 4 years ago

Somewhere early in the spec (right after the intro?), have a section that defines the (two?) modes and gives them a name. Later in the spec, we can then say things like "this property has no effect in linear mode", or "when in navigation mode, this value instead causes blah to happen".

This certainly would need to be refined (and bikeshedded), but just to give an idea, here's a rough draft of what this early section could look like. It would replace what is currently in section 2 (though some bits about the history of this document / feature may be worth salvaging).


Speech modes

When rendering web content via speech synthesis, User Agents may behave in one of two distinct ways, resulting in significantly different user experiences, each appropriate for different circumstances. The following two speech modes are defined:

linear mode
    The User Agent reads the content continuously, in document order, from a given starting point until the end or until stopped by the user.

navigation mode
    The user moves through the document interactively, and the User Agent speaks each element as the user navigates to it.

Screen readers primarily work in navigation mode, although some may offer the user a way to request that all or part of the content be read in linear mode. In contrast, User Agents offering an audiobook experience generated from text work only or primarily in linear mode. Similarly, when instructed to read a document, or when presenting information extracted from a web resource, virtual assistants (such as Amazon Alexa, Apple's Siri, or Google Assistant) operate in linear mode.

Note: Earlier in the development of the web platform, there was an attempt to categorize the various rendering media into several mutually exclusive media types, one of which was speech, as opposed to screen. Experience has proved this approach to be problematic, in part because the assumption that they were mutually exclusive turned out to be unfounded, and the idea of media types is being phased out (see MEDIAQUERIES-4 § 2.3). For instance, User Agents with assistive technologies typically render a document both to screen and via speech at the same time. Those focused on a linear-mode, audiobook-like experience may also display the content that they are reading. Thus, the speech mode being used to render (part of) a document and the media type are independent concepts, and one cannot be inferred from the other.

Note: Speech modes are not exposed as media features, as there is no expectation that authors should be able to arbitrarily restyle a document based on which mode is used, any more than they can arbitrarily restyle a document based on whether a user looking at a visual rendering is reading it top to bottom or skimming / scrolling through the content.


This doesn't tell us which bit of the spec needs to behave in what way according to which mode we're in—that's left as an exercise for later. But it establishes terminology that lets us draw these distinctions where needed.

Here's a (possibly misguided) example:

In navigation mode, if an element is being read directly (rather than due to being a descendant of an element being read), and its cue-before is none, then the value of its rest-before property must be ignored and treated as if it were none.
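For concreteness, a sketch of styles this rule would affect (the selector is made up; the property values are per css-speech-1):

```css
/* Under the proposed rule: in navigation mode, when this element is
   read directly, rest-before would be ignored (treated as none)
   because cue-before is none (its initial value). */
li.chapter-start {
  rest-before: strong;
  /* cue-before: none;  (initial value, shown for clarity) */
}
```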

cookiecrook commented 4 years ago

I like the start, but 1) screen readers have a lot of "modes", and 2) the definitions shouldn't be limited to screen readers. Copying in @LJWatson since she expressed an interest in editing the spec.