Best practice document: Extracting data for TTS and a "reader mode"

HadrienGardeur commented 5 months ago

Text-to-speech (TTS) is among the most popular features in reading apps and is slowly becoming a must-have feature in Web browsers as well.

But despite the popularity and usefulness of TTS, there is no best practice document providing guidance for developers on how they should implement this feature. The group working on accessibility for fixed-layout (FXL) publications has also identified that, in addition to TTS, text extracted from an FXL resource could be used to provide a "reader mode" of the current page/spread, enabling users to adjust the text and layout to their needs.

For both TTS and a reader mode, reading systems need guidance about the way they should extract data from XHTML to build these alternate renderings:

- using accessibility metadata to infer what might be possible (accessModeSufficient, readingOrder, alternativeText, longDescription)
- walking the DOM to create an alternate tree-like structure
- rules to extract context (language for example) and semantics (HTML and ARIA) that will be relevant for these alternate renderings
- recommendations for either breaking down longer text into multiple utterances (a paragraph broken down into sentences) or merging multiple text nodes to re-create a full utterance (a single sentence but divided into multiple strings in an FXL resource) that will be passed to the TTS engine
- skippability and escapability rules
- building a reader mode view from that tree-like structure
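
As a starting point for discussion, the DOM walk, language inheritance, and utterance merging/splitting described above could be sketched roughly as follows. This is a minimal illustration using Python's standard `xml.etree.ElementTree`, not a proposed algorithm: the `BLOCKS` set, the function names, and the naive sentence splitter are all assumptions made for the example, and nested block elements are not handled separately.

```python
import re
import xml.etree.ElementTree as ET

XHTML = "{http://www.w3.org/1999/xhtml}"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"
# Assumption: each of these elements is treated as one utterance container.
BLOCKS = {"p", "h1", "h2", "h3", "h4", "h5", "h6", "li", "figcaption"}


def local(tag):
    """Strip the XHTML namespace from an element tag."""
    return tag[len(XHTML):] if tag.startswith(XHTML) else tag


def extract_blocks(elem, lang=None):
    """Walk the tree in document order, yielding (tag, lang, text) per block."""
    # Language is inherited from the nearest ancestor that declares it.
    lang = elem.get(XML_LANG) or elem.get("lang") or lang
    name = local(elem.tag)
    if name in BLOCKS:
        # Merge all descendant text nodes and collapse whitespace, so that a
        # sentence divided across several spans becomes one string again.
        # Limitation: spans with no whitespace between them are joined
        # without a space; FXL content may need position-based heuristics.
        text = " ".join("".join(elem.itertext()).split())
        if text:
            yield (name, lang, text)
        return
    for child in elem:
        yield from extract_blocks(child, lang)


def split_sentences(text):
    """Naive splitter (assumption): break after ., ! or ? plus whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def utterances(xhtml):
    """Return a flat list of (tag, lang, sentence) tuples for a TTS engine."""
    root = ET.fromstring(xhtml)
    return [
        (tag, lang, sentence)
        for tag, lang, text in extract_blocks(root)
        for sentence in split_sentences(text)
    ]


SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
  <body>
    <h1>Chapter One</h1>
    <p><span>It was a dark</span> <span>and stormy night.</span> <span>The rain fell.</span></p>
    <p xml:lang="fr">Bonjour le monde.</p>
  </body>
</html>"""

if __name__ == "__main__":
    for item in utterances(SAMPLE):
        print(item)
```

The same list of tuples could feed both a TTS queue and a simple reader mode view, which is the "single source" idea this issue is about. A real implementation would need a language-aware sentence segmenter rather than a regular expression.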

sueneu commented 5 months ago

I agree. Building a Reader Mode view from TTS would be an efficient way to give the user choices for accessing the content of a book. A single source would mean consistency between audio mode and visual mode. Using the same code for Reader Mode and TTS would reduce redundant work in EPUB production.

A best practice document would be helpful even if TTS doesn't ultimately work out as a basis for Reader Mode. Improved and consistent TTS among reading systems would lower the expense of making an accessible ebook. Publishers who can't create audio overlays could rely on robust TTS to make compliant EPUBs. End users who require smaller EPUB files would benefit from an audio option without media overlays. And anecdotally, few publishers and users are satisfied with the current TTS experience.

wareid commented 5 months ago

Research to do/Questions to ask:

- How do you break things down using the DOM/HTML elements (span, div), particularly non-semantic elements?
- What non-textual content is extracted (alt text, roles)?
- What kind of semantic structure is extracted, and how is it used?
- Could this extracted version be used as a remediation/assessment tool?
- How is MathML handled?
- Skippability/Escapability/Personalization? (How do we handle the potential elements needing to be skipped/escaped/included in user settings?)
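
To make a couple of these questions concrete, here is a rough sketch of how alt text and skippable semantics might be handled during extraction. It uses Python's standard `xml.etree.ElementTree`; the `SKIPPABLE` set and the function names are hypothetical, and which `epub:type` values or ARIA roles a user can skip is a policy assumption for the example, not something taken from a specification. MathML and escapability are not covered.

```python
import xml.etree.ElementTree as ET

XHTML = "{http://www.w3.org/1999/xhtml}"
EPUB_TYPE = "{http://www.idpf.org/2007/ops}type"
# Assumption: a user setting decides which of these semantics are skipped.
SKIPPABLE = frozenset({
    "pagebreak", "doc-pagebreak",
    "footnote", "doc-footnote",
    "noteref", "doc-noteref",
})


def semantics(elem):
    """Collect the epub:type and ARIA role tokens declared on an element."""
    return set(elem.get(EPUB_TYPE, "").split() + elem.get("role", "").split())


def extract(elem, skip=SKIPPABLE):
    """Yield (kind, text) items in document order, omitting skipped subtrees."""
    if semantics(elem) & skip:
        return
    name = elem.tag[len(XHTML):] if elem.tag.startswith(XHTML) else elem.tag
    if name == "img":
        alt = (elem.get("alt") or "").strip()
        if alt:  # an empty alt marks a decorative image: nothing to announce
            yield ("image", alt)
        return
    if elem.text and elem.text.strip():
        yield ("text", " ".join(elem.text.split()))
    for child in elem:
        yield from extract(child, skip)
        # Text following a child belongs to the parent, so it is kept even
        # when the child itself was skipped.
        if child.tail and child.tail.strip():
            yield ("text", " ".join(child.tail.split()))


SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml" xmlns:epub="http://www.idpf.org/2007/ops">
  <body>
    <p>See the chart<a epub:type="noteref" href="#n1">1</a> below.</p>
    <img src="chart.png" alt="Sales by year"/>
    <img src="rule.png" alt=""/>
    <span role="doc-pagebreak" aria-label="12"/>
    <aside id="n1" epub:type="footnote"><p>Source: annual report.</p></aside>
  </body>
</html>"""

if __name__ == "__main__":
    root = ET.fromstring(SAMPLE)
    print(list(extract(root)))                    # default skipping
    print(list(extract(root, skip=frozenset())))  # nothing skipped
```

Passing a different `skip` set is one way personalization could be wired to user settings: the extraction logic stays the same and only the policy changes.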

cookiecrook commented 5 months ago

There is also overlap with the CSS algorithm for converting to plaintext: https://www.w3.org/TR/css-text-4/#plaintext

cookiecrook commented 5 months ago

And work in ARIA/AccName...

HadrienGardeur commented 5 months ago

VitalSource seems to have a two-fold approach with a simplified and a detailed reading mode, as described by @rickj in the following comment: https://github.com/w3c/publishingcg/issues/72#issuecomment-1942724261

This is exactly the kind of information that we're looking for to kickstart this joint effort on TTS and reader mode.

w3c/publishingcg, issue #69