Text Citations and footnotes using ReadAloud can be disruptive to comprehension

GeorgeKerscher commented 9 months ago

Description

When using the ReadAloud function in a Reading System, or when a screen reader is being used, text citations in the text can be disruptive to reading comprehension. The same disruption occurs if a footnote is read where it occurs. The concept of skippability and escapability has been discussed for SMIL and media overlays, but it has not yet been addressed for ReadAloud or for screen readers.

This feature request originated in the EPUB Reading Systems accessibility testing, but it is not accessibility specific. We are requesting that the Publishing Community Group take up this issue. It relates to best practices for markup and having the feature in Reading Systems and with screen readers.

rickj commented 9 months ago

Here is what we do (user selection on specifics):

EPUB Text Finder

This section describes the algorithm that the TTS engine uses to find the text to read, and how it breaks that text up into paragraphs. (In the code they’re called “Utterances”, terminology borrowed from Apple’s speech APIs.)

The contents of the following tags are always ignored regardless of the filter settings: script, noscript, style, object, noframes

The following tags are considered to be paragraph separators: br, hr, tr, td, p, blockquote, title, ul, ol, li, table, pre, div, h1, h2, h3, h4, h5, h6, article, section, figcaption, figure, dl, dt, dd, aside, address, header, nav, footer, hgroup, caption
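As a rough illustration of how the two tag lists above could drive utterance splitting, here is a minimal sketch using Python's standard html.parser module. It is not our actual implementation: the class name UtteranceFinder, the function find_utterances, and the whitespace normalization are assumptions made for the example.

```python
from html.parser import HTMLParser

# Tag lists taken from the description above.
IGNORED_TAGS = {"script", "noscript", "style", "object", "noframes"}
PARAGRAPH_SEPARATORS = {
    "br", "hr", "tr", "td", "p", "blockquote", "title", "ul", "ol", "li",
    "table", "pre", "div", "h1", "h2", "h3", "h4", "h5", "h6", "article",
    "section", "figcaption", "figure", "dl", "dt", "dd", "aside", "address",
    "header", "nav", "footer", "hgroup", "caption",
}


class UtteranceFinder(HTMLParser):
    """Hypothetical helper: collects text and cuts it at separator tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.utterances = []
        self._buffer = []
        self._ignore_depth = 0  # > 0 while inside an ignored element

    def flush(self):
        # Collapse runs of whitespace (an assumption) and drop empty results.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.utterances.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in IGNORED_TAGS:
            self._ignore_depth += 1
        elif tag in PARAGRAPH_SEPARATORS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in IGNORED_TAGS:
            self._ignore_depth = max(0, self._ignore_depth - 1)
        elif tag in PARAGRAPH_SEPARATORS:
            self.flush()

    def handle_data(self, data):
        if self._ignore_depth == 0:
            self._buffer.append(data)


def find_utterances(markup):
    finder = UtteranceFinder()
    finder.feed(markup)
    finder.close()
    finder.flush()  # emit any trailing text
    return finder.utterances
```

Inline elements such as em or span do not end an utterance in this sketch, so a sentence with inline emphasis stays in one piece.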

In addition to locating the text, filtering types can be added to each utterance. The application can filter on any of these flags (for example, exclude figures and altText). Some of these filters are very simple and are applied to certain tags (for example, figures). Alt text is more complex, as the alt text utterances are extracted from a variety of tags and attributes. The accessibility text for math is similarly complex - we attempt to extract the accessibility description of the math and read that if possible (tagged with the math filter).

The ‘altText’ Filter

This filter reads (or excludes) image alt text. Alt text is extracted from:

The alt attribute of an img element.
The aria-label attribute of any element where role == “img”.

The ‘figure’ Filter

A paragraph has the figure filter if it is contained in an element where any one of these is true:

The element is figure
The epub:type attribute is figure

The ‘table’ Filter

A paragraph has the table filter if it is contained in an element where any one of these is true:

The element is table
The epub:type attribute is table

The ‘citation’ Filter

A paragraph has the citation filter if it is contained in an element where the epub:type is one of the following:

endnote
endnotes
footnote
footnotes
bibliography
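The altText, figure, table, and citation rules above can be sketched as two small functions. This is an illustrative sketch only: the function names, the representation of ancestors as (tag, attrs) pairs, and the choice to treat epub:type as a space-separated list of values are all assumptions, not a description of our code.

```python
# epub:type values that mark citation-like content, per the list above.
CITATION_TYPES = {"endnote", "endnotes", "footnote", "footnotes", "bibliography"}


def epub_types(attrs):
    # Assumption: epub:type may carry several space-separated values.
    return set(attrs.get("epub:type", "").split())


def filters_for(ancestors):
    """Return the filter names for a paragraph.

    `ancestors` is a list of (tag, attrs) pairs for every element that
    contains the paragraph (hypothetical data shape for this example).
    """
    found = set()
    for tag, attrs in ancestors:
        types = epub_types(attrs)
        if tag == "figure" or "figure" in types:
            found.add("figure")
        if tag == "table" or "table" in types:
            found.add("table")
        if types & CITATION_TYPES:
            found.add("citation")
    return found


def alt_text_utterance(tag, attrs):
    """Return the alt text to read for an element, or None if it has none."""
    if tag == "img" and attrs.get("alt"):
        return attrs["alt"]
    if attrs.get("role") == "img" and attrs.get("aria-label"):
        return attrs["aria-label"]
    return None
```

A paragraph can carry more than one filter at once, for example a table nested inside a figure, which is why the first function returns a set.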

The ‘uriLink’ Filter

The purpose of this filter is to edit out links where the text of the link is the linked URL. (Links where the link text is ordinary prose are always read; they are just part of the text with a link applied.)

To detect this, we look for a (anchor) elements, and extract both the element’s text and its href attribute.

If the href begins with “mailto:” (then it is an email link). Compare the rest of the href (removing “mailto:”) with the text - if they match, apply the uriLink filter.
If the href begins with “http:” or “https:” then:
- Trim the string “text” - remove all leading and trailing whitespace.
- See if the resulting string has at least one internal period and no internal whitespace. If so, then it looks URL-ish. See if this string is a substring of href. If it is, then apply the uriLink filter.
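The uriLink test above is compact enough to sketch in a few lines. The function name is_uri_link is hypothetical, and trimming the link text before the mailto comparison is an assumption (the description only mentions trimming for http/https links).

```python
def is_uri_link(text, href):
    """Return True when the visible link text is just the linked address."""
    candidate = text.strip()

    if href.startswith("mailto:"):
        # Compare the address part of the href with the link text.
        # Assumption: the text is trimmed before comparing.
        return href[len("mailto:"):] == candidate

    if href.startswith(("http:", "https:")):
        # "URL-ish": at least one period that is neither the first nor the
        # last character, and no whitespace anywhere in the trimmed text.
        has_internal_period = "." in candidate[1:-1]
        has_whitespace = any(ch.isspace() for ch in candidate)
        return has_internal_period and not has_whitespace and candidate in href

    return False
```

For example, a link whose text is "www.example.org/page" and whose href is "https://www.example.org/page" gets the filter, while a link with the text "the W3C guide" does not and is always read.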

The ‘math’ Filter

Math is one of the trickiest filters, because we extract the accessibility description from several different places. In place of the math element, we include the accessibility description and apply the “math” filter to it, so it can be excluded or included.

If the element name is math and it contains an attribute named altText (the MathML standard attribute for math alt text), then the utterance returned is the altText value and it is marked with the math filter.
If the element name is img and it has a role attribute where role == “math” and the img contains an alt attribute, then the utterance returned is the alt attribute value, and it is marked with the math filter.
If the element name is NOT img and it has a role attribute where role == “math” and the element contains an aria-label attribute, then the utterance returned is the aria-label value, and it is marked with the math filter.
(Note: This is specific to our mathjax preprocessor code, when using chtml output): If the element is a span or div and its class contains “mjx-chtml”, then look for an aria-label on the parent of the element. The utterance returned is the aria-label value, and it is marked with the math filter.
(Note: This is specific to our mathjax preprocessor code, when using svg output): If the element is a span or div and its class contains “vst-math-wrapper” - we look for a child of the element with the svg tag and look for the svg’s aria-label attribute. The utterance returned is the aria-label value, and it is marked with the math filter.
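The five math cases above can be read as an ordered lookup. The sketch below is only an illustration under stated assumptions: the Element class is a hypothetical stand-in for a DOM node, class matching is done on whole class tokens, and the svg is looked up among direct children only.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Element:
    """Hypothetical stand-in for a DOM element."""
    tag: str
    attrs: dict = field(default_factory=dict)
    parent: Optional["Element"] = None
    children: list = field(default_factory=list)


def math_utterance(el):
    """Return the accessibility description for a math element, or None."""
    attrs = el.attrs
    classes = attrs.get("class", "").split()

    # Case 1: math element with an altText attribute.
    if el.tag == "math" and "altText" in attrs:
        return attrs["altText"]

    # Cases 2 and 3: role == "math", on an img (alt) or any other element (aria-label).
    if attrs.get("role") == "math":
        if el.tag == "img":
            return attrs.get("alt")
        return attrs.get("aria-label")

    if el.tag in ("span", "div"):
        # Case 4: MathJax chtml output, description on the parent.
        if "mjx-chtml" in classes and el.parent is not None:
            return el.parent.attrs.get("aria-label")
        # Case 5: MathJax svg output, description on the child svg.
        if "vst-math-wrapper" in classes:
            for child in el.children:
                if child.tag == "svg":
                    return child.attrs.get("aria-label")

    return None
```

Any non-None result would then be emitted as an utterance tagged with the math filter.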

PDF Filtering

PDF has similar logic to the above when the PDF is a tagged PDF. Paragraph separators are described by the standard set of block-level PDF tags. The tags that define the PDF filters are described as follows:

The PDF ‘altText’ Filter

This filter detects if the PDF object is an image that has alt text defined. If so, the utterance is the alt text.

The PDF ‘figure’ Filter

This filter detects if an utterance is contained in the Figure tag.

The PDF ‘table’ Filter

This filter detects if an utterance is contained in the Table tag.

The PDF ‘caption’ Filter

This filter detects if an utterance is contained in the Caption tag.

The PDF ‘math’ Filter

This filter detects if an utterance is contained in the Formula tag.

The PDF ‘citation’ Filter

This filter detects if an utterance is contained in the Note or FENote tag.
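The tag-based PDF filters amount to a lookup table. A minimal sketch, assuming each utterance comes with the list of structure tags that enclose it (the function name pdf_filters and the tag_path shape are hypothetical); the altText filter is left out because it depends on the image object rather than on an enclosing tag.

```python
# Structure tag -> filter name, per the descriptions above.
PDF_TAG_FILTERS = {
    "Figure": "figure",
    "Table": "table",
    "Caption": "caption",
    "Formula": "math",
    "Note": "citation",
    "FENote": "citation",
}


def pdf_filters(tag_path):
    """Return the filters for an utterance given its enclosing structure tags."""
    return {PDF_TAG_FILTERS[tag] for tag in tag_path if tag in PDF_TAG_FILTERS}
```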

wareid commented 9 months ago

This is directly related to what is proposed in #69 and I wonder if we should just combine these two issues.

With "read aloud" are we referring to text-to-speech, screen reader output, media overlays, or a combination of the three?

rickj commented 9 months ago

With "read aloud" are we referring to text-to-speech, screen reader output, media overlays, or a combination of the three?

Those are three different domains that cannot share a common solution:

Media Overlays - dependent on the intersection of a properly marked up EPUB and a Reading System that supports media overlays
Screen Reader Output - Assistive technology is a 'black box', outside of the control of a reading system. Changes here would need to be targeted differently
TTS (Read Aloud)- This is inside the control of a reading system... and we should come up with a recommended approach (like the above! )

GeorgeKerscher commented 9 months ago

I believe that for the RS to give the reader the option to avoid unwanted spoken information, the content would need to be marked up. In particular, where the author is referencing a formal citation, if this were marked up, the RS ReadAloud function could give the option to skip it.

The option of skipping the reading of footnotes could also be offered. Here, content marked as doc-footnote could be skipped.

I believe this skipping could also be implemented by screen readers.

In the case of SMIL, if the content was marked up, then this could be identified in the SMIL markup.

GeorgeKerscher commented 9 months ago

Yes, this issue is directly related to #69, but it is much simpler to implement.

If we create a best practice for marking citations, I think there is enough general markup to resolve this issue.

RS systems could simply add the option of what to skip in their ReadAloud.

For example, in ReadAloud, skip:

citations
footnotes
alt text
page numbers

These could be toggled. In textbooks, I would want page numbers read, but in a novel it would be disruptive. People should be able to choose.
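To make the toggle idea concrete, here is a minimal sketch of how a reading system might apply per-user skip settings to tagged utterances. Everything here is an assumption for illustration: the Utterance class, the function utterances_to_speak, and the filter name "pageNumber" are hypothetical and not taken from any existing reading system.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    """Hypothetical unit of text to be spoken, tagged with filter names."""
    text: str
    filters: frozenset = frozenset()


def utterances_to_speak(utterances, skip):
    """Drop every utterance that carries at least one filter the user skips."""
    return [u for u in utterances if not (u.filters & skip)]


book = [
    Utterance("Chapter one begins."),
    Utterance("Page 12", frozenset({"pageNumber"})),
    Utterance("See Smith 2019, p. 4.", frozenset({"citation"})),
]

# A novel reader might skip both; a textbook reader might keep page numbers.
novel_settings = {"citation", "pageNumber"}
textbook_settings = {"citation"}
```

With the novel settings only the first utterance is spoken; with the textbook settings the page number is read as well.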

wareid commented 9 months ago

With "read aloud" are we referring to text-to-speech, screen reader output, media overlays, or a combination of the three?

Those are three different domains that cannot share a common solution:

Media Overlays - dependent on the intersection of a properly marked up EPUB and a Reading System that supports media overlays

Screen Reader Output - Assistive technology is a 'black box', outside of the control of a reading system. Changes here would need to be targeted differently

TTS (Read Aloud)- This is inside the control of a reading system... and we should come up with a recommended approach (like the above! )

So the problem I'm having here is that I have never seen TTS referred to as read aloud. I've actually seen this terminology from publishers in the context of media overlays, in places like the description of the book, or in the context of a specific learning style (there is a lot of content out there on "Read Aloud" practices that include media overlays or teach parents how to read aloud).

In user discussions, we have also only heard users refer to either TTS or "reader mode", not "read aloud". I want to make sure we're using accurate and precise language, and unify on it, so we avoid confusion on both the publisher side (where I currently see a lot of confusion between the methods), and the user side.

sueneu commented 9 months ago

Publishers that I work with use "Read Aloud" to mean text-to-speech. Perhaps we define our terms in any resulting spec.

Media Overlays: Audio or video files embedded in an ebook.
Text-to-Speech (TTS): Audio generated from text by an ebook reading system.
Screen Reader Output: Audio generated from text by user-selected assistive technology that is separate from the ebook reading system.

clapierre commented 9 months ago

Hi @sueneu, I have heard folks use "Read Aloud" in place of what you have defined as "Text-to-Speech (TTS)", because in reading systems the button you press is labelled "Read Aloud", not "TTS".

sueneu commented 9 months ago

@clapierre well, that explains some of the confusion!

I've been told by developers that "Read Aloud" refers only to synchronized media overlays. So there is some variation within the industry. For that reason, we should be careful to define terms in documentation.

Could we define "Read Aloud" as any audio expression of the text, no matter what technology (i.e., TTS or media overlays) is used? You could easily make the argument that the user needn't be aware of how the audio is generated.

Documentation for publishers, producers, and reading systems could further define the underlying tech.

mattgarrish commented 9 months ago

You might want to refer to the guide @GeorgeKerscher wrote: https://www.w3.org/publishing/a11y/audio-playback/

It gets into the confusion around the Read Aloud v. Read Now naming.

wareid commented 9 months ago

But @mattgarrish, the document you're referring to explicitly makes the difference between media overlays "Read Aloud" and TTS as separate features.

There's a big gulf between the two features, and in many cases, completely different sub-features between the two. Most SMIL implementations don't allow you to adjust reading speed, for instance, and SMIL allows the publisher to customize text highlighting, but TTS implementations do not. The audio also differs: SMIL is most often a human narrator, whereas TTS is computer generated. I think it's really important to be clear about what the user is going to experience, especially in cases where both options might be available for a title.

EDIT: I also think it's important to point out that the two features have completely different origins, one is publisher-driven and provided, the other is reading-system driven.

mattgarrish commented 9 months ago

the document you're referring to explicitly makes the difference between media overlays "Read Aloud" and TTS as separate features

It's defining "full audio" publications as those that use media overlays and TTS for the reading system/AT-generated playback, regardless of what names are assigned to those features in different reading systems. When you start using generic names like "read aloud" it means different things to different people. I'm only pointing it out as a means of standardizing the language used to talk about the issue.

HadrienGardeur commented 8 months ago

From a reading app perspective, I'm not sure that there's always a need to identify TTS and media overlay as two completely different affordances.

Framing this as a User Story: "As a user, I would like to listen to an ebook and have sufficient control over that experience".

The following preferences/features can apply to both of them:

play/pause/stop
skip to next/previous utterances
highlight colour
speed
continuous playback (this mostly applies to FXL content, where you might want to automatically pause the playback until the reader moves forward to the next page/spread)
skippability could apply to both as Media Overlay/SMIL also provides semantic information that can be used to skip specific utterances

While Media Overlays can come with their own CSS class for highlighting, this authored preference could prove problematic to some users and it makes sense to always offer the ability to customize things. I would need to double check but as far as I can remember, this is also optional in EPUB, which means that reading systems need a way to handle highlighting if it isn't authored in the file anyway.

For reading speed, it's well known by now that many users want the ability to tweak things to their own liking. This goes beyond ebooks/audiobooks, since podcast and video apps often offer this option as well (there are many people watching anime at a higher speed for example).

I believe that this eventually comes down to two key differences:

Media Overlay may provide a higher quality audio experience, if it's recorded by a real human narrator (TTS could also be used to mass produce such files)
and the way content is broken down into utterances (more control from the reading system with TTS)

As TTS becomes better and better, I believe that the barrier between the two of them will continue to break down. Just earlier this week, I read an article about Storytel providing TTS as an alternative option in a number of audiobooks that they provide: https://www.boktugg.se/2024/02/27/rostbytaren-storytel-lanserare-voice-switcher-pa-svenska/

The key argument being: "A whopping 89% of Storytel's listeners have at some point finished a book, not because the book was bad, but because the voice didn't suit them".

dalerrogers commented 8 months ago

Hadrien:

Everything mentioned could be accomplished via JS. Modern browsers could accomplish all of the above. Do EPUB readers allow JS, or do they limit it to document control? Do they have and require an API? Are the DOM and JS the API?

I wonder. Since EPUB is a website wrapped in a package, would a simpler solution be to allow web browsers the ability to see inside a ZIP archive and read ePUB files, rather than wait for the readers to catch up?

Best Regards,

Dale Rogers, M.Ed., CIW Designer eLearning Developer http://dalerogers.me/ https://www.linkedin.com/in/dalerrogers/

HadrienGardeur commented 8 months ago

@dalerrogers you're right that this can be entirely done in JS and in fact that's what a number of reading apps do (mostly the ones that are Web Apps since there are better native options available for dividing text into utterances and then reading these utterances using a TTS engine).

In such cases, the JS handling all of that is served by the reading app though, not the publication. That's consistent with using Edge's TTS feature, which works on every website that you visit. I think Chrome has something similar in testing as well.

The main issue when implementing TTS with Web technologies right now is inconsistency across browser implementations of lower-level APIs. For example, Intl.Segmenter is very useful for dividing text into utterances, but it's not available in Firefox yet (it's in nightly builds, which is good news though). Getting a boundary event from SpeechSynthesisUtterance is very useful for following how far an utterance has been read, but support is also lacking and/or inconsistent across the board.

dalerrogers commented 2 months ago

Hello all:

I sent this request to Avneesh Singh as well. So, I apologize in advance for cross-posting. I’m trying to track down an error as I am publishing my first fixed-layout eBook to Amazon, Kobo, and Ingram Spark.

The short version:

I’m hand-coding the EPUB so I know exactly what is in there. I’m a front-end coder and have taught HTML/CSS for 18 years. I am CIW certified. I understand markup. I’m using VS Code as my editor so there shouldn’t be any odd hidden characters. It’s plain text, coded as UTF-8.

I ran my EPUB package through the latest EPUB checker (version 5.1.0 according to the CHANGELOG.txt file). It validates. I ran it through the Daisy ACE checker. It validates. It opens and displays as designed in the iBooks and Kindle apps on my MacBook Air, iPad Pro, and iPhone. So far, so good.

During the submission process to Ingram Spark and Kobo Writing Life, I’m getting the error:

Error while parsing file: [attribute "class" not allowed here; expected attribute "dir", "version" or "xml:lang"] in OEBPS/title-page.xhtml, line 2

Line 2 contains the following code…

I ran the code and errors through ChatGPT to see if AI could help me isolate the issue. It recommended I confirm the versions of EPUB check being used. Good idea. According to Kobo documentation (https://github.com/kobolabs/epub-spec/blob/master/README.md#epub-versions-kobo-supports), it uses EPUB checker version 4.2.4. I validated my file with EPUB checker version 5.1.0.

Is that what is throwing my Kobo and Ingram Spark errors? Should I ignore the Ingram Spark and Kobo validator warnings and proceed with confidence that my validators are more current? Has anyone else run into this?

My project is a comic book. The intended audience is sighted readers. Still, I want everyone to enjoy the experience, and the image alt attributes have rich descriptions of all the panels.

Is this an issue to be reported? What is the workaround or guidance to get my book published?

Best Regards,
Dale

Dale R Rogers, M.Ed, CIW
Creator | Designer | Educator
Web: dalerogers.me

dalerrogers commented 2 months ago

Hi Dale

I suggest you add xml:lang=“en” to that line.

This is an example of a typical html tag for a fixed layout EPUB content document.

Thanks
Ken

Ken Jones
Director, Circular Software Limited
circularsoftware.com
linkedin.com/in/kenjones
