w3c / ruby-t2s-req

Text to Speech of Electronic Documents Containing Ruby: User Requirements
https://w3c.github.io/ruby-t2s-req/

Reading aloud ruby without reading aloud ruby base #12

Open murata2makoto opened 2 years ago

murata2makoto commented 2 years ago

@aleventhal wrote here:

@murata2makoto , you mentioned that assuming we can announce the <rt> text instead of the base text would be catastrophic, because it could change the meaning. (I admittedly had a hard time understanding why, since the rt is supposed to be an announcement, and my understanding of Japanese is zero). However, are there times that it would be better to read the rt instead of the base text, and is that something we might want to put in author control via a semantic? Right now, I think our semantics would suggest times to read both, but not just the rt text.

Your Japanese colleagues are likely to say that reading aloud the rt without reading the ruby base makes sense. But I know that those who have actually tried it do not readily agree. Without the ruby base text, morphological analysis is significantly hampered.

Another obstacle is that usually only the first occurrence has ruby. Then 智子 with ruby ともこ and a later 智子 without ruby will be read aloud differently.

I think that PLS works much better than ruby for difficult proper nouns or human names. Passing both without double reading is also very nice, if future APIs allow that. Passing ruby without ruby base text to the TTS engine is not a good choice.

aleventhal commented 2 years ago

Without ruby base text, morphological analysis is significantly hampered.

What we expose in the accessibility API's tree structure vs. what is read aloud can be two different things. I'm not sure what the morphological analysis is, but shouldn't that help? All of the information is still available.

Another obstacle is that usually only the first occurrence has ruby. Then 智子 with ruby ともこ and a later 智子 without ruby will be read aloud differently.

Very interesting. Forgive all my questions. Should we try to make it so that the repeated texts are discovered and announced the same as the other?

I think that PLS works much better than ruby for difficult proper nouns or human names.

What is PLS?

Passing both without double reading is also very nice, if future APIs allow that.

This is already happening in Chrome.

Passing ruby without ruby base text to the TTS engine is not a good choice.

I'm really suggesting we continue to pass both, but provide the TTS engine with the correct hint of what to read by default, and make the other available via the API.

murata2makoto commented 2 years ago

Without ruby base text, morphological analysis is significantly hampered.

What we expose in the accessibility API's tree structure vs. what is read aloud can be two different things. I'm not sure what the morphological analysis is, but shouldn't that help? All of the information is still available.

I created another issue for discussing this topic.

Another obstacle is that usually only the first occurrence has ruby. Then 智子 with ruby ともこ and a later 智子 without ruby will be read aloud differently.

Very interesting. Forgive all my questions. Should we try to make it so that the repeated texts are discovered and announced the same as the other?

In this case, 智子 should always be read aloud the same way. Some people argue that TTS engines should keep track of base and ruby pairs. I am not sure if this is the right approach. I mentioned this topic briefly in the new issue mentioned above.

I think that PLS works much better than ruby for difficult proper nouns or human names.

What is PLS?

Pronunciation Lexicon Specification (PLS) is usable from EPUB content documents. (See EPUB 3 Text-to-Speech Enhancements 1.0). It is used by at least one Japanese company for digital textbooks. I think that PLS is particularly useful for human names.
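For readers unfamiliar with PLS, here is a minimal sketch of what a lexicon entry for a human name could look like, built with Python's standard library. The helper name `build_lexicon` is hypothetical, and the choice of `<alias>` with a kana reading (rather than `<phoneme>`) and of `alphabet="ipa"` are assumptions made for illustration, not something stated in this thread.

```python
import xml.etree.ElementTree as ET

PLS_NS = "http://www.w3.org/2005/01/pronunciation-lexicon"
XML_NS = "http://www.w3.org/XML/1998/namespace"


def build_lexicon(entries: dict[str, str], lang: str = "ja") -> str:
    """Serialize {grapheme: kana reading} pairs as a PLS lexicon string."""
    # Emit the PLS namespace as the default namespace (no prefix).
    ET.register_namespace("", PLS_NS)
    lexicon = ET.Element(
        f"{{{PLS_NS}}}lexicon",
        {"version": "1.0", "alphabet": "ipa", f"{{{XML_NS}}}lang": lang},
    )
    for grapheme, reading in entries.items():
        lexeme = ET.SubElement(lexicon, f"{{{PLS_NS}}}lexeme")
        ET.SubElement(lexeme, f"{{{PLS_NS}}}grapheme").text = grapheme
        # Assumption: the reading is given as a kana spelling via <alias>;
        # a <phoneme> element could be used instead.
        ET.SubElement(lexeme, f"{{{PLS_NS}}}alias").text = reading
    return ET.tostring(lexicon, encoding="unicode")


# One lexicon entry covers every occurrence of the name in the publication.
pls_xml = build_lexicon({"智子": "ともこ"})
```

Because the lexicon is keyed by the written form, it applies to occurrences that carry no ruby, which is the situation described above.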

Passing both without double reading is also very nice, if future APIs allow that.

This is already happening in Chrome.

Passing ruby without ruby base text to the TTS engine is not a good choice.

I'm really suggesting we continue to pass both, but provide the TTS engine with the correct hint of what to read by default, and make the other available via the API.

Interesting. Do TTS engines have access to the accessibility tree?

murata2makoto commented 2 years ago

Can we say that reading aloud ruby only is not a good idea and close this issue? Or, should we explicitly say so in the note?

aleventhal commented 2 years ago

Interesting. Do TTS engines have access to the accessibility tree?

The screen reader which drives TTS has access to the AX tree. DAISY readers probably read via the DOM, so they could also have access via the markup.

  1. We could give API consumers both pieces of text with a hint of what to read, if we have it. The default could be to read the base text only. The hint rules could be to read the base, the ruby text, or both.
  2. The browser or screen reader could apply the same hint rule to other instances of the same ruby base text within that document.

I think we can say reading aloud the ruby text instead of the base text by default is a bad idea. However, it may become a good idea if there is a markup hint to do so, or a reasonably successful heuristic developed that indicates it should be done. What do you think?
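The two points above could be sketched roughly as follows. This is only an illustration: the names `ReadHint`, `RubyRun`, `propagate` and `spoken_text` are hypothetical and do not correspond to any existing accessibility API, and the sketch assumes the document has already been split into runs of base text with optional ruby.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ReadHint(Enum):
    BASE = "base"   # read the base text only (the suggested default)
    RUBY = "ruby"   # read the ruby text only
    BOTH = "both"   # read the base text, then the ruby text


@dataclass
class RubyRun:
    base: str
    ruby: Optional[str] = None
    hint: Optional[ReadHint] = None


def propagate(runs: list[RubyRun]) -> list[RubyRun]:
    """Apply the first annotated occurrence's ruby and hint to bare repeats."""
    known: dict[str, RubyRun] = {}
    for run in runs:
        if run.ruby is not None:
            known.setdefault(run.base, run)
    result = []
    for run in runs:
        source = known.get(run.base)
        if run.ruby is None and source is not None:
            run = RubyRun(run.base, source.ruby, source.hint)
        result.append(run)
    return result


def spoken_text(run: RubyRun) -> str:
    """Pick what to hand to the speech layer, defaulting to the base text."""
    hint = run.hint or ReadHint.BASE
    if run.ruby is None or hint is ReadHint.BASE:
        return run.base
    if hint is ReadHint.RUBY:
        return run.ruby
    return f"{run.base} {run.ruby}"


runs = [RubyRun("智子", "ともこ", ReadHint.RUBY), RubyRun("は"), RubyRun("智子")]
spoken = [spoken_text(r) for r in propagate(runs)]
```

Both pieces of text stay available on every run, so a consumer can still offer the other reading on request.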

murata2makoto commented 2 years ago

Interesting. Do TTS engines have access to the accessibility tree?

The screen reader which drives TTS has access to the AX tree. DAISY readers probably read via the DOM, so they could also have access via the markup.

Thanks. I guess that we shouldn't think too much about those screen readers which have access to text and nothing else.

  1. We could give API consumers both pieces of text with a hint of what to read, if we have it. The default could be to read the base text only. The hint rules could be to read the base, the ruby text, or both.
  2. The browser or screen reader could apply the same hint rule to other instances of the same ruby base text within that document.

This sounds sensible to me, although I continue to think PLS-based initialization would work better than ruby for human names or proper names.

I think we can say reading aloud the ruby text instead of the base text by default is a bad idea. However, it may become a good idea if there is a markup hint to do so, or a reasonably successful heuristic developed that indicates it should be done. What do you think?

I started to write a response but haven't finished it yet. Let me try again tomorrow.

aleventhal commented 2 years ago

Thanks. I guess that we shouldn't think too much about those screen readers which have access to text and nothing else.

In other words, text-only ATs don't exist. They always get semantics.

... I continue to think PLS-based initialization would work better than ruby for human names or proper names.

There is another effort to allow authors to provide pronunciation rules for things. It might be nice to allow either, because if the author is already supplying Ruby, they may not have time to do the PLS, or may not know how, etc.

I started to write a response but haven't finished it yet. Let me try again tomorrow.

Thanks!

murata2makoto commented 2 years ago

See my last comment in #8.

murata2makoto commented 2 years ago

I am still working on this issue. #15 describes a common problem caused by the use of ruby rather than base characters for T2S.

murata2makoto commented 2 years ago

As of now, using ruby for TTS sometimes gives better results than using base characters. If we make sure that the particles preceding ruby are read aloud correctly (see #15), the result of using ruby for TTS will be correct, although it may sound unnatural (due to failures of morphological analysis).

Here is my proposal. If the TTS engine can access both the base and ruby, it can first apply morphological analysis to the base and then compare the result against the ruby.

Case 1: phonetic ruby

The TTS engine should use the result of morphological analysis if it is consistent with the ruby (i.e., the same kana sequence with accent information added). Otherwise, the TTS engine should use the ruby.

I do not think that we need a mode for ignoring the base and relying on the ruby. To me, ruby is just a hint.

Case 2: non-phonetic ruby

The TTS engine should read aloud the base and then read the ruby.

But how can we reliably know whether a given ruby element represents phonetics? I think that we have to rely on explicit markup. If such markup is not available, we cannot tell whether the given ruby is non-phonetic or the result of morphological analysis is incorrect.
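The proposal above (Case 1 and Case 2) might be sketched like this. Everything here is an assumption for illustration: `Reading`, `choose_readings` and `toy_analyze` are hypothetical names, the morphological analyzer is replaced by a toy lookup table, and the phonetic/non-phonetic flag is assumed to come from explicit markup.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Reading:
    text: str
    accent: Optional[int] = None  # hypothetical accent annotation


def choose_readings(
    base: str,
    ruby: str,
    analyze: Callable[[str], Reading],
    ruby_is_phonetic: bool,
) -> list[Reading]:
    """Return what to speak, in order, for one ruby element."""
    analyzed = analyze(base)
    if not ruby_is_phonetic:
        # Case 2: read the base, then the ruby.
        return [analyzed, Reading(ruby)]
    if analyzed.text == ruby:
        # Case 1, consistent: keep the analysis, which carries accent info.
        return [analyzed]
    # Case 1, inconsistent: fall back to the ruby as written.
    return [Reading(ruby)]


# Toy stand-in for a morphological analyzer; it guesses a different reading.
_TOY_DICTIONARY = {"智子": Reading("さとこ", accent=1)}


def toy_analyze(base: str) -> Reading:
    return _TOY_DICTIONARY.get(base, Reading(base))


mismatch = choose_readings("智子", "ともこ", toy_analyze, ruby_is_phonetic=True)
match = choose_readings("智子", "さとこ", toy_analyze, ruby_is_phonetic=True)
```

A real comparison would probably need to normalize hiragana and katakana before checking consistency; the sketch compares strings directly.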

aleventhal commented 2 years ago

Makoto, because screen reader users may be navigating by character, it might be difficult to provide TTS engines with enough context to add heuristics. The context may be more than the ruby or ruby base — it may include other ruby used within the document, or other clues such as text before/after, or an analysis of the document's text overall. In addition, unless I'm mistaken, there isn't a way to give the TTS engines both pieces of text. Therefore, it may be best to keep the TTS as a "dumb" speech layer, and to have any heuristics or ML in a layer before the TTS, e.g. in the browser, screen reader or DAISY reader. At a higher level, there is more context to decide what to pass to the TTS. What do you think?

As for Case 2, guessing whether ruby is non-phonetic, I saw at least one example where there were numerical digits in the ruby text. I think it's clear that at least sometimes, a correct guess can be made that improves things. As we know from ARIA, relying on authors to add markup that only affects accessibility means that many, many pages will not include the markup. And let's not forget about existing text documents that never had the markup.

We may be agreeing with each other without realizing it :) I'm not suggesting we only use ML or heuristics. I'm just making sure we understand how they could be applied. The idea would be that the author is encouraged to add extra markup if they want to make sure things work correctly. However, if the markup is not present, as is the case for all texts using ruby today, then there is an opportunity for smart software to improve pronunciations. We cannot know if it's correct, but we can still make improvements.
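As one example of the kind of guess mentioned above (digits in the ruby text), a very simple heuristic could treat ruby as phonetic only when it consists entirely of kana. This is a sketch under that assumption; the function name `looks_phonetic` is hypothetical, and kana-only ruby can still be non-phonetic, so a positive result is a hint rather than a guarantee.

```python
def looks_phonetic(ruby_text: str) -> bool:
    """Guess whether ruby text is a kana reading of its base."""
    if not ruby_text:
        return False
    return all(
        "\u3041" <= ch <= "\u3096"      # hiragana letters
        or "\u30a1" <= ch <= "\u30fa"   # katakana letters
        or ch == "\u30fc"               # prolonged sound mark
        for ch in ruby_text
    )


guesses = {text: looks_phonetic(text) for text in ["ともこ", "トモコ", "2023"]}
```

Such a check would run in the layer before the TTS (browser, screen reader or DAISY reader) and would only be consulted when author markup is absent.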

murata2makoto commented 2 years ago

@aleventhal

In addition, unless I'm mistaken, there isn't a way to give the TTS engines both pieces of text. Therefore, it may be best to keep the TTS as a "dumb" speech layer, and to have any heuristics or ML in a layer before the TTS, e.g. in the browser, screen reader or DAISY reader. At a higher level, there is more context to decide what to pass to the TTS. What do you think?

I suspected that this might be the sad reality. Then, should we give up on the use of morphological analysis by browsers? I know that another accessibility-related feature (word-boundary detection in CSS Text Level 4) requires morphological analysis.

murata2makoto commented 2 years ago

In the EPUB case, different HTML content documents in the same EPUB publication might provide ruby for a human name. In other words, "context" might include other HTML documents.

aleventhal commented 2 years ago

It's unlikely that processing in a browser, screen reader or elsewhere would utilize anything other than the current document for context. However, I think it's worth investigating to see how much having the entire document as context could help a heuristic/ML system.

It doesn't necessarily need to affect what semantics are chosen now. For now, it seems enough to know that when semantics are not provided by the author, there is the potential for smart algorithms to fill in the gaps, with a currently unknown degree of accuracy.