NVDA isn't ignoring soft hyphens properly

Michael-Detmers commented 5 years ago

Steps to reproduce:

Save and open the following HTML:

<!DOCTYPE html>
<html lang="en">
  <head>
    <title>Hyphenation Test</title>
  </head>
  <body>
  <h1>Hooray for Hyphen&shy;ation</h1>
  <p>This text contains some Hyphen&shy;ation. Hope&shy;fully it is not in&shy;com&shy;pre&shy;hen&shy;sible. Usually NVDA can process soft hyphens in documents pretty well, but did you notice the previous word "incomprehensible"?</p>
  <p>Pay attention to the word "pronunciation", which is pronounced differently with soft hyphens in it:</p>
  <p><strong>With: </strong> "pro&shy;nun&shy;cia&shy;tion"</p>
  <p><strong>Without:</strong> "pronunciation"</p>
  <h2>Pro&shy;nun&shy;cia&shy;tion is the key to under&shy;standing the spoken word</h2>
  <p><strong>Did you know?</strong> Soft hyphens have been around since the 80s!</p>  
  <p>The next heading does not contain hyphens.</p>
  <h2>Pronunciation is the key to understanding the spoken word</h2>
  <p>It's a pity hyphenation cannot be reliably applied via CSS.</p>
  </body>
</html>

Read the document with NVDA.
Access NVDA's Element Browser via the NVDA-key + F7.
In the Element Browser, switch to headings and let NVDA read the entries to you. (Soft hyphens are shown in the display.)

Actual behavior:

Soft hyphens are splitting words and causing odd pronunciations.

Expected behavior:

Soft hyphens are ignored.

System configuration

NVDA installed/portable/running from source:

installed

NVDA version:

2018.4.1

Windows version:

Win 7 64 bit

Name and version of other software in use when reproducing the issue:

Firefox 65.0.2

Other information about your system:

Default language is German (but that shouldn't matter, should it?)

Other questions

Does the issue still occur after restarting your PC?

yes

Have you tried any other versions of NVDA?

no

Log

nvda.log

DrSooom commented 5 years ago

@Michael-Detmers: Thank you very much for opening this issue, because I planned exactly the same.

But I would suggest that the user should still have the option to enable or disable filtering of the soft hyphen (U+00AD) via NVDA's Browse Mode settings. As a web developer you should be able to check the correct position of soft hyphens in all web browsers. But normally this character is of no benefit to screen reader users. Sadly, with responsive web design it is being used more and more often, and reading a news article that contains "hundreds" of them via speech and/or braille is extremely annoying.

CC: @michaelDCurran, @jcsteh and @MarcoZehe

DrSooom commented 5 years ago

One more question: What shall we do with the new HTML5 tag <wbr>? Any thoughts?

Michael-Detmers commented 5 years ago

@DrSooom One more question: What shall we do with the new HTML5 tag <wbr>? Any thoughts?

Since its purpose is to affect how a line of text is displayed, I'd vote for it to be generally ignored as well. And, as you suggest, there would have to be an option to read all punctuation and special characters verbatim for development and quality assurance purposes.

Neurrone commented 5 years ago

A practical example where this is a huge problem is the CTAN repository for LaTeX packages, for example the page for the Amsmath package.

Michael-Detmers commented 3 years ago

Sadly, no change. On Windows 10 (Enterprise, 64 bit) with Firefox 79, NVDA 2020.2 still chops up words containing soft hyphens. (And neither CSS nor browsers have provided us with a reliable, universal alternative to soft hyphens yet.)

For the most part, hyphenation is unfortunately needed to meet the WCAG reflow requirements. Without it, long words will simply either flow out of visible areas, overlap each other or - ironically - will also be visually chopped up without any sign of continuation, since the hyphens are missing.

So the current state is this: Either we make it hard to understand for our blind visitors or for our sighted ones. And because I cannot find a suitable WCAG requirement for, so to speak, "avoiding hyphenation", with a heavy heart I still must recommend sticking to the thirty-year-old Unicode control character. I hope widespread support for these typography tools will be available soon, so all users can have a great experience.

SaschaCowley commented 3 years ago

Switching the soft hyphen to be passed to the synthesiser (in Punctuation/symbol pronunciation...) fixes the issue, at least with eSpeak. I'll look into how it goes with other synths and see if I can change that to be the default and make a PR.

SaschaCowley commented 3 years ago

Shouldn't have been so hasty. While eSpeak handles soft hyphens correctly, none of the other synths I have installed (SAPI5, OneCore, Eloquence and Vocalizer) do. While handling them correctly probably should be up to the synthesiser, just switching them to be passed directly to the synth is not a very satisfactory solution. I'm not sure that having a setting to strip them is particularly satisfactory either. As a temporary workaround, a speech dictionary entry that replaces them with the empty string seems to work (suggested by Ralf Kefferpuetz on the mailing list).
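For illustration only, the effect of that dictionary workaround can be sketched in plain Python. This is not NVDA's speech dictionary code; `strip_soft_hyphens` and `_RULE` are hypothetical names that merely mimic a rule replacing U+00AD with the empty string before text reaches a synthesiser.

```python
import re

SOFT_HYPHEN = "\u00ad"  # U+00AD SOFT HYPHEN

# Hypothetical stand-in for one speech dictionary entry: (pattern, replacement).
_RULE = (re.compile(re.escape(SOFT_HYPHEN)), "")


def strip_soft_hyphens(text: str) -> str:
    """Return text with every soft hyphen replaced by the empty string."""
    pattern, replacement = _RULE
    return pattern.sub(replacement, text)


sample = "pro\u00adnun\u00adcia\u00adtion"
print(len(sample), len(strip_soft_hyphens(sample)))
```

Because the replacement is empty, the syllables are joined back into one word instead of being separated by a break.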

DrSooom commented 3 years ago

LeonarddeR commented 3 years ago

I guess to fix this properly, we need an additional behavior in the speech symbol processor that simply discards the symbol as if it weren't there.

DrSooom commented 3 years ago

@leonardder: Please don't overlook the braille output, as ⠁⠏⠏⠇⠊⠉⠁⠞⠊⠕⠝ is also easier to read instead of ⠁⠏⠏⢤⠇⠊⢤⠉⠁⢤⠞⠊⠕⠝ (⢤ = SHY in German 8-dot), but both are needed depending on the situation (e.g. dictionaries, word processing, web/app development). I already pointed this out in my above linked comment.

LeonarddeR commented 3 years ago

I think handling soft hyphens primarily should be a task of the braille table. In the Dutch 8 dot table for example, we ignore it completely.

DrSooom commented 3 years ago

In the Dutch 8 dot table for example, we ignore it completely.

This is imho highly unwanted for the reasons I mentioned above, because you cannot check the correct position of the SHY character if you cannot use TTS at the same time. And TTS will only work correctly here if you navigate character by character, which is time consuming and thus not really comfortable.

As you already know me in such situations: The end user should have the power to change this behavior, not only the (liblouis) devs on their behalf. And issue #10634 also handles additional Unicode characters, which should be ignored in braille and speech output in the same way. So it's easier to add SHY (U+00AD) to that list as well.

masi commented 3 years ago

Please, ignore SHY. For German we need lots of soft hyphens. In many projects we need automated hyphenation, which escalates this problem.

julianladisch commented 3 years ago

The issue is fixed when setting "Punctuation/symbol level" to "some" in the Speech settings. I use NVDA version 2020.4.

masi commented 3 years ago

I don't understand why removing soft hyphens is not desirable. Normally the soft hyphen is invisible and should not be announced. And if it is shown, it should IMHO not be announced either. It conveys absolutely no information related to the contents.

julianladisch commented 3 years ago

Removing soft hyphens is desirable.

NVDA ignores and doesn't announce soft hyphens when the user has set "Punctuation/symbol level" to "none" or "some" in the NVDA Speech settings. "some" is the default. NVDA announces "soft hyphen" for each soft hyphen when the user has set "Punctuation/symbol level" to "most" or "all" in the NVDA Speech settings.

An NVDA user might want to verify the correct positioning of the soft hyphens and therefore needs an option to make NVDA announce them.

Microsoft Word has a similar setting, the Show/Hide Paragraphs option. Word's help explains: "Show paragraph marks and other hidden formatting symbols. This is especially useful for advanced layout tasks." This option shows optional hyphens. The fact that this option exists shows that there are valid use cases for revealing hidden symbols, for example proofreading that includes formatting symbols.

DrSooom commented 3 years ago

@julianladisch: Which TTS synthesizers are you using? And which languages?

masi commented 3 years ago

An NVDA user might want to verify the correct positioning of the soft hyphens and therefore needs an option to make NVDA announce them.

That makes sense. My fault that I didn't think of that use case.

So it boils down to the question of whether soft hyphens should be announced with setting "most" or only with "all". Or they could get their own setting. After all, proofreading is probably not what users do all the time.

eigenstil commented 3 years ago

Just a quick reminder: this issue is NOT about the announcement of the "shy" character. It is about the odd pronunciation of whole words in which "shy" is used.

Adriani90 commented 3 years ago

cc: @michaelDCurran

julianladisch commented 3 years ago

Thank you for the clarification.

The steps to reproduce should be extended:

Disable soft hyphen pronunciation in the punctuation/symbols level settings and the Symbol Pronunciation settings. This works, NVDA doesn't say "soft hyphen".

The "Actual behavior" should be:

The pronunciation is the same as if each soft hyphen were replaced by a space. NVDA incorrectly pronounces pronunciation like pro nun cia tion. NVDA incorrectly pronounces each syllable as a separate word. NVDA incorrectly pronounces the cia syllable as CIA (Central Intelligence Agency).
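The difference between the behavior described above and the expected behavior can be shown with a small Python sketch. It does not call NVDA; `as_currently_spoken` and `as_expected` are hypothetical helpers that only model "soft hyphen treated like a space" versus "soft hyphen ignored".

```python
SOFT_HYPHEN = "\u00ad"  # U+00AD SOFT HYPHEN

# The test word from the reproduction document, with its soft hyphens.
word = SOFT_HYPHEN.join(["pro", "nun", "cia", "tion"])


def as_currently_spoken(text: str) -> list[str]:
    """Model the reported bug: each soft hyphen acts like a space."""
    return text.replace(SOFT_HYPHEN, " ").split()


def as_expected(text: str) -> list[str]:
    """Model the expected behavior: soft hyphens are dropped entirely."""
    return text.replace(SOFT_HYPHEN, "").split()


print(as_currently_spoken(word))
print(as_expected(word))
```

In the first model the synthesiser would receive four separate tokens, including a standalone "cia"; in the second it would receive one word.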

I confirm this bug.

RichCaloggero commented 3 years ago

The behavior I'm seeing is this:

You can change the symbol level in the symbol dictionary to whatever you want; this should help when you need to proofread.
The real issue is that any symbol always causes a word break, even if its replacement is set to the null string.
The behavior we'd like, especially for the soft hyphen case, is that when we set the replacement to the null string, NVDA processes the entire word as if the symbol didn't exist at all (and just uses its default word break characters).

We could enhance this to only omit the symbol completely when certain conditions are met, such as level is none or character and "send symbol to synthesizer" is set to never.

masi commented 2 years ago

No progress on this? Soft hyphens have been around for ages and are a must-have for many languages - ok, for German at least. Just because English tends to have short words the problem should not be dismissed. Maybe it isn't, after all the issue has not been closed.

seanbudd commented 1 year ago

See also #13668

thomasmoon commented 1 month ago

Same in Finnish and Swedish @masi, this is a big need, and it is really against all specs that they are pronounced.

Been hangin on to all hope that we can use soft hyphens which are so important while also maintaining our accessibility requirements that are surely so important for other Europeans at this point due to the new EC directive and all the languages with long compound words.

VoiceOver does it great, doesn't that put the fire under you to improve this product that so many people rely on? 🔥 😉

LeonarddeR commented 1 month ago

To summarize this issue, I think to bring this further, we need to do the following:

Change the level for the soft hyphen to character
Ensure that when level is character and send to synthesizer is never, the character is ignored completely when not reading by character.
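The two steps above could be sketched roughly as follows. This is a simplified model under stated assumptions, not NVDA's actual symbol processor: `Level`, `Preserve`, `Symbol`, `SYMBOLS` and `process` are hypothetical names, and the numeric values are arbitrary.

```python
from dataclasses import dataclass
from enum import IntEnum


class Level(IntEnum):
    """Hypothetical symbol levels; CHAR means 'only when reading by character'."""
    NONE = 0
    SOME = 1
    MOST = 2
    ALL = 3
    CHAR = 4


class Preserve(IntEnum):
    """Hypothetical 'send actual symbol to synthesizer' setting."""
    NEVER = 0
    ALWAYS = 1


@dataclass(frozen=True)
class Symbol:
    replacement: str
    level: Level
    preserve: Preserve


SYMBOLS = {
    "\u00ad": Symbol("soft hyphen", Level.CHAR, Preserve.NEVER),
    "-": Symbol("dash", Level.MOST, Preserve.ALWAYS),
}


def process(text: str, user_level: Level, by_character: bool = False) -> str:
    """Apply the symbol table to text for the given punctuation level."""
    out = []
    for ch in text:
        sym = SYMBOLS.get(ch)
        if sym is None:
            out.append(ch)  # ordinary character, passed through
        elif by_character or user_level >= sym.level:
            out.append(f" {sym.replacement} ")  # symbol is announced
        elif sym.preserve == Preserve.ALWAYS:
            out.append(ch)  # not announced, but the synth still gets it
        elif sym.level == Level.CHAR:
            continue  # proposed behavior: drop it, no word break
        else:
            out.append(" ")  # other unspoken symbols still break the word
    return "".join(out)


print(process("pro\u00adnun\u00adcia\u00adtion", Level.SOME))
```

With this model the soft hyphen is still announced during character navigation, but at every punctuation level it vanishes from running text without splitting the word.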

I'd personally leave braille out of the discussion for now, though my standpoint is still that this is the translator's responsibility, otherwise we're very likely getting into routing issues.

Adriani90 commented 1 month ago

Hmm, if the soft hyphen is still navigable in character-by-character navigation, this will still affect word-by-word navigation as well. I think it might be worth thinking about a checkbox in the browse mode settings to ignore soft hyphens completely when navigating through the virtual document. It is a small additional setting, I know, but it seems to have a big impact.

LeonarddeR commented 1 month ago

Adding an extra option to browse mode settings isn't as impactless as you may think. Filtering characters from TextInfo is never trivial, not even with browse mode. Furthermore, character navigation should represent reality. If there is a soft hyphen in the text, I want to see that with character nav, just because it is there.

Adriani90 commented 1 month ago

How is this displayed visually? Are there any visual spaces instead of the hyphens themselves? If yes, I agree with you. But still the UX will be confusing when people navigate word by word while the word is split into several parts.

DrSooom commented 1 month ago

In computing and typesetting, a soft hyphen (Unicode U+00AD SOFT HYPHEN ()) or syllable hyphen, is a code point reserved in some coded character sets for the purpose of breaking words across lines by inserting visible hyphens if they fall on the line end but remain invisible within the line.

Source: https://en.wikipedia.org/wiki/Soft_hyphen

Adriani90 commented 1 month ago

That means words with soft hyphens in the middle of the line are not stretched apart visually by any means? Is this correct?

DrSooom commented 1 month ago

That means words with soft hyphens in the middle of the line are not stretched apart visually by any means? Is this correct?

Yes, based on my knowledge from 20 years ago. I worked with soft hyphens visually during my time at business school. Additional quotation from Wikipedia:

It serves as an invisible marker used to specify a place in text where a hyphenated break is allowed without forcing a line break in an inconvenient place if the text is re-flowed. It becomes visible only after word wrapping at the end of a line.[4] The soft hyphen's Unicode semantics and HTML implementation are in many ways similar to Unicode's zero-width space, with the exception that the soft hyphen will preserve the kerning of the characters on either side when not visible. The zero-width space, on the other hand, will not, as it is considered a visible character even if not rendered, thus having its own kerning metrics.

Furthermore, there is an option in Microsoft Word 2010 (other versions not yet checked) and LibreOffice Writer 7.6 (other versions not yet checked) to toggle the on-screen visibility of specific characters like spaces, non-breaking spaces, tabs and soft hyphens. In other words: The end user (or creator) must be able to see them in order to check their correct position, and the end user (or consumer) must also be able not to see them, which makes it easier to visually read a document. The exact same option should be available in NVDA, as I already pointed out five years ago.

eigenstil commented 1 month ago

Here is a codepen (not by me) which illustrates the use of soft hyphens compared to "normal" hyphens: https://codepen.io/InSightGraphics/pen/KKaMEr

thomasmoon commented 1 month ago

How is this displayed visually? Are there any visual spaces instead of the hyphens themselves? If yes, I agree with you. But still the UX will be confusing when people navigate word by word while the word is split into several parts.

The soft hyphens are completely invisible under most circumstances but enable the word to wrap when space is limited, in which case the hyphen is shown, giving users a visual indication that the word has wrapped at the container boundary. In most cases this would not be useful information for non-visual users. Perhaps if a developer wants to inspect the characters, this could be enabled with an option like other editors' (Word, VS Code) "display special characters" (navigate soft hyphens), but for regular users the words should be read naturally, ignoring the characters completely.

A common case for these soft-hyphens would be in a heading text for example in compound words, so that on a mobile view longer words can be broken at the correct locations (between syllables or component words).

CSS hyphens: auto works fairly well in English, but not in other languages. Hyphen positioning is dictionary based and browsers have their own proprietary implementations. For important cases, it's possible to deliberately control the points at which a word wraps by manually inserting the &shy; character, but this is too much to expect from content managers, so using a library like Hypen is usually the best way to ensure breakpoints are applied consistently.

Soft hyphens are therefore a useful tool allowing the best possible display of content dynamically and responsively in different circumstances, but the fact that NVDA reads these aloud means they can't actually be used anywhere.

Adriani90 commented 1 month ago

@LeonarddeR reading the comments above, in this case I think there should be an optional setting in NVDA voice settings called "speak and navigate word wrapping characters" or something like that. This would at least be consistent with braille settings as well, where word wrapping can be turned on and off.

eigenstil commented 1 month ago

@LeonarddeR reading the comments above, in this case I think there should be an optional setting in NVDA voice settings called "speak and navigate word wrapping characters" or something like that. This would at least be consistent with braille settings as well, where word wrapping can be turned on and off.

I totally agree with this. And I would like to add that this setting should be switched OFF by default. This makes it possible to use the &shy; element in HTML without breaking the generated vocal output on a lot of websites in languages other than English.

LeonarddeR commented 1 month ago

As far as I can see, the only major issue with soft hyphens currently is that they break up words when speaking them. It's pretty evident that they shouldn't. Apart from that, I don't think anything should be done within the scope of this issue. Let's not make it more complex than necessary.

DrSooom commented 1 month ago

@LeonarddeR: I think fixing issue #9343 and issue #10634 at once would make more sense, and more work of course. But in the end we would be able to add more characters which are not visible visually, like the zero-width space (U+200B), but which are currently still sent to the TTS and to braille output. See: https://en.wikipedia.org/wiki/Zero-width_space

What we need is a list of characters similar to the symbols.dic (tsv file). These characters would be removed directly after the string is received by NVDA and before NVDA sends the string on to braille translation and to the speech symbol and word dictionaries. During this process, the total number of removed characters must be counted and their positions stored in a temporary array to fix braille routing problems.
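A minimal Python sketch of that idea, under the assumption that a simple per-character position list is enough for routing. `IGNORED`, `strip_ignored` and `to_original_offset` are hypothetical names and do not exist in NVDA.

```python
# Hypothetical user-editable list of characters to drop from speech and braille.
IGNORED = frozenset({"\u00ad", "\u200b"})  # soft hyphen, zero-width space


def strip_ignored(text: str, ignored: frozenset = IGNORED) -> tuple[str, list[int]]:
    """Remove ignored characters and remember where each kept character came from.

    Returns the cleaned string and a list where positions[i] is the index in
    the original text of the i-th character of the cleaned string.
    """
    kept = []
    positions = []
    for index, ch in enumerate(text):
        if ch in ignored:
            continue
        kept.append(ch)
        positions.append(index)
    return "".join(kept), positions


def to_original_offset(positions: list[int], cleaned_offset: int) -> int:
    """Map an offset in the cleaned text (e.g. a routed braille cell) back."""
    return positions[cleaned_offset]


text = "ap\u00adpli\u00adca\u00adtion"
cleaned, positions = strip_ignored(text)
removed_count = len(text) - len(cleaned)
print(cleaned, removed_count)
```

The position list is what a routing key press would consult, so that the cursor lands on the right character in the unmodified text even though the displayed text is shorter.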

The end user should be able to define which characters are not sent to speech and/or braille output by enabling or disabling their checkboxes. They should also be able to add and remove characters on this list of ignored characters, as is already the case with the NVDA GUI for the symbols.dic.

Word-by-word navigation with CTRL+ArrowLeft/ArrowRight would be another problem, which in my opinion cannot be fixed by NVDA at all. But if I remember correctly, when you pressed ArrowRight in Microsoft Word, the visible cursor didn't change its position while moving past a soft hyphen, as long as the option for visually showing soft hyphens was disabled, which is normally the case. But this behavior could have changed within the last 20 years. So, please check this, as I have used Microsoft Word/LibreOffice Writer less than five times a year since 2011. Therefore my memories regarding this could be incorrect or no longer correct.

CyrilleB79 commented 1 month ago

Please note that #13668 has been closed as a duplicate but contained useful information, more specifically the link provided in https://github.com/nvaccess/nvda/issues/13668#issuecomment-1122143209.

So I have just checked visually:

Unicode character 173 = 0xAD

This is the soft hyphen used in HTML (&shy;) as explained in the initial description of this issue.

In Firefox, it is not visible except when it is at the end of a line. It's worth noting that the end of the line in the virtual buffer does not match the end of the line displayed visually.
I have not tested other browsers.
In Word, this character is always visible and has no specific meaning.

Character 31 = 0x1F

It is a control character that Word calls "soft hyphen" and uses as such, i.e. it is not visible except when it is the last character of a line.

Conclusion

Please do not mix the two characters.
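For anyone who wants to check which of the two characters a given string contains, here is a small sketch using Python's standard `unicodedata` module. `describe` is a hypothetical helper name; the Word-specific meaning of 0x1F is taken from the observation above and is not something Unicode itself defines.

```python
import unicodedata


def describe(code_point: int) -> tuple[str, str, str]:
    """Return (U+XXXX label, Unicode name or placeholder, general category)."""
    ch = chr(code_point)
    label = f"U+{code_point:04X}"
    # Control characters have no Unicode name, so supply a default.
    name = unicodedata.name(ch, "<no name>")
    return label, name, unicodedata.category(ch)


for cp in (0xAD, 0x1F):
    print(describe(cp))
```

U+00AD is a named format character, while U+001F is an unnamed C0 control character, so a filter written for one will not catch the other.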
