nvaccess / nvda

NVDA, the free and open source Screen Reader for Microsoft Windows
https://www.nvaccess.org/

Unicode "negative squared latin" letters not picked up by normalisation algorithm #17120

Open kara-louise opened 2 months ago

kara-louise commented 2 months ago

Steps to reproduce:

Read the following line of characters

🅻🅸🅲🅴

Actual behavior:

The characters are not normalised. The spoken result depends on the synthesiser used. The Unicode character numbers are sent to a Braille display.

Expected behavior:

If normalisation is enabled, the characters should be normalised.

NVDA logs, crash dumps and other attachments:

n/a

System configuration

NVDA installed/portable/running from source:

installed

NVDA version:

NVDA version 2024.4beta2

Windows version:

Windows 10 Version 22H2 (OS Build 19045.4842)

Name and version of other software in use when reproducing the issue:

n/a

Other information about your system:

Other questions

Does the issue still occur after restarting your computer?

yes

Have you tried any other versions of NVDA? If so, please report their behaviors.

Same issue occurs with alpha-33832,9d15b169 (2025.1.0.33832)

If NVDA add-ons are disabled, is your problem still occurring?

yes

Does the issue still occur after you run the COM Registration Fixing Tool in NVDA's tools menu?

n/a

Adriani90 commented 2 months ago

@LeonarddeR is it possible to extend the algorithm? This impacts other enclosed alphanumeric supplement characters as well, e.g. regional indicator symbol letters, negative circled letters and some other symbols. Here is the complete list: https://en.wiktionary.org/wiki/Appendix:Unicode/Enclosed_Alphanumeric_Supplement

Most of the symbols in that list work perfectly though.

ABuffEr commented 2 months ago

Hi, not sure if this is useful, but still on the subject of extending the algorithm: I noticed that in NVDA (Python 3.11.9) unicodedata.unidata_version returns 14.0.0, while this package is currently at 15.1.0. Maybe it could be included as an external dependency, to keep everything up to date.

LeonarddeR commented 2 months ago

It is possible to extend the algorithm by expanding textUtils.unicodeNormalize. What's the idea behind these negative squared letters? I don't think an update of unicodedata will normalize them properly.
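
To illustrate the point about unicodedata, here is a minimal sketch. It assumes that textUtils.unicodeNormalize is built on NFKC normalisation from Python's standard unicodedata module, which this thread does not confirm. The negative squared letters have no compatibility decomposition in the Unicode Character Database, so NFKC leaves them untouched, whereas a character such as "𝑪" folds to a plain "C".

```python
import unicodedata

# The four characters from the report, written as escapes to avoid ambiguity.
negative_squared = "\U0001F17B\U0001F178\U0001F172\U0001F174"
# U+1D46A MATHEMATICAL BOLD ITALIC CAPITAL C
bold_italic_c = "\U0001D46A"

# Version of the Unicode Character Database bundled with this Python build.
print(unicodedata.unidata_version)

# A character with a compatibility decomposition is folded by NFKC.
print(repr(unicodedata.decomposition(bold_italic_c)))
print(unicodedata.normalize("NFKC", bold_italic_c))

# The negative squared letters report an empty decomposition...
for char in negative_squared:
    print(f"U+{ord(char):04X}", repr(unicodedata.decomposition(char)))

# ...so NFKC returns the string unchanged.
unchanged = unicodedata.normalize("NFKC", negative_squared) == negative_squared
print(unchanged)
```

Because the decomposition field itself is empty, a newer Unicode database would be expected to behave the same way for these characters.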

kara-louise commented 2 months ago

What's the idea behind these negative squared letters?

@LeonarddeR I assume that their original use is in scientific notation, since a lot of similar characters are prefixed with the word "mathematical". Not sure why this lot aren't, though. How they're mostly used these days (as in the example in my original comment) is as another sort of "fancy text", i.e. to give the appearance of formatted text in places where you can't use formatting, such as social media screen names. I originally saw what I pasted on Mastodon. There are websites that will convert what you type into the Unicode characters of your choosing, and presumably that was one of the options.

LeonarddeR commented 2 months ago

@ABuffEr this unicodedata package received its last update over a year ago. I think it is unlikely that that package will fix these cases anyway.

ABuffEr commented 2 months ago

@ABuffEr this unicodedata package received its last update over a year ago. I think it is unlikely that that package will fix these cases anyway.

I imagine that's because, according to this page, Unicode 15.1.0 is the latest version, released in 2023. On the other hand, 14.0.0 dates back to 2021. That said, I don't know whether this can make any difference here.

Adriani90 commented 2 months ago

@LeonarddeR the Enclosed Alphanumeric Supplement characters have been in Unicode since version 5.2, and the last symbols were added in 2020, so 14.0 should already contain all of them. They are usually used to make text stand out visually. They are also used very often in Japanese contexts, and in cases such as indicating country flags.

I guess if the normalization algorithm cannot handle them, we would have to add them to the symbols.dic file, right? There are probably about 70 symbols from this block that are not supported so far.
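
One way to check which characters of the block are affected is to list those without a compatibility decomposition. This is only a sketch: the block range U+1F100..U+1F1FF is taken from the Unicode charts, the helper name is hypothetical, and the result depends on the Unicode version bundled with the Python build in use.

```python
import unicodedata

# Enclosed Alphanumeric Supplement block: U+1F100..U+1F1FF.
BLOCK_START, BLOCK_END = 0x1F100, 0x1F1FF


def chars_without_decomposition() -> list[str]:
    """Return assigned characters in the block that NFKC cannot fold."""
    result = []
    for code_point in range(BLOCK_START, BLOCK_END + 1):
        char = chr(code_point)
        # unicodedata.name returns the default for unassigned code points.
        if unicodedata.name(char, None) is None:
            continue
        # An empty decomposition means NFKC leaves the character as it is.
        if not unicodedata.decomposition(char):
            result.append(char)
    return result


unsupported = chars_without_decomposition()
for char in unsupported:
    print(f"U+{ord(char):04X} {unicodedata.name(char)}")
print(len(unsupported), "characters have no decomposition")
```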

LeonarddeR commented 2 months ago

Adding them to symbols.dic is an option, yes, but that will still not normalize them when speaking by word or line. I'd personally create a supplementary Unicode normalization dictionary in code:

supplement = {
    "🅻": "L",
    # ...
}

Then feed that to str.maketrans and use the result of a call to str.translate in textUtils.unicodeNormalize.
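
A minimal sketch of that approach follows. The names SUPPLEMENT, SUPPLEMENT_TABLE and normalize_with_supplement are hypothetical, the mapping covers only the negative squared letters A to Z (U+1F170..U+1F189), and the use of NFKC as the underlying normalisation form is an assumption rather than NVDA's confirmed implementation.

```python
import unicodedata

# Hypothetical supplementary mapping for characters that NFKC does not fold.
# Only the negative squared letters A-Z (U+1F170..U+1F189) are covered here.
SUPPLEMENT = {
    chr(0x1F170 + offset): chr(ord("A") + offset) for offset in range(26)
}

# str.maketrans accepts a dict whose keys are single characters.
SUPPLEMENT_TABLE = str.maketrans(SUPPLEMENT)


def normalize_with_supplement(text: str) -> str:
    """Apply the supplementary mapping, then standard NFKC normalisation."""
    return unicodedata.normalize("NFKC", text.translate(SUPPLEMENT_TABLE))


# The four characters from the original report.
sample = "\U0001F17B\U0001F178\U0001F172\U0001F174"
print(normalize_with_supplement(sample))  # prints "LICE"
```

Building the table once at import time keeps the per-call cost to a single str.translate pass, and because the mapping runs on the whole string it also applies when reading by word or line.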

seanbudd commented 2 months ago

@LeonarddeR - I think these should just go in the symbols dictionary - I don't think a supplementary dictionary for normalization is going to be very maintainable

CyrilleB79 commented 2 months ago

Before discussing a solution, let's focus on the expected result.

Regarding the character "🅻":

SaschaCowley commented 2 months ago

Personally, my preference would probably be to have it replaced by "L" when doing anything but reading by character, but when reading by character have it read as "negative squared latin L".

XLTechie commented 2 months ago

Agreed with @SaschaCowley, that seems the most logical way of covering the needs of most users.
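
The behaviour proposed above could be sketched roughly as follows. The function name, the by_character flag and the mapping are hypothetical and do not reflect how NVDA's speech pipeline is actually structured; the name returned is the official Unicode character name rather than a shortened or translated form.

```python
import unicodedata

# Hypothetical mapping: negative squared letters A-Z (U+1F170..U+1F189).
NEGATIVE_SQUARED = {chr(0x1F170 + i): chr(ord("A") + i) for i in range(26)}


def text_to_speak(text: str, by_character: bool) -> str:
    """Return the Unicode name for a single mapped character, else fold it."""
    if by_character and text in NEGATIVE_SQUARED:
        # Reading by character: keep the detail that the letter is styled.
        return unicodedata.name(text)
    # Reading by word or line: replace each styled letter with its base letter.
    return "".join(NEGATIVE_SQUARED.get(char, char) for char in text)


print(text_to_speak("\U0001F17B", by_character=True))
print(text_to_speak("\U0001F17B\U0001F178\U0001F172\U0001F174", by_character=False))
```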

kara-louise commented 1 month ago

The Unicode ASCII add-on by Sukil Etxenike from the Spanish add-ons store is able to sort of normalise the above characters. I say "sort of" because, for some reason, they appear as "[L][I][C][E]". I don't know what that add-on is doing differently from other normalisation tools such as the Unicode Normalization Test Page, which can't normalise them. So it might be worth investigating that add-on's source code to see how it works.

Adriani90 commented 1 month ago

Personally, my preference would probably be to have it replaced by "L" when doing anything but reading by character, but when reading by character have it read as "negative squared latin L".

That would be inconsistent with the other normalized alphanumeric characters. The full Unicode name can be retrieved by an add-on, as we currently do with other normalized characters.

CyrilleB79 commented 1 month ago

Personally, my preference would probably be to have it replaced by "L" when doing anything but reading by character, but when reading by character have it read as "negative squared latin L".

Do you mean with normalization on? This would be inconsistent with the process applied to other normalized characters such as "𝑪". When normalization is enabled, we would expect for "🅻":

Adding "🅻" to a symbol file may achieve something interesting for the user, but it is not a way to get a UX similar to that of characters for which normalization already works.

So if we want to discuss what should be heard when normalization is off (e.g. "negative square letter L"), I'd recommend doing that in a separate issue.

LeonarddeR commented 1 month ago

If these characters go in the symbol dictionary, this has nothing to do with normalization, and normalization will not work when reading by word or line.

ABuffEr commented 1 month ago

Personally, my preference would probably be to have it replaced by "L" when doing anything but reading by character, but when reading by character have it read as "negative squared latin L".

Do you mean with normalization on? This would be inconsistent with the process applied to other normalized characters such as "𝑪".

In fact, you get "normalized C", completely missing that it is a bold italic styled C. Too flattened, in my opinion. So, if I understand correctly, I agree with @SaschaCowley, even if it requires a change from the current behavior.

Adriani90 commented 1 month ago

In fact, you get "normalized C", completely missing that is a Bold Italic styled C.

That's exactly the expected behavior in this case. Bold, script, squared, circled or whatever else are details that are usually totally irrelevant when reading text with a screen reader, because in these special cases the properties serve only visual purposes. No one would pronounce these characters with their full Unicode names. I agree that in some use cases, for example if you are writing a publication yourself and need these characters to meet sighted users' needs, you do need these properties, but then you can use the character info add-on to get the full Unicode name. Pronouncing characters according to the Unicode standard is too verbose. That is the experience in JAWS, and to be honest it is horrible to explore a publication with such alphanumeric characters and hear the whole Unicode names, even when navigating character by character, which is sometimes needed. So the current normalization style in NVDA is the most convenient way to handle these characters. But if this is not achievable for these alphanumeric supplement characters, I suggest we try symbols.dic and at least get the characters announced in some situations.

Adriani90 commented 1 month ago

An alternative would be to integrate the character info add-on into NVDA itself and report the full Unicode name on pressing e.g. NVDA+comma, but then we still don't have a database of Unicode names that is fully translated into several languages.

Adriani90 commented 1 month ago

Another alternative would be to retrieve the formatting of these characters from Unicode and include it in the NVDA+F command, so that you can also get the formatting of a character on demand, but I guess this would be a huge workload.

CyrilleB79 commented 1 month ago

@ABuffEr and @Adriani90, have you read https://github.com/nvaccess/nvda/issues/17120#issuecomment-2342791521?

The initial request is that the normalization (as currently implemented in NVDA) also work with some more characters, namely the negative squared letters.

If you wish to discuss other topics, please, please, open a new issue. These new topics include:

Thanks.

XLTechie commented 1 month ago

Do we need three normalizing modes? Off, full name when reading by character, fully normalized always (the current behavior)?

Adriani90 commented 1 month ago

So, back to the issue: "negative" in this context means that the letter has inverted colors, so negative letters are white on a dark background. However, even this detail is not important for the screen reader user when reading the text; it is something that could be announced on demand by retrieving the full Unicode name.

So if it is possible to include these letters in the normalization, in the same way as all the other alphanumeric letters already are, this would be ideal.