rust-lang / rust

Empowering everyone to build reliable and efficient software.
https://www.rust-lang.org

Number types' `FromStr` impl should recognize Unicode minus #130315

Open Enyium opened 1 month ago

Enyium commented 1 month ago

You can say the following about the character currently exclusively recognized as minus by FromStr impls:

In comparison, U+2212 is a dedicated minus sign:

Benefits of adding support:

I'm not familiar with this, but I want to point out that the Wikipedia article "Plus and minus signs" also talks about ⁒ as a minus sign (U+2052 COMMERCIAL MINUS SIGN). Perhaps, this should also be supported. But I don't know whether it's regularly set off from the number with some space character.

workingjubilee commented 1 month ago

In comparison, U+2212 is a dedicated minus sign:

  • GitHub, e.g., uses it to display the red "lines deleted" values. Wikipedia and LaTeX equation renderings also use it.

I don't think this would weigh in as a positive factor for us adding it to FromStr for the signed numeric types. The opposite, actually.

Enyium commented 1 month ago

It means you see this character in the wild. In #49746, someone said:

I accidentally used a unicode minus sign (−) instead of a dash (-). This happened to me when I pasted a constant from Wikipedia.

This could also happen with input users paste into Rust apps.
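To make the paste scenario concrete, here is a minimal sketch of how the standard library treats the two characters while this issue is unresolved. `std_accepts` is a hypothetical helper written for illustration, not a std API:

```rust
/// Hypothetical helper: reports whether `input` parses as an `i32` and as an
/// `f64` using the standard library's existing `FromStr` impls.
fn std_accepts(input: &str) -> (bool, bool) {
    (input.parse::<i32>().is_ok(), input.parse::<f64>().is_ok())
}
```

With U+002D HYPHEN-MINUS, `std_accepts("-20")` returns `(true, true)`. With U+2212 MINUS SIGN, `std_accepts("\u{2212}20")` returns `(false, false)`, so a value pasted from Wikipedia is rejected by both the integer and the float impl.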

Why would it even be an argument for the opposite?!

workingjubilee commented 1 month ago

Because if you copy it from a diff, then for a diff that looks like

- 20
+ 45

these should not be parsed as the integers [-20, 45].

Enyium commented 1 month ago

You have spaces there between sign and number. Even with a hyphen as a minus sign, this currently gives Err(ParseIntError { kind: InvalidDigit }), and I wasn't proposing to change that.

In any case, the end user would need to have a basic sense of what they're copying and where they're pasting it.
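The behavior described here can be checked with a small sketch. `parse_i32_kind` is a hypothetical helper that keeps only the error kind so results are easy to compare:

```rust
use std::num::IntErrorKind;

/// Hypothetical helper: parses `input` as an `i32` and, on failure, returns
/// only the `IntErrorKind` of the resulting `ParseIntError`.
fn parse_i32_kind(input: &str) -> Result<i32, IntErrorKind> {
    input.parse::<i32>().map_err(|e| e.kind().clone())
}
```

`parse_i32_kind("- 20")` and `parse_i32_kind("+ 45")` both return `Err(IntErrorKind::InvalidDigit)` because of the space after the sign, while `parse_i32_kind("-20")` returns `Ok(-20)` and `parse_i32_kind("+45")` returns `Ok(45)`.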

workingjubilee commented 1 month ago

@Enyium That was only for the sake of readability, a diff can also include

-20
+45

workingjubilee commented 1 month ago

I am only making this observation because I do think your request is reasonable, and I am slightly perplexed why you included extraneous data that seems like it could undermine the strength of your proposal.

I'm not familiar with this, but I want to point out that the Wikipedia article "Plus and minus signs" also talks about ⁒ as a minus sign (U+2052 COMMERCIAL MINUS SIGN). Perhaps, this should also be supported. But I don't know whether it's regularly set off from the number with some space character.

There are many alternative numeric notations. There are many alternative "commercial minus signs", not just that one. Almost invariably, such graphemes tend to have many subtle variations or reuses. Extending FromStr beyond the set of actual "this is a minus sign that looks like the minus sign that Rust already recognizes" would allow people to FromStr something that looks like %20 and get -20 instead of, say, 0.2, which could be what they expect, incorrectly or no. And if they come from a context that is not European, they may not expect that glyph but another glyph to be interpreted as a "minus sign", and then we have to be locale-aware, and... well...

I think it would be inappropriate for Rust to attempt to guess which exact cultural context the FromStr impls must live in. In general Rust has strived to be Unicode-aware but locale-agnostic, deferring locale-sensitive tasks to libraries like icu4x. It seems that, in the spirit of this, it would be in our interest to recognize a set of alternative minuses that represent effectively the same symbol, i.e. a different code point but semantically identical and often rendered the same. And yes, many fonts render hyphen-minus identically to U+2212, it's not like there's a law against doing so.
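Until std changes, the narrow version of this idea can be approximated in user code. The following is only a sketch under the assumption that U+2212 is the sole extra sign to accept; `parse_i32_with_minus_sign` is a hypothetical name, not a proposed API:

```rust
use std::num::ParseIntError;

/// Hypothetical helper (not a std API): treats a leading U+2212 MINUS SIGN
/// like U+002D HYPHEN-MINUS, then defers to the existing `FromStr` impl.
fn parse_i32_with_minus_sign(input: &str) -> Result<i32, ParseIntError> {
    match input.strip_prefix('\u{2212}') {
        // Re-prefix with the ASCII sign so std still does the real parsing,
        // including overflow checks and rejection of a second sign or a space.
        Some(rest) => format!("-{rest}").parse(),
        None => input.parse(),
    }
}
```

Only a leading sign is rewritten, so `"\u{2212} 20"` (with a space) and `"\u{2052}20"` (commercial minus) are still rejected, and inputs that already parse today keep their current result.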

Enyium commented 1 month ago

That was only for the sake of readability, a diff can also include

-20
+45

+45 is also already parsed as the number 45, right? Why would that be an argument against supporting something that is by definition the minus sign? (Also, your -20 contains a hyphen, which would already be parsed as the number −20 if someone were to paste it somewhere that causes Rust's FromStr::from_str() to run.)

I have no problem with U+2052 COMMERCIAL MINUS SIGN (⁒) not gaining support. I just saw it in the Wikipedia article. If it was warranted to support this, which I don't know, adding support for U+2212 MINUS SIGN would be a good time to add support for this also.

would allow people to FromStr something that looks like %20 and get -20

I can't follow you there. Nobody talked about the percent sign. You'd only see U+2052 COMMERCIAL MINUS SIGN (⁒) when it was intended to be used in the minus role (or when having it to do with gibberish).

In general Rust has strived to be Unicode-aware but locale-agnostic

At least supporting U+2212 MINUS SIGN should harmonize with that.

it would be in our interest to recognize a set of alternative minuses that represent effectively the same symbol, i.e. a different code point but semantically identical

That's what my issue is about.

I don't know whether ⁒ is used as a sign for negative numbers or only an operator between operands. In the first case, and if it's never set off from the number with a space, maybe support would be warranted.

And yes, many fonts render hyphen-minus identically to U+2212, it's not like there's a law against doing so.

This stood out to me on SoundCloud. The font that they use for remaining play time has a relatively long dash; but it's just a hyphen code-point-wise. But in my perception, my statement holds true for the majority of fonts.

workingjubilee commented 1 month ago

Also, your -20 contains a hyphen, which would already be parsed as the number −20 if someone were to paste it somewhere that causes Rust's FromStr::from_str() to run.

Does it? I copied it out of the GitHub UI.

Enyium commented 1 month ago

In diffs, GitHub uses the hyphen (like this code point is also used in code instead of fancy characters); but this text is also in a monospace font. On a page like this, I was referring to the red number on the top right (not in a monospace font).

workingjubilee commented 1 month ago

Ah, I see. I suppose I misunderstood, then.

Anyway, the problem with the "commercial minus sign" is that the glyphs that semantically mean commercial minus sign include e.g. △ and ▲ if Wikipedia is to be believed. But I know that above and beyond such meanings, those glyphs definitely have a wide variety of other meanings attributed to them, including in the language which supposedly uses them as commercial minus signs (Japanese).

And Wikipedia goes on to state this about the obelus-like symbol in question:

The symbol is also used in the margins of letters to indicate an enclosure, where the upper point is sometimes replaced with the corresponding number.[1]

The Uralic Phonetic Alphabet uses commercial minus signs to denote borrowed forms of a sound.[1]

In Finland, it is used as a symbol for a correct response (the check mark indicates an incorrect response).[1][5]

So regarding this:

I can't follow you there. Nobody talked about the percent sign. You'd only see U+2052 COMMERCIAL MINUS SIGN (⁒) when it was intended to be used in the minus role (or when having it to do with gibberish).

I, personally, would hesitate to suggest that the Finnish deal in gibberish.

Enyium commented 1 month ago

Okay, it's rather strange that something defined as COMMERCIAL MINUS SIGN is also used in these other manners. So, in the spirit of not supporting in a narrow use case something with such a variety of uses, this code point can be ruled out for support, it seems.

But could I win you over regarding the support for U+2212 MINUS SIGN?

CAD97 commented 1 month ago

it would be in our interest to recognize a set of alternative minuses that represent effectively the same symbol, i.e. a different code point but semantically identical and often rendered the same

If that's the goal, it might make sense to use the Unicode compatibility equivalence relation between characters. UAX #15 §1.1:

Compatibility equivalence is a weaker type of equivalence between characters or sequences of characters which represent the same abstract character (or sequence of abstract characters), but which may have distinct visual appearances or behaviors.

That sounds like the property you're describing, and means we don't have to determine our own set. Instead, we would essentially parse from the NFKC normalization of the input string.

I didn't check with any implementation, but visually searching UnicodeData.txt (I did not look at the context-sensitive mappings in SpecialCasing.txt) I believe:

[^sup]: Unicode 1.0 called it SUPERSCRIPT HYPHEN-MINUS
[^sub]: Unicode 1.0 called it SUBSCRIPT HYPHEN-MINUS

Although further inspection shows that “smart quotes” aren't considered compatible with "straight quotes" either, so despite my first thought maybe NFKC isn't the correct data to be considering for this purpose after all.
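A rough sketch of why compatibility folding alone would not cover U+2212. The mapping below is a hand-copied excerpt and an assumption of this example, not generated from UnicodeData.txt and not necessarily the list referred to above; real NFKC also rewrites many other characters (fullwidth digits, for instance) and would normally come from a normalization library rather than a `match`. Both function names are hypothetical:

```rust
use std::num::ParseIntError;

/// Hand-copied excerpt (assumption): single-character compatibility mappings
/// for a few minus-like code points.
fn compat_fold(c: char) -> char {
    match c {
        // SMALL HYPHEN-MINUS and FULLWIDTH HYPHEN-MINUS fold to U+002D.
        '\u{FE63}' | '\u{FF0D}' => '-',
        // SUPERSCRIPT MINUS and SUBSCRIPT MINUS fold to U+2212, not U+002D.
        '\u{207B}' | '\u{208B}' => '\u{2212}',
        // U+2212 MINUS SIGN has no compatibility decomposition of its own.
        other => other,
    }
}

/// Hypothetical helper: applies the excerpt above, then parses with std.
fn parse_i32_after_fold(input: &str) -> Result<i32, ParseIntError> {
    input.chars().map(compat_fold).collect::<String>().parse()
}
```

Under these assumptions, `"\u{FF0D}20"` parses as `-20` after folding, but `"\u{2212}20"` and `"\u{207B}20"` still fail, because folding never turns U+2212 into U+002D. Accepting U+2212 would therefore need an explicit rule on top of any normalization step.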


As a minimal bar, ICU does permit changing the character used as the negative affix for their number formatter; asking to recognize alternate negative signs is within the reality of what Unicode recognizes (but the default is still U+002D).

I was, however, unable to locate information on what alternate minus sign affixes are actually in use by locale data, or Unicode information on parsing numbers from text instead of formatting numbers to text. The information probably exists and should be referenced here, but I ran out of time to continue looking for it.

workingjubilee commented 1 month ago

Okay, it's rather strange that something defined as COMMERCIAL MINUS SIGN is also used in these other manners.

In that regard I cannot help but agree. Human behavior is very strange.