Open Enyium opened 1 month ago
In comparison, U+2212 is a dedicated minus sign:
- GitHub, e.g., uses it to display the red "lines deleted" values. Wikipedia and LaTeX equation renderings also use it.
I don't think this would weigh in as a positive factor for us adding it to FromStr for the signed numeric types. The opposite, actually.
It means you see this character in the wild. In #49746, someone said:
I accidentally used a unicode minus sign (−) instead of a dash (-). This happened to me when I pasted a constant from Wikipedia.
This could also happen with input users paste into Rust apps.
Why would it even be an argument for the opposite?!
Because if you copy it from a diff, then for a diff that looks like
- 20
+ 45
these should not be parsed as the integers [-20, 45].
You have spaces there between sign and number. Even with a hyphen as a minus sign, this currently gives Err(ParseIntError { kind: InvalidDigit }), and I wasn't proposing to change that.
In any case, the end user would need to have a basic sense of what they're copying and where they're pasting it.
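For reference, the current behavior discussed above can be checked with a small sketch. parse_kind is a hypothetical helper name introduced only for this illustration; it uses nothing beyond the standard library's str::parse and IntErrorKind.

```rust
use std::num::IntErrorKind;

/// Hypothetical helper: parses `s` as an `i32` and, on failure,
/// returns only the error kind so it is easy to compare.
fn parse_kind(s: &str) -> Result<i32, IntErrorKind> {
    s.parse::<i32>().map_err(|e| e.kind().clone())
}

fn main() {
    // A leading hyphen-minus (U+002D) or plus sign is accepted.
    println!("{:?}", parse_kind("-20"));
    println!("{:?}", parse_kind("+45"));
    // A space between sign and digits is rejected.
    println!("{:?}", parse_kind("- 20"));
    // U+2212 MINUS SIGN is currently rejected as well.
    println!("{:?}", parse_kind("\u{2212}20"));
}
```

The last two inputs both fail with the InvalidDigit kind, so an app that only forwards the error cannot tell the end user which of the two problems occurred.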
@Enyium That was only for the sake of readability, a diff can also include
-20
+45
I am only making this observation because I do think your request is reasonable, and I am slightly perplexed why you included extraneous data that seems like it could undermine the strength of your proposal.
I'm not familiar with this, but I want to point out that the Wikipedia article "Plus and minus signs" also talks about ⁒ as a minus sign (U+2052 COMMERCIAL MINUS SIGN). Perhaps this should also be supported. But I don't know whether it's regularly set off from the number with some space character.
There are many alternative numeric notations. There are many alternative "commercial minus signs", not just that one. Almost invariably, such graphemes tend to have many subtle variations or reuses. Extending FromStr beyond the set of actual "this is a minus sign that looks like the minus sign that Rust already recognizes" would allow people to FromStr something that looks like %20 and get -20 instead of, say, 0.2, which could be what they expect, incorrectly or not. And if they come from a context that is not European, they may not expect that glyph but another glyph to be interpreted as a "minus sign", and then we have to be locale-aware, and... well...
I think it would be inappropriate for Rust to attempt to guess the exact cultural context that the FromStr impls must live in. In general Rust has strived to be Unicode-aware but locale-agnostic, deferring locale-sensitive tasks to libraries like icu4x. It seems that, in the spirit of this, it would be in our interest to recognize a set of alternative minuses that represent effectively the same symbol, i.e. a different code point but semantically identical and often rendered the same. And yes, many fonts render hyphen-minus identically to U+2212, it's not like there's a law against doing so.
That was only for the sake of readability, a diff can also include
-20
+45
+45 is also already parsed as the number 45, right? Why would that be an argument against supporting something being by definition the minus sign? (Also, your -20 contains a hyphen, which would already be parsed as the number −20, if someone were to paste it somewhere where Rust's FromStr::from_str() would be caused to run.)
I have no problem with U+2052 COMMERCIAL MINUS SIGN (⁒) not gaining support. I just saw it in the Wikipedia article. If it were warranted to support this, which I don't know, adding support for U+2212 MINUS SIGN would be a good time to add support for this as well.
would allow people to FromStr something that looks like %20 and get -20
I can't follow you there. Nobody talked about the percent sign. You'd only see U+2052 COMMERCIAL MINUS SIGN (⁒) when it was intended to be used in the minus role (or when having it to do with gibberish).
In general Rust has strived to be Unicode-aware but locale-agnostic
At least supporting U+2212 MINUS SIGN should harmonize with that.
it would be in our interest to recognize a set of alternative minuses that represent effectively the same symbol, i.e. a different code point but semantically identical
That's what my issue is about.
I don't know whether ⁒ is used as a sign for negative numbers or only an operator between operands. In the first case, and if it's never set off from the number with a space, maybe support would be warranted.
And yes, many fonts render hyphen-minus identically to U+2212, it's not like there's a law against doing so.
This stood out to me on SoundCloud. The font that they use for remaining play time has a relatively long dash; but it's just a hyphen code-point-wise. But in my perception, my statement holds true for the majority of fonts.
Also, your -20 contains a hyphen, which would already be parsed as the number −20, if someone were to paste it somewhere where Rust's FromStr::from_str() would be caused to run.
Does it? I copied it out of the GitHub UI.
Ah, I see. I suppose I misunderstood, then.
Anyway, the problem with the "commercial minus sign" is that the glyphs that semantically mean commercial minus sign include e.g. △ and ▲ if Wikipedia is to be believed. But I know that above and beyond such meanings, those glyphs definitely have a wide variety of other meanings attributed to them, including in the language which supposedly uses them as commercial minus signs (Japanese).
And Wikipedia goes on to state this about the obelus-like symbol in question:
The symbol is also used in the margins of letters to indicate an enclosure, where the upper point is sometimes replaced with the corresponding number.[1]
The Uralic Phonetic Alphabet uses commercial minus signs to denote borrowed forms of a sound.[1]
In Finland, it is used as a symbol for a correct response (the check mark indicates an incorrect response).[1][5]
So regarding this:
I can't follow you there. Nobody talked about the percent sign. You'd only see U+2052 COMMERCIAL MINUS SIGN (⁒) when it was intended to be used in the minus role (or when having it to do with gibberish).
I, personally, would hesitate to suggest that the Finnish deal in gibberish.
Okay, it's rather strange that something defined as COMMERCIAL MINUS SIGN is also used in these other manners. So, in the spirit of not supporting in a narrow use case something with such a variety of uses, this code point can be ruled out for support, it seems.
But could I win you over regarding the support for U+2212 MINUS SIGN?
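Until something like this is decided for std, an application can get the proposed behavior on its own. The following is a minimal sketch, assuming that only a single leading U+2212 should be treated like U+002D; parse_accepting_minus_sign is a hypothetical helper, not an existing or planned std API.

```rust
use std::str::FromStr;

/// Hypothetical helper (not part of std): treats one leading
/// U+2212 MINUS SIGN like U+002D HYPHEN-MINUS, then defers to the
/// target type's own `FromStr` implementation for everything else.
fn parse_accepting_minus_sign<T: FromStr>(s: &str) -> Result<T, T::Err> {
    match s.strip_prefix('\u{2212}') {
        // Re-prefix with the ASCII hyphen-minus that `FromStr` already accepts.
        Some(rest) => format!("-{rest}").parse(),
        // No Unicode minus sign in front: parse the input unchanged.
        None => s.parse(),
    }
}

fn main() {
    println!("{:?}", parse_accepting_minus_sign::<i32>("\u{2212}20"));
    println!("{:?}", parse_accepting_minus_sign::<f64>("\u{2212}1.5"));
    // Spaces after the sign and other symbols such as U+2052 still fail.
    println!("{:?}", parse_accepting_minus_sign::<i32>("\u{2212} 20"));
    println!("{:?}", parse_accepting_minus_sign::<i32>("\u{2052}20"));
}
```

Because the helper only rewrites the first character, inputs with a space between sign and digits, or with a doubled sign, are still rejected by the underlying parser.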
it would be in our interest to recognize a set of alternative minuses that represent effectively the same symbol, i.e. a different code point but semantically identical and often rendered the same
If that's the goal, it might make sense to use the Unicode compatibility equivalence relation between characters. UAX #15 §1.1:
Compatibility equivalence is a weaker type of equivalence between characters or sequences of characters which represent the same abstract character (or sequence of abstract characters), but which may have distinct visual appearances or behaviors.
That sounds like the property you're describing, and means we don't have to determine our own set. Instead, we would essentially parse from the NFKC normalization of the input string.
I didn't check with any implementation, but visually searching UnicodeData.txt (I did not look at the context-sensitive mappings in SpecialCasing.txt) I believe:
[^sup]: Unicode 1.0 called it SUPERSCRIPT HYPHEN-MINUS
[^sub]: Unicode 1.0 called it SUBSCRIPT HYPHEN-MINUS
Although further inspection shows that “smart quotes” aren't considered compatible with "straight quotes" either, so despite my first thought maybe NFKC isn't the correct data to be considering for this purpose after all.
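To make the compatibility-equivalence idea concrete, here is a minimal sketch that folds only sign characters before parsing. It uses a hand-copied subset of compatibility decompositions, not a real NFKC implementation; the four mappings are an assumption taken from the decomposition fields in UnicodeData.txt, and fold_compat_sign and parse_folded are hypothetical names.

```rust
use std::num::ParseIntError;

/// Hypothetical, hand-copied subset of compatibility decompositions
/// (assumed from UnicodeData.txt). A real implementation would apply
/// full NFKC normalization instead of this table.
fn fold_compat_sign(c: char) -> char {
    match c {
        // SMALL HYPHEN-MINUS and FULLWIDTH HYPHEN-MINUS fold to U+002D.
        '\u{FE63}' | '\u{FF0D}' => '\u{002D}',
        // SUPERSCRIPT MINUS and SUBSCRIPT MINUS fold to U+2212, not U+002D.
        '\u{207B}' | '\u{208B}' => '\u{2212}',
        other => other,
    }
}

/// Folds sign characters, then parses with the existing `FromStr` impl.
fn parse_folded(s: &str) -> Result<i32, ParseIntError> {
    let folded: String = s.chars().map(fold_compat_sign).collect();
    folded.parse()
}

fn main() {
    // The fullwidth hyphen-minus becomes parseable after folding.
    println!("{:?}", parse_folded("\u{FF0D}20"));
    // U+2212 has no compatibility mapping to U+002D, so it still fails.
    println!("{:?}", parse_folded("\u{2212}20"));
}
```

Under these assumptions, compatibility folding alone would not make U+2212 MINUS SIGN parse, which is in line with the doubt about NFKC expressed above.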
As a minimal bar, ICU does permit changing the character used as the negative affix for their number formatter; asking to recognize alternate negative signs is within the reality of what Unicode recognizes (but the default is still U+002D).
I was, however, unable to locate information on what alternate minus sign affixes are actually in use by locale data, or Unicode information on parsing numbers from text instead of formatting numbers to text. The information probably exists and should be referenced here, but I ran out of time to continue looking for it.
Okay, it's rather strange that something defined as COMMERCIAL MINUS SIGN is also used in these other manners.
In that regard I cannot help but agree. Human behavior is very strange.
You can say the following about the character currently exclusively recognized as minus by FromStr impls:
U+002D HYPHEN-MINUS : hyphen, dash, minus sign (copied from BabelMap).
In comparison, U+2212 is a dedicated minus sign:
U+2212 MINUS SIGN
− for it.
Benefits of adding support:
- If FromStr implementations of number types (i32, f64 etc.) would support U+2212 MINUS SIGN in addition to U+002D HYPHEN-MINUS as a minus sign, UI frameworks, e.g., would have an easier time implementing text boxes that display the typographically more pleasing real minus sign, simply converting the text content to the corresponding number.
- Result, wouldn't confuse end users anymore when they pasted a number with Unicode minus into the app and the app showed an error.