Chinese numerals are not recognized by char::is_numeric

wooster0 commented 3 years ago

I tried this code:

fn main() {
    assert!('一'.is_numeric());
}

I expected it to evaluate to true.

Instead, it evaluated to false.

I would expect at least 零/〇、一、二、三、四、五、六、七、八、九 (0-9) to be recognized. In other numeral systems, like the Arabic numerals, numbers after 9 no longer fit into a char and thus can't be recognized, but with Chinese numerals there are many other numbers beyond 0-9 that are represented with a single character too, for example 10: 十, which could still be a char. I'm not sure whether this should be recognized, but perhaps it should. There are also financial numbers and many others, see https://en.wikipedia.org/wiki/Chinese_numerals#Standard_numbers for a comprehensive list.

I've been told that the numerals are covered in the UnicodeData.txt file mentioned in the docs of char::is_numeric, but they are listed in the Lo category, which stands for Other Letter, and so Rust doesn't consider them numeric. That doesn't make sense to me because clearly they are numerals and not letters. Rust should probably either recognize (some parts of) this category as numerals or the numerals should be added manually.
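A minimal sketch of the reported behavior, for readers who want to reproduce it. The helper name `numeric_flags` is made up for illustration; only `char::is_numeric` comes from std.

```rust
/// Pairs each character with the result of `char::is_numeric`.
/// `numeric_flags` is a hypothetical helper, not a std API.
pub fn numeric_flags(chars: &[char]) -> Vec<(char, bool)> {
    chars.iter().map(|&c| (c, c.is_numeric())).collect()
}
```

With the behavior described in this issue, `numeric_flags(&['7', '①', '一'])` reports true for '7' and '①' (general categories Nd and No) and false for '一' (general category Lo).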

Adding support for this would in turn also mean support for numerals of other East Asian languages, like Japanese and Hokkien.

Meta

This happens on the stable 1.51.0 channel and all others.

CryZe commented 3 years ago

It seems like Unicode classifies it as a Lo / Letter Other with no numeric value. So I don't think this is a Rust specific problem and rather a problem with the Unicode standard (if even).

ChrisDenton commented 3 years ago

Unfortunately I think this would be a breaking change even if it is desirable. As you say, Unicode classifies them as "other letters, including syllables and ideographs" instead of one of the number categories (see Unicode categories). The Rust documentation says specifically that the code point must be in the Nd, Nl or No categories so any change to that would be breaking the current Rust API.

The only way I see around that is to do one of the following:

Persuade Unicode to reclassify them for the next version of the Unicode specification.
Create a new standard library function that isn't tied to the Unicode specification.
Use a third-party crate to properly classify numbers regardless of the Unicode specification or Rust's std implementation. This could also handle numbers that take more than one code point.

eggyal commented 3 years ago

Whilst the General_Category of these characters is indeed Lo, their Numeric_Type property is Numeric. Perhaps that's what should be inspected instead, although as @ChrisDenton says this would be a breaking change to Rust.

wooster0 commented 3 years ago

Whilst the General_Category of these characters is indeed Lo, their Numeric_Type property is Numeric. Perhaps that's what should be inspected instead

Then I suppose this is probably more of a Rust-side issue than Unicode's problem.

Persuade Unicode to reclassify them for the next version of the Unicode specification.

I contacted them about this.

Create a new standard library function that isn't tied to the Unicode specification.

What would this method be called? I believe it would cause more confusion than do good.

Yes, I suppose this will mean a breaking change, but is there not some kind of special exception for bugs? This is not a new feature but an important bug and correctness matters. I don't think we should leave this bug be just because it might break something. We should also consider how much of a breaking change this really is.

I think what we can do maybe is to basically add a warning for some time to all is_numeric calls to tell the user about the incoming change and tell them that they should check if that doesn't break something for them?

But really, I don't know what this could actually break except for some very specific cases. I believe it would rather fix more things than break things. Also, we have the next edition.

Maybe this can make it into the next 2021 Edition? I think that makes this "breaking change" even more acceptable to be done. Maybe the warning for is_numeric calls can exist a few releases before the 2021 edition.
Or perhaps when the user switches to the 2021 edition for the very first time (and that special behavior only exists in the first release of the new edition, or for a few more than that) there will be some kind of "special" dialog/message for is_numeric calls and the compiler tells the user to check those calls and make sure nothing breaks there? Perhaps that's a little too complicated and there are better ways, but I'm sure there's a way to get this change done.

ChrisDenton commented 3 years ago

What's been done before is for a new function to be created with a similar name and the old function be deprecated. Perhaps is_num, is_numeric_type, has_numeric_property or something else. Unless the libs team decides there's enough wiggle room to change the current function.

In either case the new functionality should, I think, be based on DerivedNumericType.txt which also tells how they should be derived from other Unicode data. DerivedNumericValues lists each value with the numeric value property.
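A rough sketch of what a lookup driven by DerivedNumericType.txt could look like. Everything here is an assumption made for illustration: the names `NumericType`, `parse_numeric_types`, `numeric_type` and `SAMPLE` are hypothetical, the sample lines are hand-written in the file's `range ; value # comment` layout rather than copied from the real file, and a linear scan stands in for whatever compact table std would actually use.

```rust
use std::ops::RangeInclusive;

/// The three non-default values of the Numeric_Type property.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum NumericType {
    Decimal,
    Digit,
    Numeric,
}

/// Hand-written sample lines (an assumption, not an excerpt of the real file).
pub const SAMPLE: &str = "\
# range ; value # comment
0030..0039 ; Decimal # DIGIT ZERO..DIGIT NINE
00B2..00B3 ; Digit   # SUPERSCRIPT TWO..SUPERSCRIPT THREE
4E00       ; Numeric # CJK UNIFIED IDEOGRAPH-4E00
5E7A       ; Numeric # CJK UNIFIED IDEOGRAPH-5E7A
";

/// Parses `XXXX ; Value` and `XXXX..YYYY ; Value` lines, skipping anything else.
pub fn parse_numeric_types(data: &str) -> Vec<(RangeInclusive<u32>, NumericType)> {
    let mut table = Vec::new();
    for line in data.lines() {
        // Drop the trailing comment, then skip lines with nothing left.
        let line = line.split('#').next().unwrap_or("").trim();
        if line.is_empty() {
            continue;
        }
        let mut fields = line.split(';').map(str::trim);
        let (Some(range), Some(kind)) = (fields.next(), fields.next()) else {
            continue;
        };
        let kind = match kind {
            "Decimal" => NumericType::Decimal,
            "Digit" => NumericType::Digit,
            "Numeric" => NumericType::Numeric,
            _ => continue,
        };
        // A single code point is treated as a one-element range.
        let (start, end) = range.split_once("..").unwrap_or((range, range));
        let (Ok(start), Ok(end)) =
            (u32::from_str_radix(start, 16), u32::from_str_radix(end, 16))
        else {
            continue;
        };
        table.push((start..=end, kind));
    }
    table
}

/// Looks up the Numeric_Type of `c`; `None` means the default value None.
pub fn numeric_type(
    table: &[(RangeInclusive<u32>, NumericType)],
    c: char,
) -> Option<NumericType> {
    table
        .iter()
        .find(|(range, _)| range.contains(&(c as u32)))
        .map(|(_, kind)| *kind)
}
```

With the sample table, `numeric_type(&table, '一')` yields `Some(NumericType::Numeric)` even though `'一'.is_numeric()` is false, which is the gap this issue is about.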

wooster0 commented 3 years ago

What's been done before is for a new function to be created with a similar name and the old function be deprecated. Perhaps is_num, is_numeric_type, has_numeric_property or something else.

If that's what the libs team decides on, how about is_numeral? In any case a new method name would mean less consistency with the other methods that have numeric in their name, such as is_alphanumeric and is_ascii_alphanumeric, so I hope it can be avoided and the existing method gets changed instead.

In either case the new functionality should, I think, be based on DerivedNumericType.txt which also tells how they should be derived from other Unicode data. DerivedNumericValues lists each value with the numeric value property.

That sounds good. And perhaps that doesn't just fix Chinese numerals not being detected but other specific cases too.

Let's wait for what the libs team thinks.

workingjubilee commented 3 years ago

In spite of being "breaking" changes, small fixes in the implementation of a type are permitted by RFCs 1105 and 1122, for the same reason that it would obviously behoove Rust to fix the compiler if rustc suddenly started solving 0 + 1 for -3. In saying such, I am not immediately weighing in on whether this is such an error in need of fixing, merely noting that in principle this could be deemed such a change.

inquisitivecrystal commented 3 years ago

@rustbot claim

It seems like everyone agrees that we either need a new API for this, or we need to change the existing one. It's for the libs team to decide which. In the meantime, there's some background work I can get started on.

inquisitivecrystal commented 2 years ago

Now that the prerequisite work is done, nominating for T-libs-api consideration.

Problem statement: is_numeric checks the general category of a character to determine whether it is numeric. This behavior is guaranteed by its documentation. Unfortunately, this does not work for Han characters, where it is necessary to either check the Unihan database or, more conveniently, the extracted/DerivedNumericType.txt UCD file. Without checking one of these sources, Han characters will not be categorized as numbers, because their general category is Lo (letter, other), rather than one of the numeric categories.

Question: Should we (1) provide a new function for checking whether a character is a number that works for Han characters, possibly deprecating is_numeric; (2) change is_numeric to work for Han characters, on the basis that that behavior is more sensible; or (3) do nothing (which I wouldn't recommend, as the current behavior is kind of odd).

wooster0 commented 2 years ago

I vote for 2.

BurntSushi commented 2 years ago

cc @Manishearth @SimonSapin do either of you have an opinion on this? Are there other folks we should CC?

One question I have is whether Unicode specifically has a recommendation for these sorts of APIs. I'm not aware of any.

It might also make sense to do a survey of how other standard libraries implement "is numeric" predicates.

ChrisDenton commented 2 years ago

For background information, see Chapter 4 of the Unicode standard. Specifically 4.5 and 4.6.

Basically there are two relevant things here. Each code point in the Unicode database belongs to a General Category. They can only belong to one category. From the standard (Chapter 4.5):

There are several other conventions for how General_Category values are assigned to Unicode characters. Many characters have multiple uses, and not all such uses can be captured by a single, simple partition property such as General_Category. Thus, many letters often serve dual functions as numerals in traditional numeral systems. Examples can be found in the Roman numeral system, in Greek usage of letters as numbers, in Hebrew, and similarly for many scripts. In such cases the General_Category is assigned based on the primary letter usage of the character, even though it may also have numeric values, occur in numeric expressions, or be used symbolically in mathematical expressions, and so on.

Code points also have properties. Specifically the Numeric_Type derived property groups every code point into one of four types: Decimal, Digit, Numeric or None (which is the default).

See DerivedNumericType.txt for the data. Quoting from the standard (Chapter 4.6):

The Numeric_Type = Decimal property value (which is correlated with the General_Category = Nd property value) is limited to those numeric characters that are used in decimal radix numbers and for which a full set of digits has been encoded in a contiguous range, with ascending order of Numeric_Value, and with the digit zero as the first code point in the range

Decimal digits, as defined in the Unicode Standard by these property assignments, exclude some characters, such as the CJK ideographic digits (see the first ten entries in Table 4-5), which are not encoded in a contiguous sequence... Traditionally, the Unicode Character Database has given these sets of noncontiguous or compatibility digits the value Numeric_Type = Digit, to recognize the fact that they consist of digit values but do not necessarily meet all the criteria for Numeric_Type = Decimal. However, the distinction between Numeric_Type = Digit and the more generic Numeric_Type = Numeric has proven not to be useful in implementations. As a result, future sets of digits which may be added to the standard and which do not meet the criteria for Numeric_Type = Decimal will simply be assigned the value Numeric_Type = Numeric.
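The contiguity requirement quoted above is what makes the value of a Decimal digit computable by plain subtraction from the zero of its range. A small sketch of that idea follows; `decimal_value` is a hypothetical helper, and the caller is assumed to already know which code point is the zero of the digit set.

```rust
/// Value of `c` within a contiguous run of ten digits that starts at `zero`.
/// Hypothetical helper: it trusts the caller to pass the right `zero`.
pub fn decimal_value(c: char, zero: char) -> Option<u32> {
    let offset = (c as u32).checked_sub(zero as u32)?;
    (offset < 10).then_some(offset)
}
```

For ARABIC-INDIC DIGIT THREE (U+0663) and its zero (U+0660) this gives `Some(3)`. The CJK ideographic digits are not laid out that way: 一 is U+4E00, 二 is U+4E8C and 三 is U+4E09, so subtracting from 一 yields `None` for 二 and a meaningless `Some(9)` for 三. That is why they cannot be Numeric_Type = Decimal.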

Quick overview from browsing docs of a few languages:

Rust

pub fn is_numeric(self) -> bool

Returns true if this char has one of the general categories for numbers.

The general categories for numbers (Nd for decimal digits, Nl for letter-like numeric characters, and No for other numeric characters) are specified in the Unicode Character Database UnicodeData.txt.

Python

str.isnumeric()

Return True if all characters in the string are numeric characters, and there is at least one character, False otherwise. Numeric characters include digit characters, and all characters that have the Unicode numeric value property, e.g. U+2155, VULGAR FRACTION ONE FIFTH. Formally, numeric characters are those with the property value Numeric_Type=Digit, Numeric_Type=Decimal or Numeric_Type=Numeric.

.NET

public static bool IsNumber (char c);

Valid numbers are signified by the Unicode designation "Nd" (number, decimal digit), "Nl" (number, letter), "No" (number, other).

[Note: I'm paraphrasing so as to remove a level of indirection]

Go

func IsNumber(r rune) bool

IsNumber reports whether the rune is a number (category N).

Manishearth commented 2 years ago

Précis: I don't think we should change the existing API, and I don't consider the existing API a "bug" beyond perhaps a less ambiguous choice of naming having been possible. I'm open to adding a new one to choose between. I do not think deprecation is the right way unless we are adding two functions.

Prelude: Handling text in a cross-language way

Okay, so the main thing about text is that handling text cross-language is an incredibly hard problem. This is not due to Unicode; this is an intrinsic property of the vast conceptual diversity in text. Did you know that Unicode does not even attempt to define "character"? There's no single definition of the term that applies uniformly to all writing systems. More often than not what people call a "unicode problem" is actually just an intrinsic problem with trying to stuff this conceptual diversity into little boxes.

In other words:

tired: "this behavior has been present since unicode 1.0"

wired: "this behavior has been present in the technology of writing ever since the first misguided mesopotamian decided to make some scratches on a rock seven thousand years ago"

Typically my first reaction to almost every such question is "what are you actually trying to do?". With international text, and therefore with Unicode, often people are attempting to apply their intuitions from the writing systems they are familiar with and assuming concepts apply uniformly elsewhere. They're almost always wrong about that, and a bunch of work in teasing that out is to figure out what operation they are actually looking for. E.g. the operation "split a word into letters" does not make sense in general, but "split a word into letters for showing cursors to the user" or "split a word into letters for taking the first letter out for making acronyms" or "split a word into letters for backspace to work" do make more sense (and are different operations!).

That's my reaction here as well. What do we actually want when we provide is_numeric? Unfortunately, we're an API, not an end user, so we can't quite answer that. What we can do is figure out the range of possibilities and document that better.

Numbers in Chinese

(Going to try to keep the examples here in Mandarin, but my Cantonese is way better and I'm somewhat translating between the two, please pardon any mistakes)

In modern Chinese, you see numbers done in two ways. Either (western) Arabic numerals are used (e.g. "請給我55塊錢", "please give me fifty five bucks"), or Chinese numerals are used (e.g. "請給我五十五塊錢", where "五十五" is "five ten five", or "fifty five").

There's a bit of a distribution on what's used when. Chinese numerals tend to be used in sentences when counting stuff (eg 我有五本書 "I have five books"), sometimes for money (as above) and dates (eg 今天是五月五號 "today is May 5"). Western Arabic numerals tend to be used when talking about dates (eg 今天是5月5號 "today is May 5") and money, and almost always for phone numbers.

It gets even wrinklier when you account for the fact that there's a separate character for saying "two" when you're counting stuff.

Note that the numeral 55 in text would be read identically as 五十五 (i.e. not as the english word "fifty-five", but rather as "wǔshíwǔ" or "ng⁵ sap⁶ ng⁵" or whatever).

The system cycles around every myriad, so the number 555,555,555,555 would be 五千五百五十五億,五千五百五十五萬,五千五百五十五 (I have added the commas in for illustration, they are never used), using the same characters for 1000, 100, and 10 every cycle, but using new characters to mark every power of a myriad. We do something similar in English with powers of a thousand, e.g. 555,555,555 would be "five hundred fifty five million, five hundred fifty five thousand, five hundred and fifty five", reusing the same words for "hundred" and the "-ty" morpheme.
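To make the myriad cycle concrete, here is a deliberately simplified sketch. The names `spell_group` and `spell_out` are hypothetical, and the function skips zero digits entirely instead of applying the real rules for 零 placeholders, so it is only meaningful for numbers like the examples above that contain no zeros. It also writes 10 as 一十 and stops below 10^12.

```rust
const DIGITS: [char; 10] = ['零', '一', '二', '三', '四', '五', '六', '七', '八', '九'];
// Markers reused inside every group of four digits.
const PLACES: [&str; 4] = ["", "十", "百", "千"];
// Markers for successive powers of a myriad (10^0, 10^4, 10^8).
const MYRIADS: [&str; 3] = ["", "萬", "億"];

/// Spells out one group of four decimal digits (`n < 10_000`).
/// Simplification: zero digits are dropped, no 零 placeholder is inserted.
fn spell_group(n: u64) -> String {
    let mut out = String::new();
    for place in (0..4).rev() {
        let digit = (n / 10u64.pow(place as u32)) % 10;
        if digit != 0 {
            out.push(DIGITS[digit as usize]);
            out.push_str(PLACES[place]);
        }
    }
    out
}

/// Spells out `n` group by group; returns `None` for `n >= 10^12`.
pub fn spell_out(n: u64) -> Option<String> {
    if n == 0 {
        return Some(DIGITS[0].to_string());
    }
    if n >= 1_000_000_000_000 {
        return None;
    }
    let mut out = String::new();
    for level in (0..3).rev() {
        let group = (n / 10_000u64.pow(level as u32)) % 10_000;
        if group != 0 {
            out.push_str(&spell_group(group));
            out.push_str(MYRIADS[level]);
        }
    }
    Some(out)
}
```

`spell_out(555_555_555_555)` produces the string from the paragraph above without the illustrative commas. Note how 千, 百 and 十 repeat in every group while 萬 and 億 each appear once.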

Note that when talking about years and phone numbers Chinese numerals tend to be used similarly to Western ones: The year 2022 is written as 二零二二年 ("two zero two two year") not 二千二十二年 ("two thousand twen-ty two year"), and similarly with phone numbers. This is because years and phone numbers are typically spoken as a sequence of digits rather than a single number. We do the same thing in English when we read out phone numbers, though we're kinda haphazard about years, often reading them as two blocks ("twenty twenty two" or "nineteen sixty-five").
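The digit-by-digit style used for years and phone numbers is a plain per-digit mapping, in contrast to the spelled-out form. A tiny sketch, where `digit_by_digit` is a hypothetical name and 零 is assumed as the zero (〇 is also used in practice):

```rust
const DIGITS: [char; 10] = ['零', '一', '二', '三', '四', '五', '六', '七', '八', '九'];

/// Writes `n` as a sequence of Chinese digit characters, one per decimal digit.
pub fn digit_by_digit(n: u32) -> String {
    n.to_string()
        .bytes()
        // `to_string` on an unsigned integer yields ASCII digits only.
        .map(|b| DIGITS[(b - b'0') as usize])
        .collect()
}
```

`digit_by_digit(2022)` gives 二零二二, matching the year example above.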

Are they numerals?

Why am I saying all of this? Well, you might want to call 零/〇，一，二（+ 兩？），三，四，五，六，七，八，九，百，千，萬，億，。。。"numerals" but one can easily argue that they are "words" (and "letters", if you're trying to shoehorn that concept to logographic writing systems) or "morphemes".

In other words you can say that "五十五" is closer to saying "fifty-five" than it is to saying "55". The mere fact that for numbers greater than ten you are spelling out the word as spoken (rather than just writing a compact digit-by-digit representation) is a pretty clear indicator that these are words. Half the characters in the chinese representation of "555,555,555,555" are not "5"!

The fact that they can be written "as a sequence of digits" as done in years and phone numbers because that's how those are spoken aloud, further bolsters their status as letters, to me. They are used in a way that character-for-character corresponds to the spoken word, either spelling out a number with tens and hundreds and stuff, or spelling out a sequence of digits without them, depending on what is needed in the situation.

An interesting example (credit kourge) is the idiom 一五一十, which is comprised purely of numerals, but is not a number; it is an idiom (specifically, a chengyu) meaning "in full detail". Note that this is distinct from a number having an idiomatic meaning (like "420" in English), this is a sequence of numerals that do not form a number that really are just words forming an idiom. Similar to saying "ten-four"/"10-4" in English, while it's comprised of numbers, "ten-four" is not a number.

No matter how you slice it, they are not just numerals, they are at the very least words/letters as well.

What's a "numeral"?

Here's a strawman set of consistent cross-language features for what are considered numerals:

Used to represent a number
Are heterograms due to their usage as numerals

When I say they are heterograms, I mean that they are pronounced differently in context. For example, "5" is pronounced "five", but "55" is not simply pronounced "five five". Nor is "5^th" pronounced "five-th"¹. We use numerals as building blocks to denote words that have to be read as wholes, not as a sum of their parts.

This set of criteria would determine western (and eastern) Arabic numerals to be numerals, as well as Roman numerals, but not Chinese numerals. It gets a bit hard to apply to Japanese which is already chock-full of heterograms, but you can be more specific about the "due to their usage as numerals" to make it work.

But that's an illustrative strawman, to demonstrate the kind of work it takes to precisely define a concept of "numeral" cross language. As mentioned before, all of this depends on the use case.

What should we do here?

To me, is_numeric() has a pretty important property in its current form: it does not clash with is_alphabetic(). I think it would be extremely incorrect to exclude Chinese numerals from is_alphabetic(), they are alphabetic, within the already strained metaphor of applying "alphabeticyness" to non-alphabets. They are just also numericy. This means that including them in is_numeric() is potentially confusing.

I think is_numeric() should be left as is, with documentation noting that ideographic numbers are considered alphabetic, not numeric. As mentioned in the first section, the important thing to do is ask what people actually plan to do with this, and since we're an API the best we can do is help the user ask themselves that question with good docs.
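For illustration, the current split can be observed directly for the characters discussed here. `alpha_and_numeric` is a hypothetical helper that only calls the two existing std predicates, and the sketch makes no claim about characters other than the ones shown.

```rust
/// Returns `(is_alphabetic, is_numeric)` for `c` using the std predicates.
pub fn alpha_and_numeric(c: char) -> (bool, bool) {
    (c.is_alphabetic(), c.is_numeric())
}
```

Today `alpha_and_numeric('一')` is `(true, false)` and `alpha_and_numeric('5')` is `(false, true)`. Including Han numerals in is_numeric() would make the first result `(true, true)`.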

It might be worth adding a second function that handles the Numeric Type Unicode property. I'm not sure what it should be called.

I think the technically correct set of naming from a unicode standpoint would probably be that is_numeral() would handle general category and is_numeric() would handle "Numeric Type", but besides being a breaking change I don't think this technical correctness gets us that much because the distinction only really makes sense if you're already super familiar with these nuances (and Unicode terminology, which isn't perfect). Furthermore, there is a ton of symmetry between is_alphabetic() and is_numeric() and we'd lose that.

If we could go back and change things I think a potential avenue to have explored would have been to have two functions where neither has a shorter name (so they feel on equal footing) so people are forced to contend with this. I'm not actually sure if that's a good idea, mind you, I just think it would have been worth exploring further.

¹ In fact, this is one of the few cases in English where you get to see Japanese-style heterogram disambiguation: it's not even that the 5 in 5^th is pronounced "fif" or something (how would you extend this explanation to "1^st"?). The 5 in 5^th is pronounced "fifth", and the ^th is a disambiguation mark with no pronunciation of its own, telling you how to pronounce the previous letter. The symbol "5" just happens to have a whole bunch of pronunciations in English, including "five", "fifth", "fifty", etc. This is similar to how in Japanese, 読む "read", pronounced "yomu", is not actually 読 "yo" + む "mu"; it is actually 読 "yomu" + む "telling you that the previous character should take the pronunciation ending in 'mu' instead of the three other pronunciations", since 読 has multiple pronunciations.

BurntSushi commented 2 years ago

Thanks @Manishearth. Probably the best and fastest response to a ping that I could ever hope for haha. What you say about focusing on actual use cases is exactly what I was hoping to see. :)

I think is_numeric() should be left as is, with documentation noting that ideographic numbers are considered alphabetic, not numeric. As mentioned in the first section, the important thing to do is ask what people actually plan to do with this, and since we're an API the best we can do is help the user ask themselves that question with good docs.

Given everything you said, I'm inclined to agree here. And absolutely in favor of better docs (perhaps even including some portion of your comment) giving folks more info to decide with.

It might be worth adding a second function that handles the Numeric Type Unicode property. I'm not sure what it should be called.

Aye yeah I'd potentially be open to this but would definitely like to see some concrete use cases motivating it. (And ideally, those would become part of the docs for this new method.)

Manishearth commented 2 years ago

Yeah I think the way to go about it is to ask around for a use case and then design such a function with a name appropriate for the use case.

eggyal commented 2 years ago

I wonder whether stdlib should simply provide a more generic method for querying Unicode properties, through which users can establish for themselves whether Numeric_Type = Numeric (or whether a given char has any other interesting Unicode property) instead of trying to capture every possible nuance of use cases in its API? Certainly this delegates understanding/applying Unicode properties to users, but this will probably be required to some extent anyway.

Manishearth commented 2 years ago

I'd rather that exist in separate crates; wanting to query Unicode properties is a pretty niche thing.

To some extent, even these methods are kinda niche and Rust has them because people expect them, not because they are necessarily the only way such methods would make sense.

inquisitivecrystal commented 2 years ago

I wonder whether stdlib should simply provide a more generic method for querying Unicode properties, through which users can establish for themselves whether Numeric_Type = Numeric (or whether a given char has any other interesting Unicode property) instead of trying to capture every possible nuance of use cases in its API? Certainly this delegates understanding/applying Unicode properties to users, but this will probably be required to some extent anyway.

One major disadvantage of this is that anything we add to std is going to end up in our binaries. We're already going to Herculean efforts to take up as little space as possible for the information we do have. Adding the ability to query any property would increase binary size, but would likely not be used by most applications.

inflation commented 2 years ago

To further complicate the situation, consider the character "幺" (U+5E7A). It has the Numeric property and is listed in DerivedNumericType.txt. But it is used not only as a digit for "one" when reading phone numbers to prevent ambiguity ("幺幺零" for 110, the emergency line), but also as a surname and as an adjective meaning "least" ("幺妹" for youngest, "least aged", sister).

This ambiguity is inherent in the language and cannot be resolved without context. People would be very surprised when they are hit by this.

I propose an API with name like can_be_numeric() so at least users know there are chances that the character is not used as numeric here.
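One way such an API could be shaped, sketched under loud assumptions: `Numericness`, `numericness`, `can_be_numeric` and the `CONTEXTUAL` list are all hypothetical, and the list is a small hand-picked sample rather than data derived from Unicode.

```rust
/// Three-way answer instead of a plain bool (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Numericness {
    No,
    Yes,
    DependsOnContext,
}

/// Hand-picked characters that may act as numerals or as ordinary words.
/// Assumption for illustration only; a real table would come from Unicode data.
const CONTEXTUAL: &[char] = &[
    '一', '二', '三', '四', '五', '六', '七', '八', '九', '十', '幺',
];

/// Classifies `c` using the existing std predicate plus the sample list.
pub fn numericness(c: char) -> Numericness {
    if c.is_numeric() {
        Numericness::Yes
    } else if CONTEXTUAL.contains(&c) {
        Numericness::DependsOnContext
    } else {
        Numericness::No
    }
}

/// True if `c` is, or could be, used as a numeral.
pub fn can_be_numeric(c: char) -> bool {
    numericness(c) != Numericness::No
}
```

Returning `DependsOnContext` for '幺' keeps the uncertainty visible to the caller instead of hiding it behind `true`.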

gbraad commented 2 years ago

Everyday Chinese do not use these numbers to express prices. Only for formal use. Like a year. 二〇二二年. Which includes a different zero... Is this a numeral?

Are they numerals?

The problem is that then 百 (100), 千 (1000), 万 (10000) and so on also need to be recognized. Are 一百 (100) and 十万 (100,000) numeric? Yes... but those characters can't stand on their own. You can't use '万' by itself as an actual numeral, as it needs to be preceded by a number.

What if the financial numbers are used, the so-called capitalized numbers (大写)? This all introduces a lot more complexity for not much gain. These are the financial equivalent characters (大写): 零 0, 壹 1, 贰 2, 叁 3, 肆 4, 伍 5, 陆 6, 柒 7, 捌 8, 玖 9, 拾 10, 佰 100, 仟 1000, 萬 10000, 億 100000000

I'd rather that exist in separate crates;

Agree with @Manishearth and @inquisitivecrystal that this is better served by a crate, and not by a standard library, as this is pretty much like a timezone problem. You wanna keep this outside of the language as it is a very different kind of complexity. And, hate to say it, it is also subject to the whim of a culture.

Note: wasn't able to see the whole conversation on my phone. Sorry if some got duplicated. The comment is also not meant to be snarky, as the timezone have some great examples. Interpretations can change over time.

bstrie commented 2 years ago

I think we should hew as closely to Unicode concepts as possible here. Not necessarily because Unicode is always brilliant, but rather because it's standard, and also because in general we're unlikely to do any better than it does. And if people don't want full Unicode support bloating up the stdlib, then by all means leave it to an external library (though I'd love a potential future where I can treat libstd as though it were an ordinary crate with Cargo features, so I can say std = { features = ["unicode"] } to get full Unicode support out of the box).

pmf commented 2 years ago

Raku might serve as an example for @bstrie's suggestion; it exposes Unicode properties as described here: https://www.codesections.com/blog/raku-unicode/

For the given example:

[0] > say '一'.uniprop('Numeric_Type');
Numeric
[0] > say 'a'.uniprop('Numeric_Type');
None

IMO this is the only sensible way to avoid having weird Unicode vs. language API inconsistencies, and it shifts the discussion from fuzzy cultural and philosophical questions to "just do what Unicode does".

MaeIsBad commented 2 years ago

After seeing this issue I looked through other usages of is_numeric in public git repos and most of them I would consider to be bugs, mostly due to people mistaking is_numeric for is_digit. I think if the rust team decides to deprecate and create a new function they should also use this opportunity to try to make sure the new name makes it more explicit that numeric characters aren't just digits but also include characters like ① or ¾. Maybe something along the lines of is_number_like?

Examples of the bugs I mentioned:
[1] https://github.com/eirproject/eir/blob/master/libeir_ir/src/text/parser/lexer.rs#L363
[2] https://github.com/mozilla/application-services/blob/main/components/nimbus/src/versioning.rs#L203
[3] https://github.com/bluejekyll/trust-dns/blob/main/crates/proto/src/rr/domain/name.rs#L544
[4] https://github.com/coding-horror/basic-computer-games/blob/main/18_Bullseye/rust/src/lib.rs#L195
[5] https://github.com/warycat/rustgym/blob/master/leetcode/src/d7/_736_parse_lisp_expression.rs#L99
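A sketch of the general shape of this mistake, not taken from any of the linked repositories. Both function names are hypothetical; the point is that a string can pass an is_numeric check and still fail to parse as an integer.

```rust
/// Buggy validation: accepts any character in the Nd, Nl or No categories.
pub fn looks_like_number_buggy(s: &str) -> bool {
    !s.is_empty() && s.chars().all(char::is_numeric)
}

/// What was usually intended: ASCII digits only.
pub fn looks_like_number(s: &str) -> bool {
    !s.is_empty() && s.chars().all(|c| c.is_ascii_digit())
}
```

For the input "①¾" the buggy check returns true, yet `"①¾".parse::<u32>()` is an error. The ASCII version rejects that input and accepts "42", which does parse.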

cbeuw commented 2 years ago

I fundamentally disagree with Unicode's decision to classify code points as "numeric or not". It is, in general, impossible to determine if a code point (not even a grapheme cluster a.k.a. "perceived character"!) is used as a numeral without context, because human languages are very high up the Chomsky hierarchy, to put it mildly.

They don't even need to look very far to realise this. Is "I" a numeral? Well it isn't at the beginning of my comment, but it is on my clock.

... And to fix this, they added new codepoints specifically for Roman numerals, except that for the past two thousand years people didn't treat them as characters separate from the ones in the Latin alphabet, so the new codepoints are entirely "made up" so to speak, and people rarely use them (centuries are written in Roman numerals in many European languages so Roman numerals come up quite frequently. I don't think we should tell the Parisians that they are typing it wrong....)

As for Chinese, well, "一万" is a number (ten thousand) and both characters should be numerals, but "万一" (roughly "just in case") isn't a number, are the characters still numerals? And there's also the issue that we don't always have Chinese characters in Unicode, we have Unified CJK characters, the same character may always be a numeral in one language, but never in another.

ChrisDenton commented 2 years ago

After seeing this issue I looked through other usages of is_numeric in public git repos and most of them I would consider to be bugs, mostly due to people mistaking is_numeric for is_digit.

I think this suggests that the documentation for is_numeric should mention is_digit. As it stands, the is_digit function compares it with is_numeric and then says "for a more comprehensive understanding of ‘digit’, see is_numeric()". But then is_numeric just gives a bland statement about "general categories", which won't necessarily mean much to the reader and then a link to UAX #44 which does give more information but you have to wade through a lot to get it.

MaeIsBad commented 2 years ago

Maybe we should move this discussion into a separate issue. I think most of the time people make this mistake it's from them typing char.num and seeing is_numeric pop up as a first suggestion from their IDE. Adding a mention of is_digit in the docs is a good idea but I don't think that's gonna stop most causes of this mistake

gbraad commented 2 years ago

As @cbeuw points out, Unicode is not a reliable source. It provides information, but no conclusive answer. It is way more about context, completeness, etc. and this is all subject to a lot of nuance

As for Chinese, well, "一万" is a number (ten thousand) and both characters should be numerals, but "万一" (roughly "just in case") isn't a number,

even "万" by itself does not have a real meaning as it needs a preceding numeral to make it meaningful and correct.

Unified CJK characters

This clearly shows why this would not work in a generalized way. It would make a lot more sense to allow people to use a library to solve their specific problem.

Your example about Roman numerals is an interesting one. I can't recall a language that interprets these as a standard function off the top of my head, and the libraries I have seen are sometimes even mistaken. Even worse, is IIII a valid numeral? If you ask watchmakers, a lot of them will say yes. Context...

cbeuw commented 2 years ago

even "万" by itself does not have a real meaning as it needs a preceding numeral to make it meaningful and correct.

To make it more fun, "千万" can mean either "ten(s) of millions" or "please make certain". You need to go well into NLP territory to figure out which "千万" in "千万富翁千万不能乱花一千万元" ("millionaires must not waste ten million yuan") is a numeral.

While abstractions are nice, there is a limit. We cannot abstract the unabstractable, or find a common ground where none exists.

gbraad commented 2 years ago

I am not a native speaker of Chinese (though my wife is). I think this is a question to ask a native Chinese speaker who has experience with programming: does this make sense to implement?

I think the previous edit makes this clear:

We cannot abstract the unabstractable, or find a common ground where none exists.

voronoipotato commented 2 years ago

I wonder if this could be attempted within a crate to figure out if there is or is not value through practical use for a more robust is_numeric, and then we could talk about merging it later. It does seem like both statements "this is intractable" and "a simpler approach that is tractable may still be useful" could both be true. A lot of the arguments made here also apply to concepts like "time", yet we do still attempt it. I suspect it will be difficult to fully understand what a useful implementation might even look like without prototypes that are actively used by native speakers.
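As a rough idea of what such a prototype crate might start from, here is a minimal sketch. The function names `cjk_digit_value` and `parse_positional` are hypothetical and do not refer to any existing crate. It assumes only the ten digits 零/〇 through 九 from the original post and purely positional writing (as in "二〇二四"); it deliberately ignores multipliers like 十 or 万 and all of the context problems described above.

```rust
/// Hypothetical: maps a single Chinese digit character to its value.
/// Covers only 零/〇 and 一 through 九, nothing else.
fn cjk_digit_value(c: char) -> Option<u32> {
    match c {
        '零' | '〇' => Some(0),
        '一' => Some(1),
        '二' => Some(2),
        '三' => Some(3),
        '四' => Some(4),
        '五' => Some(5),
        '六' => Some(6),
        '七' => Some(7),
        '八' => Some(8),
        '九' => Some(9),
        _ => None,
    }
}

/// Hypothetical: parses a purely positional string such as "二〇二四".
/// Returns None for empty input, unknown characters, or overflow.
fn parse_positional(s: &str) -> Option<u32> {
    if s.is_empty() {
        return None;
    }
    s.chars().try_fold(0u32, |acc, c| {
        acc.checked_mul(10)?.checked_add(cjk_digit_value(c)?)
    })
}

fn main() {
    println!("{:?}", parse_positional("二〇二四"));
    // Multiplier characters are not handled by this sketch at all,
    // so both of these are rejected regardless of their meaning.
    println!("{:?}", parse_positional("一万"));
    println!("{:?}", parse_positional("万一"));
}
```

Even this small sketch has to pick a narrow definition up front, which is more or less the point being debated here.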

wwylele commented 2 years ago

I am a native Chinese speaker. Before seeing this issue, I didn't trust is_numeric outside of the ASCII range at all (and hence would probably never use it given we already have is_digit/is_ascii_digit). And, to me the question "Is 五 numeric?" roughly has the same vibe as "Is 'five' numeric?" - the answer is: yes, but that's not a property of a character/Unicode code point - '五' / 'five' is a word!

Manishearth commented 2 years ago

Hey folks, I don't think it's necessary to discuss those facets of this issue at this time unless the libs team wants more discussion here.

I did share my comment to a wider audience but that doesn't mean we need wider input here, let's let the libs team decide next steps (or if they want to get more feedback). This isn't an RFC, and a long discussion only makes more work for the team. If y'all have new and relevant points to bring up you should.

(Also, as far as why Unicode has such categories in the first place: as with most Unicode properties, there's a decently consistent definition being used internally, and the property has uses in various Unicode algorithms, however Unicode properties aren't really intended to be used based on vibes which is why it feels weird that Unicode is attempting such a distinction in the first place)

wsy2220 commented 2 years ago

As a native Chinese speaker, I think most programmers have the assumption that is_numeric is only true for ASCII numbers. Expanding the definition of numbers will generate endless surprises, some of which may be security vulnerabilities.

mafrasi2 commented 2 years ago

As a native Chinese speaker, I think most programmers have the assumption that is_numeric is only true for ASCII numbers. Expanding the definition of numbers will generate endless surprises, some of which may be security vulnerabilities.

That already isn't the case though, for example is_numeric is also true for ①, ¾ and ৬.
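For anyone curious how far this goes, the following sketch enumerates every char for which `is_numeric` is true but `is_ascii_digit` is false. The helper name `non_ascii_numerics` is made up for illustration; the exact count depends on the Unicode version your Rust toolchain ships, so none is stated here.

```rust
/// Hypothetical helper: collects every char that `is_numeric` accepts
/// but that is not one of the ASCII digits 0-9.
fn non_ascii_numerics() -> Vec<char> {
    // Ranges of `char` skip the surrogate gap automatically.
    ('\0'..=char::MAX)
        .filter(|c| c.is_numeric() && !c.is_ascii_digit())
        .collect()
}

fn main() {
    let extras = non_ascii_numerics();
    println!("{} non-ASCII chars are numeric", extras.len());
    // Print a small sample rather than the whole list.
    let sample: String = extras.iter().take(20).collect();
    println!("first few: {}", sample);
}
```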

dtomvan commented 2 years ago

As a native Chinese speaker, I think most programmers have the assumption that is_numeric is only true for ASCII numbers. Expanding the definition of numbers will generate endless surprises, some of which may be security vulnerabilities.

That already isn't the case though, for example is_numeric is also true for ①, ¾ and ৬.

To add to that: if you want to check if a char is [0-9], then you can always use is_ascii_digit. Combining it with str::chars allows for checking it on any string as well.

The naming of these functions (especially the ascii-variant) and the documentation make it very clear how these should be used.

My proposal: add an is_numeric_unicode function that complies with Unicode, accepting exactly the characters that Unicode does. We could also use something like is_numeric_ascii, so people don't confuse the current is_numeric (which could probably also use a name change to is_number_like or maybe is_numeric_class, see this comment) with the aforementioned usage of is_ascii_digit:

"1234".chars().all(|c| c.is_ascii_digit())

The Chinese numerals could be included in its own function, in order to support them through the stdlib.

m-ou-se commented 2 years ago

I wonder what exact use cases people are currently using char.is_numeric() for. For validation and parsing, I'd expect most users to need char.is_ascii_digit() instead.
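A small sketch of why this matters for validation and parsing: a check built on `is_numeric` can accept input that `str::parse` then rejects, because the integer parsers only understand ASCII digits. The two validator names below are hypothetical and exist only for this illustration.

```rust
/// Hypothetical "loose" validator: non-empty and every char is_numeric.
fn looks_numeric_loose(s: &str) -> bool {
    !s.is_empty() && s.chars().all(|c| c.is_numeric())
}

/// Hypothetical "strict" validator: non-empty and every char is 0-9.
fn looks_numeric_strict(s: &str) -> bool {
    !s.is_empty() && s.chars().all(|c| c.is_ascii_digit())
}

fn main() {
    for input in ["1234", "①②", "৬"] {
        println!(
            "{:?}: loose={} strict={} parse={:?}",
            input,
            looks_numeric_loose(input),
            looks_numeric_strict(input),
            input.parse::<u32>()
        );
    }
}
```

With the strict check, anything that passes is made of plain decimal digits, so `parse::<u32>()` can only fail on overflow. With the loose check, "①②" passes validation and then fails to parse.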

ChrisDenton commented 2 years ago

I wonder what exact use cases people are currently using char.is_numeric() for. For validation and parsing, I'd expect most users to need char.is_ascii_digit() instead.

I think that is essentially the conclusion that can be drawn from https://github.com/rust-lang/rust/issues/84056#issuecomment-1185350866. I.e. most users of is_numeric were using the wrong function.

And from what's been said above, I feel like having char.is_numeric() in the standard library was perhaps a mistake even if the name was more explicit about what it does. Not because it can't possibly be useful, but because it's quite niche and there are many different ways people may or may not want to test for "number-like" things such as ①, "two" or ¾.

I do still think that the docs for is_numeric should at least direct people towards is_ascii_digit or is_digit.

m-ou-se commented 2 years ago

Ah yes, thanks.

I feel like having char.is_numeric() in the standard library was perhaps a mistake even if the name was more explicit about what it does. Not because it can't possibly be useful, but because it's quite niche and there are many different ways people may or may not want to test for "number-like" things such as ①, "two" or ¾.

I agree.

joshtriplett commented 2 years ago

Based on @Manishearth's comment and subsequent discussion, we talked about this in today's libs-api meeting, and we agreed that the code shouldn't change here, but the documentation should tell people that this probably isn't what they want, and point to is_digit or is_ascii_* instead.

We'd welcome a documentation PR.
