python / cpython

The Python programming language
https://www.python.org

Document the differences between str.isdigit, isdecimal and isnumeric #80281

Open 40a071e0-9596-4128-9ee3-72bee3c8f16c opened 5 years ago

40a071e0-9596-4128-9ee3-72bee3c8f16c commented 5 years ago
BPO 36100
Nosy @mdickinson, @stevendaprano, @tirkarthi

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

GitHub fields:
```python
assignee = None
closed_at = None
created_at =
labels = ['type-feature', 'docs']
title = 'Document the differences between str.isdigit, isdecimal and isnumeric'
updated_at =
user = 'https://bugs.python.org/StyXman'
```

bugs.python.org fields:
```python
activity =
actor = 'steven.daprano'
assignee = 'docs@python'
closed = False
closed_date = None
closer = None
components = ['Documentation']
creation =
creator = 'StyXman'
dependencies = []
files = []
hgrepos = []
issue_num = 36100
keywords = []
message_count = 12.0
messages = ['336451', '336453', '336454', '336455', '336456', '336459', '336460', '336461', '336462', '336464', '336466', '336467']
nosy_count = 5.0
nosy_names = ['mark.dickinson', 'StyXman', 'steven.daprano', 'docs@python', 'xtreak']
pr_nums = []
priority = 'normal'
resolution = None
stage = None
status = 'open'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue36100'
versions = []
```

40a071e0-9596-4128-9ee3-72bee3c8f16c commented 5 years ago

Following https://blog.lerner.co.il/pythons-str-isdigit-vs-str-isnumeric/, we have this:

Python 3.8.0a1+ (heads/master:001fee14e0, Feb 20 2019, 08:28:02)
[GCC 8.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> '一二三四五'.isnumeric()
True

>>> int('一二三四五')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: invalid literal for int() with base 10: '一二三四五'

>>> float('一二三四五')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: could not convert string to float: '一二三四五'

I think Reuven is right: these should be accepted as input. I just wonder if we should do the same for, for instance, Roman numerals...

stevendaprano commented 5 years ago

I think that analysis is wrong. The Wikipedia page describes the meaning of the Unicode Decimal/Digit/Numeric properties:

https://en.wikipedia.org/wiki/Unicode_character_property#Numeric_values_and_types

and the characters you show aren't appropriate for converting to ints:

py> for c in '一二三四五':
...     print(unicodedata.name(c))
...
CJK UNIFIED IDEOGRAPH-4E00
CJK UNIFIED IDEOGRAPH-4E8C
CJK UNIFIED IDEOGRAPH-4E09
CJK UNIFIED IDEOGRAPH-56DB
CJK UNIFIED IDEOGRAPH-4E94

The first one, for example, is translated as "one; a, an; alone"; it is better read as the *word* one rather than the numeral 1. (Disclaimer: I am not a Chinese speaker and I welcome correction from an expert.)

Likewise U+4E8C, translated as "two; twice".

The blog post is factually wrong when it claims:

"str.isdigit only returns True for what I said before, strings containing solely the digits 0-9."

py> s = "\N{BENGALI DIGIT ONE}\N{BENGALI DIGIT TWO}"
py> s.isdigit()
True
py> int(s)
12

So I think that there's nothing to do here (unless it is perhaps to add a FAQ about it, or improve the docs).
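The contrast between the two strings discussed so far can be put side by side in a small script. This is only an illustrative sketch; the variable names `bengali` and `cjk` are made up here and do not come from the thread.

```python
# Illustrative sketch; the variable names are assumptions, not from the thread.
bengali = "\N{BENGALI DIGIT ONE}\N{BENGALI DIGIT TWO}"
cjk = "一二三四五"

# Bengali digits are decimal digits: isdigit() is True and int() accepts them.
print(bengali.isdigit(), int(bengali))

# The CJK ideographs carry a numeric value but are not digits:
# isnumeric() is True, isdigit() is False, and int() raises ValueError.
print(cjk.isnumeric(), cjk.isdigit())
try:
    int(cjk)
except ValueError as exc:
    print("int() rejected it:", exc)
```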

mdickinson commented 5 years ago

[Steven posted his answer while I was composing mine; posting mine anyway ...]

I don't think this would make sense. There are lots of characters that can't be interpreted as a decimal digit but for which isnumeric nevertheless gives True.

>>> s = "㉓⅗⒘Ⅻ"
>>> for c in s: print(unicodedata.name(c))
... 
CIRCLED NUMBER TWENTY THREE
VULGAR FRACTION THREE FIFTHS
NUMBER SEVENTEEN FULL STOP
ROMAN NUMERAL TWELVE
>>> s.isnumeric()
True

What value would you expect int(s) to have in this situation?

Note that int and float already accept non-ASCII digits:

>>> s = "١٢٣٤٥٦٧٨٩"
>>> int(s)
123456789
>>> float(s)
123456789.0

mdickinson commented 5 years ago

> What value would you expect int(s) to have in this situation?

Actually, I guess that question was too easy. The value for int(s) should obviously be 23 * 1000 + (3/5) * 100 + 17 * 10 + 12 = 23242. I should have used ⅐ instead of ⅗.

Anyway, agreed with Steven that no change should be made here.

tirkarthi commented 5 years ago

I'm not a Unicode expert, but searching along these lines I found a note added in bpo-10610 saying that int() supports characters in the 'Nd' category. So to check whether a string can be converted to an integer with int(), should I be using str.isdecimal() instead of str.isnumeric()?

https://docs.python.org/3/library/stdtypes.html#numeric-types-int-float-complex

> The numeric literals accepted include the digits 0 to 9 or any Unicode equivalent (code points with the Nd property). See http://www.unicode.org/Public/10.0.0/ucd/extracted/DerivedNumericType.txt for a complete list of code points with the Nd property.

>>> [unicodedata.category(c) for c in '一二三四五']
['Lo', 'Lo', 'Lo', 'Lo', 'Lo']
>>> [unicodedata.category(c) for c in '\N{BENGALI DIGIT ONE}\N{BENGALI DIGIT TWO}']
['Nd', 'Nd']

mdickinson commented 5 years ago

> So to check if a string can be converted to integer with help of int() I should be using str.isdecimal() instead of str.isnumeric() ?

Yes, I think that's correct. The characters matched by str.isdecimal are a subset of those matched by str.isdigit, which in turn are a subset of those matched by str.isnumeric. int and float required general category Nd, which corresponds to str.isdigit.

mdickinson commented 5 years ago

> which corresponds to str.isdigit.

Gah! That should have said:

which corresponds to str.isdecimal.

Sorry.
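The subset chain described above (isdecimal within isdigit within isnumeric) can be checked on a handful of characters. The sketch below is illustrative only; the helper `classify` and the choice of sample characters are assumptions made for this example.

```python
import unicodedata

# Sample characters chosen for illustration: an ASCII digit, an Arabic-Indic
# digit, a superscript, a vulgar fraction, and a plain letter.
samples = [
    "5",
    "\N{ARABIC-INDIC DIGIT FIVE}",
    "\N{SUPERSCRIPT TWO}",
    "\N{VULGAR FRACTION THREE FIFTHS}",
    "x",
]

def classify(ch):
    """Return (isdecimal, isdigit, isnumeric) for a single character."""
    return (ch.isdecimal(), ch.isdigit(), ch.isnumeric())

for ch in samples:
    # Print the character name, its general category, and the three flags.
    print(unicodedata.name(ch), unicodedata.category(ch), classify(ch))
```

Each row can only move from False to True when read left to right, which is what the subset relationship predicts.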

tirkarthi commented 5 years ago

> int and float required general category Nd, which corresponds to str.isdigit.

Sorry, did you mean str.isdecimal? There are characters for which isdigit returns True but isdecimal returns False:

>>> '\u00B2'.isdecimal()
False
>>> '\u00B2'.isdigit()
True
>>> import unicodedata
>>> unicodedata.category('\u00B2')
'No'
>>> int('\u00B2')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: invalid literal for int() with base 10: '²'

Is this worth a FAQ entry, or an addition to the existing note on int (which says that characters must belong to the 'Nd' category) stating that str.isdecimal should return True?

stevendaprano commented 5 years ago

On Sun, Feb 24, 2019 at 11:07:41AM +0000, Karthikeyan Singaravelan wrote:

> Is this worth a FAQ entry, or an addition to the existing note on int (which says that characters must belong to the 'Nd' category) stating that str.isdecimal should return True?

Yes, I think that there should be a FAQ about the differences between isdigit, isdecimal and isnumeric, pointing to the relevant Unicode documentation. I would also like to see a briefer note added to each of the string methods' docstrings.

tirkarthi commented 5 years ago

Agreed. Although the behavior of str.isnumeric is correct from the point of view of someone who knows Unicode internals, the name makes it easy for a general user, who doesn't know those internals, to reach for it when trying to determine whether a string can be passed to int(). I am not sure how this can be explained in simpler terms, but it would be good to clarify it in the docs to avoid confusion.

There seems to have been a thread [0] in the past about the confusion caused by having multiple ways to check whether a Unicode string is a number. It is even more confusing on Python 2, where strings are not Unicode by default.

$ python2.7
Python 2.7.14 (default, Mar 12 2018, 13:54:56)
[GCC 4.2.1 Compatible Apple LLVM 7.0.2 (clang-700.1.81)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> '\u00B2'.isdigit()
False
>>> u'\u00B2'.isdigit()
True

[0] https://mail.python.org/pipermail/python-list/2012-May/624340.html

40a071e0-9596-4128-9ee3-72bee3c8f16c commented 5 years ago

Thanks for all the examples, I'm convinced.

stevendaprano commented 5 years ago

I'm re-opening the ticket with a change of subject, because I think this should be treated as a documentation enhancement:

https://docs.python.org/3/library/stdtypes.html#numeric-types-int-float-complex

I don't think we need to worry about backporting the docs to Python 2, but if others disagree, I won't object.

mdickinson commented 2 years ago

I just closed the related issue: https://github.com/python/cpython/issues/93335. It looks as though there are cases where people are using the str.is* methods in LBYL style to ensure that a later int or float will pass. Given that use-case, I wonder whether it's worth a note in the str.isdecimal docs pointing out that this is the method that corresponds to what's accepted by (most of) the numeric constructors. (Or perhaps a note in str.isdigit or str.isnumeric pointing out the contrary.)
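To make the "(most of)" caveat concrete, here is a rough sketch comparing the str.is* checks with what int() actually accepts. The helper `int_or_none` and the list of candidate strings are hypothetical, chosen only for illustration.

```python
def int_or_none(text):
    """Return int(text), or None if int() rejects the string."""
    try:
        return int(text)
    except ValueError:
        return None

# Candidate strings picked for illustration.
candidates = ["42", "-5", " 42 ", "1_000", "\N{SUPERSCRIPT TWO}"]

for s in candidates:
    # Columns: the string, isdecimal(), isdigit(), and the int() result.
    print(repr(s), s.isdecimal(), s.isdigit(), int_or_none(s))
```

Signs, surrounding whitespace, and underscore separators make isdecimal() return False even though int() accepts the string, while SUPERSCRIPT TWO passes isdigit() but is rejected by int(). A look-before-you-leap check built on any of these methods therefore does not match the constructor exactly.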

mdickinson commented 2 years ago

It's also worth noting that the isdigit property is in some sense deprecated by the Unicode consortium, and is not considered to be useful. Quoting from tr44,

> Starting with Unicode 6.3.0, no newly encoded numeric characters will be given Numeric_Type=Digit, nor will existing characters with Numeric_Type=Numeric be changed to Numeric_Type=Digit. The distinction between those two types is not considered useful.
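The three numeric types can be inspected per character with the `unicodedata` module, whose `decimal`, `digit` and `numeric` functions each take an optional default. The sketch below is only illustrative; the helper name `numeric_properties` is an assumption, and the mapping from these functions to the Unicode Numeric_Type values is stated here as my reading of the module documentation rather than something established in this thread.

```python
import unicodedata

def numeric_properties(ch):
    """Return (decimal, digit, numeric) values for ch, with None where undefined."""
    return (
        unicodedata.decimal(ch, None),
        unicodedata.digit(ch, None),
        unicodedata.numeric(ch, None),
    )

# Characters taken from the examples earlier in the thread.
for ch in ["5", "\N{SUPERSCRIPT TWO}", "\N{VULGAR FRACTION THREE FIFTHS}", "一"]:
    print(unicodedata.name(ch), numeric_properties(ch))
```

A character with a decimal value also has digit and numeric values, while SUPERSCRIPT TWO has only the latter two, and the fraction and the ideograph have only a numeric value.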

nicoszerman commented 2 years ago

I had to resort to wrapping float(possible_number) in a try/except to find out whether I can float() a string. I have a few years of experience coding in Python, but surprisingly today is the first time I've stumbled on this question: is this try/except approach Pythonic, or am I missing some function that tells me whether a string can be converted to float? @mdickinson

TeamSpen210 commented 2 years ago

Yes, that’s the correct way. It’s commonly described as “easier to ask forgiveness than permission”, in contrast to “look before you leap”. The exception style is generally a little better, since you sidestep cases where the check and the action don’t have quite the same behaviour, and you also avoid doing the check twice.
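A minimal sketch of that exception-based style follows. The helper name `to_float` and its `default` parameter are hypothetical, introduced only for this example.

```python
def to_float(text, default=None):
    """Try to convert text with float(); return default if float() rejects it."""
    try:
        return float(text)
    except ValueError:
        # float() raises ValueError for strings it cannot parse.
        return default

print(to_float("3.5"))
print(to_float("١٢٣"))        # Arabic-Indic digits are accepted by float()
print(to_float("一二三四五"))  # numeric per isnumeric(), but rejected by float()
```

Only ValueError is caught here, so passing a non-string such as None still raises TypeError; whether that should also map to the default depends on the caller.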