Open 40a071e0-9596-4128-9ee3-72bee3c8f16c opened 5 years ago
Following https://blog.lerner.co.il/pythons-str-isdigit-vs-str-isnumeric/, we have this:
Python 3.8.0a1+ (heads/master:001fee14e0, Feb 20 2019, 08:28:02)
[GCC 8.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> '一二三四五'.isnumeric()
True
>>> int('一二三四五')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: invalid literal for int() with base 10: '一二三四五'
>>> float('一二三四五')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: could not convert string to float: '一二三四五'
I think Reuven is right, these should be accepted as input. I just wonder if we should do the same for, for instance, Roman numerals...
I think that analysis is wrong. The Wikipedia page describes the meaning of the Unicode Decimal/Digit/Numeric properties:
https://en.wikipedia.org/wiki/Unicode_character_property#Numeric_values_and_types
and the characters you show aren't appropriate for converting to ints:
py> for c in '一二三四五':
...     print(unicodedata.name(c))
...
CJK UNIFIED IDEOGRAPH-4E00
CJK UNIFIED IDEOGRAPH-4E8C
CJK UNIFIED IDEOGRAPH-4E09
CJK UNIFIED IDEOGRAPH-56DB
CJK UNIFIED IDEOGRAPH-4E94
The first one, for example, is translated as "one; a, an; alone"; it is better read as the *word* one rather than the numeral 1. (Disclaimer: I am not a Chinese speaker and I welcome correction from an expert.)
Likewise U+4E8C, translated as "two; twice".
The blog post is factually wrong when it claims:
"str.isdigit only returns True for what I said before, strings containing solely the digits 0-9."
py> s = "\N{BENGALI DIGIT ONE}\N{BENGALI DIGIT TWO}"
py> s.isdigit()
True
py> int(s)
12
So I think that there's nothing to do here (unless it is perhaps to add a FAQ about it, or improve the docs).
[Steven posted his answer while I was composing mine; posting mine anyway ...]
I don't think this would make sense. There are lots of characters that can't be interpreted as a decimal digit but for which isnumeric nevertheless gives True.
>>> s = "㉓⅗⒘Ⅻ"
>>> for c in s: print(unicodedata.name(c))
...
CIRCLED NUMBER TWENTY THREE
VULGAR FRACTION THREE FIFTHS
NUMBER SEVENTEEN FULL STOP
ROMAN NUMERAL TWELVE
>>> s.isnumeric()
True
What value would you expect int(s) to have in this situation?
Note that int and float already accept non-ASCII digits:
>>> s = "١٢٣٤٥٦٧٨٩"
>>> int(s)
123456789
>>> float(s)
123456789.0
What value would you expect int(s) to have in this situation?
Actually, I guess that question was too easy. The value for int(s) should obviously be 23 * 1000 + (3/5) * 100 + 17 * 10 + 12 = 23242. I should have used ⅐ instead of ⅗.
Anyway, agreed with Steven that no change should be made here.
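For the curious, the tongue-in-cheek arithmetic can be reproduced with unicodedata.numeric. This is only a sketch: positional_value is a made-up helper name, and treating each character as a base-10 "digit" is exactly the questionable interpretation being joked about, not anything int() does.

```python
import unicodedata
from fractions import Fraction

# The four characters from the example, spelled by their Unicode names.
s = (
    "\N{CIRCLED NUMBER TWENTY THREE}"
    "\N{VULGAR FRACTION THREE FIFTHS}"
    "\N{NUMBER SEVENTEEN FULL STOP}"
    "\N{ROMAN NUMERAL TWELVE}"
)


def positional_value(text):
    """Hypothetical helper: treat each character's numeric value as a base-10 digit."""
    total = Fraction(0)
    for ch in text:
        # unicodedata.numeric returns a float; recover an exact fraction from it.
        value = Fraction(unicodedata.numeric(ch)).limit_denominator(1000)
        total = total * 10 + value
    return total


print(positional_value(s))

# Swapping in the one-seventh fraction no longer gives a whole number.
harder = s.replace("\N{VULGAR FRACTION THREE FIFTHS}", "\N{VULGAR FRACTION ONE SEVENTH}")
print(positional_value(harder))
```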
I'm not a Unicode expert, but searching along these lines I found a note added for bpo-10610 saying that int() supports characters in the 'Nd' category. So to check if a string can be converted to integer with help of int() I should be using str.isdecimal() instead of str.isnumeric() ?
https://docs.python.org/3/library/stdtypes.html#numeric-types-int-float-complex
The numeric literals accepted include the digits 0 to 9 or any Unicode equivalent (code points with the Nd property). See http://www.unicode.org/Public/10.0.0/ucd/extracted/DerivedNumericType.txt for a complete list of code points with the Nd property.
>>> [unicodedata.category(c) for c in '一二三四五']
['Lo', 'Lo', 'Lo', 'Lo', 'Lo']
>>> [unicodedata.category(c) for c in '\N{BENGALI DIGIT ONE}\N{BENGALI DIGIT TWO}']
['Nd', 'Nd']
So to check if a string can be converted to integer with help of int() I should be using str.isdecimal() instead of str.isnumeric() ?
Yes, I think that's correct. The characters matched by str.isdecimal are a subset of those matched by str.isdigit, which in turn are a subset of those matched by str.isnumeric. int and float required general category Nd, which corresponds to str.isdigit.
which corresponds to str.isdigit.
Gah! That should have said:
which corresponds to str.isdecimal.
Sorry.
int and float required general category Nd, which corresponds to str.isdigit.
Sorry, did you mean str.isdecimal? There are characters for which isdigit returns True but isdecimal returns False:
>>> '\u00B2'.isdecimal()
False
>>> '\u00B2'.isdigit()
True
>>> import unicodedata
>>> unicodedata.category('\u00B2')
'No'
>>> int('\u00B2')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: invalid literal for int() with base 10: '²'
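A small sketch of the nesting discussed above (isdecimal inside isdigit inside isnumeric), in case it helps later readers. The helper names classify and nesting_holds are illustrative only; the scan over every code point just checks the claim against whatever Unicode database the running interpreter ships.

```python
import sys


def classify(ch):
    """Return (isdecimal, isdigit, isnumeric) for a single character."""
    return ch.isdecimal(), ch.isdigit(), ch.isnumeric()


samples = [
    ("ASCII digit five", "5"),
    ("Arabic-Indic digit five", "\N{ARABIC-INDIC DIGIT FIVE}"),
    ("superscript two", "\N{SUPERSCRIPT TWO}"),
    ("vulgar fraction three fifths", "\N{VULGAR FRACTION THREE FIFTHS}"),
    ("CJK ideograph five", "\u4e94"),
]

for label, ch in samples:
    print(f"{label:30} {classify(ch)}")


def nesting_holds():
    """Check that isdecimal implies isdigit, and isdigit implies isnumeric."""
    for cp in range(sys.maxunicode + 1):
        dec, dig, num = classify(chr(cp))
        if (dec and not dig) or (dig and not num):
            return False
    return True


print(nesting_holds())
```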
Is this worth a FAQ entry, or an addition to the existing note on int (the one saying that characters should belong to the 'Nd' category) stating that str.isdecimal should return True?
On Sun, Feb 24, 2019 at 11:07:41AM +0000, Karthikeyan Singaravelan wrote:
Is this worth a FAQ entry, or an addition to the existing note on int (the one saying that characters should belong to the 'Nd' category) stating that str.isdecimal should return True?
Yes, I think that there should be a FAQ about the differences between isdigit, isdecimal and isnumeric, pointing to the relevant Unicode documentation. I would also like to see a briefer note added to each of the string methods docstrings as well.
Agreed. Although str.isnumeric behaves correctly from the point of view of a user who knows Unicode internals, its name makes it tempting for a general user to reach for it when trying to determine whether a string can be passed to int(). I am not sure how this can be explained in simpler terms, but clarifying it in the docs would help avoid confusion.
There seems to have been a thread [0] in the past about the confusion caused by having multiple ways to check whether a Unicode string is a number. It is even more confusing on Python 2, where strings are not Unicode by default.
$ python2.7
Python 2.7.14 (default, Mar 12 2018, 13:54:56)
[GCC 4.2.1 Compatible Apple LLVM 7.0.2 (clang-700.1.81)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> '\u00B2'.isdigit()
False
>>> u'\u00B2'.isdigit()
True
[0] https://mail.python.org/pipermail/python-list/2012-May/624340.html
Thanks for all the examples, I'm convinced.
I'm re-opening the ticket with a change of subject, because I think this should be treated as a documentation enhancement:
improve the docstrings for str.isdigit, isnumeric and isdecimal to make it clear what each does (e.g. what counts as a digit);
similarly improve the documentation for int and float? although the existing comment may be sufficient
https://docs.python.org/3/library/stdtypes.html#numeric-types-int-float-complex
I don't think we need to worry about backporting the docs to Python 2, but if others disagree, I won't object.
I just closed the related issue: https://github.com/python/cpython/issues/93335. It looks as though there are cases where people are using the str.is* methods in LBYL style to ensure that a later int or float will pass. Given that use-case, I wonder whether it's worth a note in the str.isdecimal docs pointing out that this is the method that corresponds to what's accepted by (most of) the numeric constructors. (Or perhaps a note in str.isdigit or str.isnumeric pointing out the contrary.)
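To illustrate why an LBYL check only roughly matches what int accepts, here is a minimal sketch. int_accepts is a hypothetical helper, and the candidate strings are my own picks rather than cases from the linked issue. The "1_000" case assumes Python 3.6 or later, where underscores are allowed in numeric strings.

```python
def int_accepts(text):
    """Hypothetical helper: return True if int(text) succeeds."""
    try:
        int(text)
    except ValueError:
        return False
    return True


# Strings where str.isdecimal and int() may or may not agree.
candidates = ["42", "-5", " 42 ", "1_000", "\N{SUPERSCRIPT TWO}", ""]

for text in candidates:
    print(f"{text!r:10} isdecimal={text.isdecimal()!s:5} int_accepts={int_accepts(text)}")
```

Signs, surrounding whitespace and underscores are all accepted by int but rejected by isdecimal, so the check is stricter than the conversion it is meant to guard.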
It's also worth noting that the isdigit property is in some sense deprecated by the Unicode consortium, and is not considered to be useful. Quoting from tr44,
Starting with Unicode 6.3.0, no newly encoded numeric characters will be given Numeric_Type=Digit, nor will existing characters with Numeric_Type=Numeric be changed to Numeric_Type=Digit. The distinction between those two types is not considered useful.
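The three numeric fields behind these properties can be inspected directly with the unicodedata module. A rough sketch follows; numeric_fields is an illustrative name, and None is simply the default I chose for a missing field.

```python
import unicodedata


def numeric_fields(ch):
    """Return (decimal, digit, numeric) values for ch, with None for absent fields."""
    return (
        unicodedata.decimal(ch, None),
        unicodedata.digit(ch, None),
        unicodedata.numeric(ch, None),
    )


for ch in ["5", "\N{SUPERSCRIPT TWO}", "\N{ROMAN NUMERAL TWELVE}", "a"]:
    print(unicodedata.name(ch), numeric_fields(ch))
```

A character with only the digit and numeric fields filled in (such as SUPERSCRIPT TWO) is the kind that passes str.isdigit but is rejected by int.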
I had to resort to wrapping float(possible_number) in try/except to find out whether I can float() a string. I have a few years of experience coding in Python, but surprisingly today is the first time I've stumbled on this question: is this try/except approach Pythonic style, or am I missing some function to find out if it's possible to convert a string to float? @mdickinson
It’s the correct way yes. Commonly it’s described as “easier to ask forgiveness than permission”, in contrast to “look before you leap”. The exception style is generally a little better since you bypass issues where the check and the action don’t have quite the same behaviour, and also avoid doing the check twice.
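A minimal sketch of the exception style, assuming the input is always a str. The name to_float and the default parameter are illustrative, not an existing API.

```python
def to_float(text, default=None):
    """Hypothetical helper: return float(text), or default if text is not a valid float literal."""
    try:
        return float(text)
    except ValueError:
        return default


print(to_float("1.5"))
print(to_float("\u0661\u0662\u0663"))  # Arabic-Indic digits, accepted by float
print(to_float("\N{SUPERSCRIPT TWO}"))  # isdigit() is True, but float rejects it
print(to_float("not a number", default=0.0))
```

Only ValueError is caught here; passing a non-string such as None would still raise TypeError, which seems preferable to silently hiding a programming error.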
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
GitHub fields:
```python
assignee = None
closed_at = None
created_at =
labels = ['type-feature', 'docs']
title = 'Document the differences between str.isdigit, isdecimal and isnumeric'
updated_at =
user = 'https://bugs.python.org/StyXman'
```
bugs.python.org fields:
```python
activity =
actor = 'steven.daprano'
assignee = 'docs@python'
closed = False
closed_date = None
closer = None
components = ['Documentation']
creation =
creator = 'StyXman'
dependencies = []
files = []
hgrepos = []
issue_num = 36100
keywords = []
message_count = 12.0
messages = ['336451', '336453', '336454', '336455', '336456', '336459', '336460', '336461', '336462', '336464', '336466', '336467']
nosy_count = 5.0
nosy_names = ['mark.dickinson', 'StyXman', 'steven.daprano', 'docs@python', 'xtreak']
pr_nums = []
priority = 'normal'
resolution = None
stage = None
status = 'open'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue36100'
versions = []
```