Treat U+4E17 as a numeric value

ghost commented 6 years ago

BPO	34763
Nosy	@malemburg, @vstinner, @benjaminp, @ezio-melotti, @stevendaprano, @berkerpeksag, @zhangyangyu, @johnlinp
PRs	python/cpython#9474

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:

```python
assignee = None
closed_at =
created_at =
labels = ['type-bug', 'invalid', '3.9', 'expert-unicode']
title = 'Treat U+4E17 as a numeric value'
updated_at =
user = None
```

bugs.python.org fields:

```python
activity =
actor = 'xiang.zhang'
assignee = 'none'
closed = True
closed_date =
closer = 'xiang.zhang'
components = ['Unicode']
creation =
creator = '\xe8\x8d\x89\xe6\x9c\xa8\xe5\xbb\xba'
dependencies = []
files = []
hgrepos = []
issue_num = 34763
keywords = ['patch']
message_count = 8.0
messages = ['325992', '326010', '326011', '326034', '326055', '344144', '344391', '344440']
nosy_count = 10.0
nosy_names = ['lemburg', 'vstinner', 'benjamin.peterson', 'ezio.melotti', 'mrabarnett', 'steven.daprano', 'berker.peksag', 'xiang.zhang', 'johnlinp', '\xe8\x8d\x89\xe6\x9c\xa8\xe5\xbb\xba']
pr_nums = ['9474']
priority = 'normal'
resolution = 'not a bug'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue34763'
versions = ['Python 3.9']
```

ghost commented 6 years ago

This is a very easy issue.

丗 (U+4E17) means 30, so "丗".isnumeric() should return True, but it returns False.

malemburg commented 6 years ago

We use the Unicode database for these methods. Could you please check whether the database marks the character as numeric?

If yes, we may need to check the database generation.

Otherwise, there isn't much we can do, since we use the Unicode database as reference.
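One way to run that check against the database bundled with a given interpreter is sketched below. The helper name `describe_numeric` is hypothetical, and the result for U+4E17 depends on the Unicode version the interpreter was built with, so no output is assumed for it.

```python
import unicodedata


def describe_numeric(ch):
    """Return (code point label, str.isnumeric() result, numeric value or None)."""
    # unicodedata.numeric() accepts a default that is returned instead of
    # raising ValueError when the character has no numeric value.
    return (f"U+{ord(ch):04X}", ch.isnumeric(), unicodedata.numeric(ch, None))


# Version of the Unicode database this interpreter was built with.
print(unicodedata.unidata_version)

# U+5345 is the character compared against U+4E17 later in this thread.
for ch in "\u5345\u4e17":
    print(describe_numeric(ch))
```

If both `isnumeric()` and `unicodedata.numeric()` agree that a character has no value, the data itself does not mark it as numeric, as opposed to one method misreading it.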

Thanks -- Marc-Andre Lemburg


39d85a87-36ea-41b2-b2bb-2be43abb500e commented 6 years ago

Unicode 11.0.0 has 卅 (U+5345) as being numeric and having the value 30.

What's the difference between that and 丗 (U+4E17)?

I notice that they look a lot alike. Are they different variants, perhaps traditional vs simplified?

vstinner commented 6 years ago

$ ./python
Python 3.8.0a0 (heads/master-dirty:06e7608207, Sep 20 2018, 01:52:01) 
>>> import unicodedata
>>> unicodedata.unidata_version
'11.0.0'
>>> unicodedata.numeric('\u5345')
30.0
>>> unicodedata.numeric('\u4E17')
ValueError: not a numeric character

benjaminp commented 6 years ago

As I said on the PR, this is because Unicode gives U+4E17 (and other CJK ideographs) a numeric value only in the Unihan database, not the normal UCD. makeunicodedata.py only looks at the UCD for numeric values.

berkerpeksag commented 5 years ago

Tools/unicode/makeunicodedata.py looks at the Unihan database for the fields kAccountingNumeric, kOtherNumeric, and kPrimaryNumeric in Unihan_NumericValues.txt:

https://github.com/python/cpython/blob/549e55a3086d04c13da9b6f33214f6399681292a/Tools/unicode/makeunicodedata.py#L1107-L1119

And as of Unicode version 12.0.0, 0x4E17 isn't listed as numeric there:

...
U+4E00  kPrimaryNumeric 1
U+4E03  kPrimaryNumeric 7
U+4E07  kPrimaryNumeric 10000
U+4E09  kPrimaryNumeric 3
...

Is there another way to get this information, by using one of the fields shown at the following page?

http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=4E17
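To make the lookup concrete, here is a minimal sketch of reading lines in the format quoted above and checking whether a code point appears under one of the three numeric fields. It is not the code in makeunicodedata.py; the function name `parse_numeric_values` is hypothetical, the sample is limited to the four lines quoted in this comment, and each line is assumed to carry exactly one integer value.

```python
# The four sample lines quoted above; the real file is much longer.
SAMPLE = """\
U+4E00  kPrimaryNumeric 1
U+4E03  kPrimaryNumeric 7
U+4E07  kPrimaryNumeric 10000
U+4E09  kPrimaryNumeric 3
"""

# Field names mentioned in this comment.
NUMERIC_FIELDS = {"kAccountingNumeric", "kOtherNumeric", "kPrimaryNumeric"}


def parse_numeric_values(text):
    """Map code point (int) to numeric value (int) for Unihan-style lines."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        parts = line.split()
        # Assumption: exactly "U+XXXX field value"; anything else is ignored.
        if len(parts) != 3:
            continue
        code, field, value = parts
        if field in NUMERIC_FIELDS and code.startswith("U+"):
            values[int(code[2:], 16)] = int(value)
    return values


table = parse_numeric_values(SAMPLE)
print(0x4E17 in table)   # False: U+4E17 is not among the sample lines
print(table[0x4E07])     # 10000
```

A code point that never appears under one of these fields ends up with no numeric value in the generated tables, which matches the behaviour reported for U+4E17.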

693f263f-a080-47dc-8f14-c27d15b50a75 commented 5 years ago

"丗" means "30" in Japanese. However, it is also a variant of the Chinese character "世", which means "world" in Chinese.

I'm not sure if this information makes any difference.

zhangyangyu commented 5 years ago

unicode.org doesn't list "丗" as numeric, so I think there is nothing we can do.

python / cpython

Treat U+4E17 as a numeric value #78944