python / cpython

The Python programming language
https://www.python.org

Treat U+4E17 as a numeric value #78944

Closed ghost closed 5 years ago

ghost commented 6 years ago
BPO 34763
Nosy @malemburg, @vstinner, @benjaminp, @ezio-melotti, @stevendaprano, @berkerpeksag, @zhangyangyu, @johnlinp
PRs
  • python/cpython#9474
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:
    ```python
    assignee = None
    closed_at =
    created_at =
    labels = ['type-bug', 'invalid', '3.9', 'expert-unicode']
    title = 'Treat U+4E17 as a numeric value'
    updated_at =
    user = None
    ```

    bugs.python.org fields:
    ```python
    activity =
    actor = 'xiang.zhang'
    assignee = 'none'
    closed = True
    closed_date =
    closer = 'xiang.zhang'
    components = ['Unicode']
    creation =
    creator = '\xe8\x8d\x89\xe6\x9c\xa8\xe5\xbb\xba'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 34763
    keywords = ['patch']
    message_count = 8.0
    messages = ['325992', '326010', '326011', '326034', '326055', '344144', '344391', '344440']
    nosy_count = 10.0
    nosy_names = ['lemburg', 'vstinner', 'benjamin.peterson', 'ezio.melotti', 'mrabarnett', 'steven.daprano', 'berker.peksag', 'xiang.zhang', 'johnlinp', '\xe8\x8d\x89\xe6\x9c\xa8\xe5\xbb\xba']
    pr_nums = ['9474']
    priority = 'normal'
    resolution = 'not a bug'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue34763'
    versions = ['Python 3.9']
    ```

    ghost commented 6 years ago

    This is a very easy issue.

    丗 (U+4E17) means 30, so "丗".isnumeric() should return True, but it returns False.

    malemburg commented 6 years ago

    We use the Unicode database for these methods. Could you please check whether the database marks the character as numeric?

    If yes, we may need to check the database generation.

    Otherwise, there isn't much we can do, since we use the Unicode database as reference.
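One way to run the check suggested above is to ask the interpreter's bundled Unicode database directly. This is a minimal sketch, not part of the original thread: `describe_numeric` is a hypothetical helper name, and the result for U+4E17 depends on the Unicode version your Python build ships with.

```python
import unicodedata


def describe_numeric(ch):
    """Report what the bundled Unicode database says about one character."""
    return {
        "codepoint": f"U+{ord(ch):04X}",
        # Passing a default avoids the ValueError raised for non-numeric characters.
        "numeric": unicodedata.numeric(ch, None),
        "isnumeric": ch.isnumeric(),
    }


print("Unicode database version:", unicodedata.unidata_version)
for ch in ("\u5345", "\u4e17"):  # 卅 and 丗
    print(describe_numeric(ch))
```

If `numeric` comes back as `None` for a character, the database used to build that interpreter does not assign it a numeric value, and `str.isnumeric()` should be expected to return `False` for it.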

    Thanks -- Marc-Andre Lemburg


    39d85a87-36ea-41b2-b2bb-2be43abb500e commented 6 years ago

    Unicode 11.0.0 has 卅 (U+5345) as being numeric and having the value 30.

    What's the difference between that and 丗 (U+4E17)?

    I notice that they look a lot alike. Are they different variants, perhaps traditional vs simplified?

    vstinner commented 6 years ago

    $ ./python
    Python 3.8.0a0 (heads/master-dirty:06e7608207, Sep 20 2018, 01:52:01) 
    >>> import unicodedata
    >>> unicodedata.unidata_version
    '11.0.0'
    >>> unicodedata.numeric('\u5345')
    30.0
    >>> unicodedata.numeric('\u4E17')
    ValueError: not a numeric character

    benjaminp commented 6 years ago

    As I said on the PR, this is because Unicode gives U+4E17 (and other CJK ideographs) a numeric value only in the Unihan database, not the normal UCD. makeunicodedata.py only looks at the UCD for numeric values.

    berkerpeksag commented 5 years ago

    Tools/unicode/makeunicodedata.py looks at the Unihan database for the fields kAccountingNumeric, kOtherNumeric, and kPrimaryNumeric in Unihan_NumericValues.txt:

    https://github.com/python/cpython/blob/549e55a3086d04c13da9b6f33214f6399681292a/Tools/unicode/makeunicodedata.py#L1107-L1119

    And as of Unicode version 12.0.0, 0x4E17 isn't listed as numeric there:

    ...
    U+4E00  kPrimaryNumeric 1
    U+4E03  kPrimaryNumeric 7
    U+4E07  kPrimaryNumeric 10000
    U+4E09  kPrimaryNumeric 3
    ...
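To make the lookup described above concrete, here is a rough sketch of reading lines in that `code point, field, value` layout into a table. It is not the code from makeunicodedata.py: `parse_numeric_values` and `SAMPLE` are hypothetical names, the sample holds only the four lines quoted in this comment, and lines that do not have exactly three whitespace-separated fields are simply skipped.

```python
def parse_numeric_values(lines):
    """Map code point -> (field, value) from Unihan_NumericValues-style lines."""
    wanted = {"kAccountingNumeric", "kOtherNumeric", "kPrimaryNumeric"}
    table = {}
    for line in lines:
        line = line.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        parts = line.split()
        if len(parts) != 3:
            continue
        codepoint, field, value = parts
        if field in wanted and codepoint.startswith("U+"):
            table[int(codepoint[2:], 16)] = (field, int(value))
    return table


# Only the lines quoted in the thread; the real file is much longer.
SAMPLE = """\
U+4E00  kPrimaryNumeric 1
U+4E03  kPrimaryNumeric 7
U+4E07  kPrimaryNumeric 10000
U+4E09  kPrimaryNumeric 3
"""

table = parse_numeric_values(SAMPLE.splitlines())
print(0x4E17 in table)
```

A character gets a numeric value through this route only if it has an entry in the file, which is why U+4E17 ends up non-numeric when the file does not list it.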

    Is there another way to get this information, using one of the fields shown on the following page?

    http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=4E17

    693f263f-a080-47dc-8f14-c27d15b50a75 commented 5 years ago

    "丗" means "30" in Japanese. However, it is also a variant form of the Chinese character "世", which means "world" in Chinese.

    I'm not sure if this information makes any difference.

    zhangyangyu commented 5 years ago

    unicode.org doesn't list "丗" as numeric so I think there is nothing we can do.
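Since the resolution leaves the standard library unchanged, code that needs the Japanese reading could keep its own small override table. The following is only a sketch of that idea under stated assumptions: `EXTRA_NUMERIC` and `numeric_with_overrides` are hypothetical names, and the value 30 for U+4E17 is taken from the claims in this thread, not from the Unicode data files.

```python
import unicodedata

# Hypothetical, application-maintained overrides; not part of Python or the Unicode data.
EXTRA_NUMERIC = {"\u4e17": 30.0}  # 丗, value as claimed in this thread


def numeric_with_overrides(ch, default=None):
    """Prefer the bundled Unicode database, then fall back to local overrides."""
    value = unicodedata.numeric(ch, None)
    if value is not None:
        return value
    return EXTRA_NUMERIC.get(ch, default)


print(numeric_with_overrides("\u5345"))  # 卅, answered by the Unicode database
print(numeric_with_overrides("\u4e17"))  # 丗, answered by the override if the database has no value
```

Checking the database first means the override is only consulted for characters the database does not treat as numeric.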