python / cpython

The Python programming language
https://www.python.org

makeunicodedata.py does not support Unihan digit data #54784

Closed malemburg closed 13 years ago

malemburg commented 13 years ago
BPO 10575
Nosy @malemburg, @loewis, @abalkin, @ezio-melotti

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:
```python
assignee = None
closed_at =
created_at =
labels = ['invalid', 'expert-unicode']
title = 'makeunicodedata.py does not support Unihan digit data'
updated_at =
user = 'https://github.com/malemburg'
```

bugs.python.org fields:
```python
activity =
actor = 'loewis'
assignee = 'none'
closed = True
closed_date =
closer = 'loewis'
components = ['Unicode']
creation =
creator = 'lemburg'
dependencies = []
files = []
hgrepos = []
issue_num = 10575
keywords = []
message_count = 13.0
messages = ['122786', '122809', '122811', '122812', '122827', '122839', '122851', '122859', '122862', '122863', '122866', '122867', '122868']
nosy_count = 4.0
nosy_names = ['lemburg', 'loewis', 'belopolsky', 'ezio.melotti']
pr_nums = []
priority = 'normal'
resolution = 'not a bug'
stage = None
status = 'closed'
superseder = None
type = None
url = 'https://bugs.python.org/issue10575'
versions = ['Python 2.7', 'Python 3.2', 'Python 3.3']
```

malemburg commented 13 years ago

The script only patches numeric data into the table (field 8), but does not update the digit field (field 7).

As a result, ideographs used for Chinese digits are not recognized as digits and not evaluated by int(), long() and float():

http://en.wikipedia.org/wiki/Numbers_in_Chinese_culture
>>> unicode('三', 'utf-8')
u'\u4e09'

>>> int(unicode('三', 'utf-8'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'decimal' codec can't encode character u'\u4e09' in position 0: invalid decimal Unicode string
> <stdin>(1)<module>()

>>> import unicodedata
>>> unicodedata.digit(unicode('三', 'utf-8'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: not a digit

The code point refers to the digit 3.
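
For anyone reproducing this on Python 3, where `str` is already Unicode, a minimal sketch of the same check follows. The helpers `describe` and `parses_as_int` are names made up for this sketch, not part of any API. The lookups are given a default of `None` so that they return a value instead of raising ValueError.

```python
import unicodedata


def describe(ch):
    """Return (decimal, digit) for ch, with None where the property is unset.

    Hypothetical helper for this sketch only.
    """
    return (unicodedata.decimal(ch, None), unicodedata.digit(ch, None))


def parses_as_int(text):
    """Return True if int() accepts text as a base-10 literal."""
    try:
        int(text)
    except ValueError:
        return False
    return True


han_three = "\u4e09"   # the ideograph from the report above
ascii_three = "3"

print(describe(ascii_three), parses_as_int(ascii_three))  # (3, 3) True
print(describe(han_three), parses_as_int(han_three))      # (None, None) False
```

So the behaviour reported for 2.7 carries over: neither the decimal nor the digit property is set for U+4E09, and int() rejects it.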

malemburg commented 13 years ago

The code point is also not listed as decimal digit (relevant for the int() decimal parsing):

>>> unicodedata.decimal(unicode('三', 'utf-8'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: not a decimal

This is the relevant part of the script:

        for line in open(unihan):
            if not line.startswith('U+'):
                continue
            code, tag, value = line.split(None, 3)[:3]
            if tag not in ('kAccountingNumeric', 'kPrimaryNumeric',
                           'kOtherNumeric'):
                continue
            value = value.strip().replace(',', '')
            i = int(code[2:], 16)
            # Patch the numeric field
            if table[i] is not None:
                table[i][8] = value

The decimal column is not set for code points that have a kPrimaryNumeric value set. Position table[i][8] refers to the numeric database entry, which correctly gives:

>>> unicodedata.numeric(unicode('三', 'utf-8'))
3.0
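
To experiment with the patching logic outside makeunicodedata.py, here is a self-contained sketch of the same loop that collects the values into a dict instead of writing to `table`. The function name `parse_unihan_numeric` and the sample lines are assumptions for illustration: the lines only follow the whitespace-separated Unihan layout and are not copied from a real data file.

```python
NUMERIC_TAGS = ("kAccountingNumeric", "kPrimaryNumeric", "kOtherNumeric")


def parse_unihan_numeric(lines):
    """Map code point -> numeric value string for the Unihan numeric tags.

    Hypothetical helper that mirrors the loop quoted above.
    """
    values = {}
    for line in lines:
        if not line.startswith("U+"):
            continue  # skip comments and blank lines
        code, tag, value = line.split(None, 3)[:3]
        if tag not in NUMERIC_TAGS:
            continue  # not one of the three numeric tags
        # Only a numeric value is recorded; nothing here could fill
        # the decimal or digit columns.
        values[int(code[2:], 16)] = value.strip().replace(",", "")
    return values


# Illustrative input in Unihan format (made up for this sketch).
sample = [
    "# comment line",
    "U+4E09\tkPrimaryNumeric\t3",
    "U+4E09\tkSomeOtherTag\tignored",
    "U+842C\tkPrimaryNumeric\t10,000",
]

result = parse_unihan_numeric(sample)
print(result)  # {19977: '3', 33836: '10000'}
```

The result carries exactly one value per code point, which is why only the numeric entry (position 8) can be patched from this data.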

malemburg commented 13 years ago

Here's a quick overview of the fields that are set for U+4E09:

http://www.fileformat.info/info/unicode/char/4e09/index.htm

malemburg commented 13 years ago

This is the definition of kPrimaryNumeric

http://ftp.lanet.lv/ftp/mirror/unicode/5.0.0/ucd/Unihan.html#kPrimaryNumeric

abalkin commented 13 years ago

I am adding bpo-10552 as a dependency because I think we should fix unicode data generation in 3.x before adding new features to the scripts.

I am also not sure whether this is a bug or a feature request. Martin?

malemburg commented 13 years ago

Alexander Belopolsky wrote:

> Alexander Belopolsky <belopolsky@users.sourceforge.net> added the comment:
>
> I am adding bpo-10552 as a dependency because I think we should fix unicode data generation in 3.x before adding new features to the scripts.
>
> I am also not sure whether this is a bug or a feature request. Martin?

I consider this a bug (which is why I added Python 2.7 to the list of versions), since those code points need to be mapped to decimal and digit as well (see the references I posted; and compare ).

Both Chinese and Japanese use the 4E00 ff. code points as decimal code points.

abalkin commented 13 years ago

On Mon, Nov 29, 2010 at 1:29 PM, Marc-Andre Lemburg <report@bugs.python.org> wrote: ..

> I consider this a bug (which is why I added Python 2.7 to the list of versions), since those code points need to be mapped to decimal and digit as well (see the references I posted; and compare ).

I don't disagree. However, using Unicode 5.2.0 instead of the latest 6.0.0 may be considered a bug as well. The practical issue is whether to maintain two separate versions of Tools/unicode for 3.x and 2.7, or to merge the 3.x changes back to 2.7 and support 3.x using 2to3. Another option is to use only 2.7 (or only 3.x) with Tools/unicode and control the differences between 2.7 and 3.x with a command line switch.

61337411-43fc-4a9c-b8d5-4060aede66d0 commented 13 years ago

> I am adding bpo-10552 as a dependency because I think we should fix unicode data generation in 3.x before adding new features to the scripts.
>
> I am also not sure whether this is a bug or a feature request. Martin?

I fail to see the relevance of gencodec to this issue (and, as you see in my comment to bpo-10552, I very much fail to see the relevance of that issue, or of gencodec in the first place).

61337411-43fc-4a9c-b8d5-4060aede66d0 commented 13 years ago

This is not a bug, see

http://www.unicode.org/reports/tr44/#Numeric_Value

Characters have a Numeric_Type property of either null, Decimal, Digit, or Numeric. For non-Unihan characters, this is denoted by filling out either no column, or (6, 7, and 8), or (7 and 8), or (8), respectively, as implemented by makeunicodedata.py. Unihan characters have only null or Numeric as their Numeric_Type property, never Decimal nor Digit, see

http://www.unicode.org/reports/tr44/#Numeric_Type_Han

Therefore, it is correct that digit() raises a ValueError for U+4e09.
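
As an illustration of the column rule, the Numeric_Type can be read off fields 6, 7 and 8 of a UnicodeData.txt record. This is only a sketch: `numeric_type` is a name invented here, the first four records are written in the UnicodeData.txt layout, and the U+4E09 record is a made-up stand-in for what the table holds after the Unihan value has been patched into field 8.

```python
def numeric_type(record):
    """Derive Numeric_Type from one semicolon-separated record.

    Fields 6, 7 and 8 are the decimal, digit and numeric columns.
    Hypothetical helper for this sketch only.
    """
    fields = record.split(";")
    decimal, digit, numeric = fields[6], fields[7], fields[8]
    if decimal:
        return "Decimal"   # columns 6, 7 and 8 filled
    if digit:
        return "Digit"     # columns 7 and 8 filled
    if numeric:
        return "Numeric"   # only column 8 filled
    return None            # no numeric column filled


records = [
    "0033;DIGIT THREE;Nd;0;EN;;3;3;3;N;;;;;",
    "00B2;SUPERSCRIPT TWO;No;0;EN;<super> 0032;;2;2;N;SUPERSCRIPT DIGIT TWO;;;;",
    "00BD;VULGAR FRACTION ONE HALF;No;0;ON;<fraction> 0031 2044 0032;;;1/2;N;FRACTION ONE HALF;;;;",
    "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;",
    # Made-up record: U+4E09 with only field 8 patched in from Unihan.
    "4E09;<CJK Ideograph>;Lo;0;L;;;;3;N;;;;;",
]

types = [numeric_type(r) for r in records]
print(types)  # ['Decimal', 'Digit', 'Numeric', None, 'Numeric']
```

With only field 8 set, U+4E09 comes out as Numeric, which matches what digit() and decimal() report.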

malemburg commented 13 years ago

Alexander Belopolsky wrote:

> Alexander Belopolsky <belopolsky@users.sourceforge.net> added the comment:
>
> On Mon, Nov 29, 2010 at 1:29 PM, Marc-Andre Lemburg <report@bugs.python.org> wrote: ..
>
>> I consider this a bug (which is why I added Python 2.7 to the list of versions), since those code points need to be mapped to decimal and digit as well (see the references I posted; and compare ).
>
> I don't disagree. However using Unicode 5.2.0 instead of the latest 6.0.0 may be considered a bug as well.

No, since we only ever change the UCD version once per Python release.

Note that those standards don't have a version number just for the fun of it. Each version is a standard of its own and only patch level updates will go into it.

It's not a bug to stick to an older UCD version.

> The practical issue is whether to maintain two separate versions of Tools/unicode for 3.x and 2.7 or merge 3.x changes back to 2.7 and support 3.x using 2to3. Another option is to simply use only 2.7 (or only 3.x) with Tools/unicode and maintain control the differences between 2.7 and 3.x using a command line switch.

I'm not sure whether the effort is worth it. We don't run those tools often enough to invest much time into keeping them in sync between 2.x and 3.x.

abalkin commented 13 years ago

> I fail to see the relevance of gencodec to this issue ...

Thanks for the explanation. I wrongly assumed that "make all" is the way to regenerate both unicodedata and the encodings and that the two are interdependent.

malemburg commented 13 years ago

Martin v. Löwis wrote:

> Martin v. Löwis <martin@v.loewis.de> added the comment:
>
> This is not a bug, see
>
> http://www.unicode.org/reports/tr44/#Numeric_Value
>
> Characters have a Numeric_Type property of either null, Decimal, Digit, or Numeric. For non-Unihan characters, this is denoted by filling out either no column, or (6, 7, and 8), or (7 and 8), or (8), respectively, as implemented by makeunicodedata.py. Unihan characters have only null or Numeric as their Numeric_Type property, never Decimal nor Digit, see
>
> http://www.unicode.org/reports/tr44/#Numeric_Type_Han
>
> Therefore, it is correct that digit() raises a ValueError for U+4e09.

You're right. I guess this is a bug in the UCD or TR44/TR38 itself.

It looks like the numeric properties are not separated in the Unihan database in the same way they are for the standard UCD.

Unihan separates based on usage context, whereas UCS takes a parsing approach.

61337411-43fc-4a9c-b8d5-4060aede66d0 commented 13 years ago

> Thanks for the explanation. I wrongly assumed that "make all" is the way to regenerate both unicodedata and the encodings and that the two are interdependent.

Ah. I never use the Makefile.