Closed malemburg closed 13 years ago
The script only patches numeric data into the table (field 8), but does not update the digit field (field 7).
As a result, ideographs used for Chinese digits are not recognized as digits and not evaluated by int(), long() and float():
http://en.wikipedia.org/wiki/Numbers_in_Chinese_culture
>>> unicode('三', 'utf-8')
u'\u4e09'
>>> int(unicode('三', 'utf-8'))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'decimal' codec can't encode character u'\u4e09' in position 0: invalid decimal Unicode string
> <stdin>(1)<module>()
>>> import unicodedata
>>> unicodedata.digit(unicode('三', 'utf-8'))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: not a digit
The code point refers to the digit 3.
The code point is also not listed as decimal digit (relevant for the int() decimal parsing):
>>> unicodedata.decimal(unicode('三', 'utf-8'))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: not a decimal
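For anyone reproducing this on Python 3, the same checks can be sketched as below. This is a minimal sketch, not part of the original report: it assumes a Python 3 interpreter (where `unicode()` and `long()` no longer exist) and uses the optional default argument of `unicodedata.digit()` and `unicodedata.decimal()` instead of catching the exceptions.

```python
import unicodedata

ch = "\u4e09"  # 三, the CJK ideograph for "three"

# The numeric lookup succeeds for this code point.
numeric_value = unicodedata.numeric(ch)

# The digit and decimal lookups have no entry, so the supplied default comes back.
digit_value = unicodedata.digit(ch, None)
decimal_value = unicodedata.decimal(ch, None)

# int() only parses characters that have a decimal value, so it rejects the ideograph.
try:
    int(ch)
    int_accepts = True
except ValueError:
    int_accepts = False

print(numeric_value, digit_value, decimal_value, int_accepts)
```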
This is the relevant part of the script:
    for line in open(unihan):
        if not line.startswith('U+'):
            continue
        code, tag, value = line.split(None, 3)[:3]
        if tag not in ('kAccountingNumeric', 'kPrimaryNumeric',
                       'kOtherNumeric'):
            continue
        value = value.strip().replace(',', '')
        i = int(code[2:], 16)
        # Patch the numeric field
        if table[i] is not None:
            table[i][8] = value
The decimal column is not set for code points that have a kPrimaryNumeric value set. Position table[i][8] refers to the numeric database entry, which correctly gives:
>>> unicodedata.numeric(unicode('三', 'utf-8'))
3.0
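To make the effect of that loop concrete, here is a self-contained sketch that runs the same logic over a tiny in-memory sample. The sample lines, the `kSomeOtherTag` tag, and the dict-based `table` are assumptions made for illustration only; the real script reads the Unihan data file and indexes a list by code point.

```python
import io

# Illustrative input in the "U+XXXX<TAB>tag<TAB>value" layout (not a real file excerpt).
sample_unihan = io.StringIO(
    "# comment lines are skipped\n"
    "U+4E09\tkPrimaryNumeric\t3\n"
    "U+4E09\tkSomeOtherTag\tignored\n"  # hypothetical tag, filtered out below
)

NUMERIC_TAGS = ("kAccountingNumeric", "kPrimaryNumeric", "kOtherNumeric")

# Hypothetical stand-in for the script's table: code point -> 15 record fields.
table = {0x4E09: ["4E09", "<CJK Ideograph>", "Lo", "0", "L"] + [""] * 10}

for line in sample_unihan:
    if not line.startswith("U+"):
        continue
    code, tag, value = line.split(None, 3)[:3]
    if tag not in NUMERIC_TAGS:
        continue
    value = value.strip().replace(",", "")
    i = int(code[2:], 16)
    # Only the numeric field (8) is patched; fields 6 and 7 stay empty.
    record = table.get(i)
    if record is not None:
        record[8] = value

decimal_field, digit_field, numeric_field = table[0x4E09][6:9]
print(repr(decimal_field), repr(digit_field), repr(numeric_field))
```

After the loop, only the numeric column carries a value, which matches the behaviour reported above.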
Here's a quick overview of the fields that are set for U+4E09:
This is the definition of kPrimaryNumeric
http://ftp.lanet.lv/ftp/mirror/unicode/5.0.0/ucd/Unihan.html#kPrimaryNumeric
I am adding bpo-10552 as a dependency because I think we should fix unicode data generation in 3.x before adding new features to the scripts.
I am also not sure whether this is a bug or a feature request. Martin?
Alexander Belopolsky wrote:
> Alexander Belopolsky <belopolsky@users.sourceforge.net> added the comment:
>
> I am adding bpo-10552 as a dependency because I think we should fix unicode data generation in 3.x before adding new features to the scripts.
> I am also not sure whether this is a bug or a feature request. Martin?
I consider this a bug (which is why I added Python 2.7 to the list of versions), since those code points need to be mapped to decimal and digit as well (see the references I posted; and compare ).
Both Chinese and Japanese use the 4E00 ff. code points as decimal code points.
On Mon, Nov 29, 2010 at 1:29 PM, Marc-Andre Lemburg <report@bugs.python.org> wrote: ..
> I consider this a bug (which is why I added Python 2.7 to the list of versions), since those code points need to be mapped to decimal and digit as well (see the references I posted; and compare ).
I don't disagree. However, using Unicode 5.2.0 instead of the latest 6.0.0 may be considered a bug as well. The practical issue is whether to maintain two separate versions of Tools/unicode for 3.x and 2.7, or to merge the 3.x changes back to 2.7 and support 3.x using 2to3. Another option is to use only 2.7 (or only 3.x) with Tools/unicode and control the differences between 2.7 and 3.x using a command line switch.
> I am adding bpo-10552 as a dependency because I think we should fix unicode data generation in 3.x before adding new features to the scripts.
> I am also not sure whether this is a bug or a feature request. Martin?
I fail to see the relevance of gencodec to this issue (and, as you see in my comment to bpo-10552, I very much fail to see the relevance of that issue, or of gencodec in the first place).
This is not a bug, see
http://www.unicode.org/reports/tr44/#Numeric_Value
Characters have a Numeric_Type property of either null, Decimal, Digit, or Numeric. For non-Unihan characters, this is denoted by filling out either no column, or (6, 7, and 8), or (7 and 8), or (8), respectively, as implemented by makeunicodedata.py. Unihan characters have only null or Numeric as their Numeric_Type property, never Decimal or Digit, see
http://www.unicode.org/reports/tr44/#Numeric_Type_Han
Therefore, it is correct that digit() raises a ValueError for U+4e09.
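The column rule described above can be sketched as a small classifier. This is an illustration under stated assumptions, not code from makeunicodedata.py: `numeric_type` is a hypothetical helper, the three sample records are written in the semicolon-separated UnicodeData.txt layout, and the U+4E09 record is a hand-built stand-in for a table entry after the Unihan patch has filled in field 8 only.

```python
def numeric_type(fields):
    """Return the Numeric_Type implied by fields 6, 7 and 8 of a record."""
    decimal, digit, numeric = fields[6], fields[7], fields[8]
    if decimal and digit and numeric:
        return "Decimal"   # columns 6, 7 and 8 filled
    if digit and numeric:
        return "Digit"     # columns 7 and 8 filled
    if numeric:
        return "Numeric"   # only column 8 filled
    return None            # no numeric column filled


# Sample records in UnicodeData.txt layout (treat as illustrative data).
samples = {
    "DIGIT FIVE": "0035;DIGIT FIVE;Nd;0;EN;;5;5;5;N;;;;;",
    "SUPERSCRIPT TWO":
        "00B2;SUPERSCRIPT TWO;No;0;EN;<super> 0032;;2;2;N;SUPERSCRIPT DIGIT TWO;;;;",
    "VULGAR FRACTION ONE HALF":
        "00BD;VULGAR FRACTION ONE HALF;No;0;ON;<fraction> 0031 2044 0032;;;1/2;N;FRACTION ONE HALF;;;;",
    "LATIN SMALL LETTER A": "0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041",
}
results = {name: numeric_type(line.split(";")) for name, line in samples.items()}

# Hypothetical record for U+4E09 after the Unihan patch: only field 8 is set.
han_three = ["4E09", "<CJK Ideograph>", "Lo", "0", "L", "", "", "", "3"] + [""] * 6
results["U+4E09"] = numeric_type(han_three)

for name, kind in results.items():
    print(name, "->", kind)
```

Under this rule the ideograph lands in the Numeric class, alongside characters such as the vulgar fractions, rather than in Decimal or Digit.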
Alexander Belopolsky wrote:
> Alexander Belopolsky <belopolsky@users.sourceforge.net> added the comment:
>
> On Mon, Nov 29, 2010 at 1:29 PM, Marc-Andre Lemburg <report@bugs.python.org> wrote: ..
>> I consider this a bug (which is why I added Python 2.7 to the list
>> of versions), since those code points need to be mapped to decimal
>> and digit as well (see the references I posted; and compare ).
>
> I don't disagree. However using Unicode 5.2.0 instead of the latest 6.0.0 may be considered a bug as well.
No, since we only ever change the UCD version once per Python release.
Note that those standards don't have a version number just for the fun of it. Each version is a standard of its own, and only patch-level updates will go into it.
It's not a bug to stick to an older UCD version.
> The practical issue is whether to maintain two separate versions of Tools/unicode for 3.x and 2.7, or to merge the 3.x changes back to 2.7 and support 3.x using 2to3. Another option is to use only 2.7 (or only 3.x) with Tools/unicode and control the differences between 2.7 and 3.x using a command line switch.
I'm not sure whether the effort is worth it. We don't run those tools often enough to invest much time into keeping them in sync between 2.x and 3.x.
> I fail to see the relevance of gencodec to this issue ...
Thanks for the explanation. I wrongly assumed that "make all" is the way to regenerate both unicodedata and the encodings and that the two are interdependent.
Martin v. Löwis wrote:
> Martin v. Löwis <martin@v.loewis.de> added the comment:
>
> This is not a bug, see
> http://www.unicode.org/reports/tr44/#Numeric_Value
> Characters have a Numeric_Type property of either null, Decimal, Digit, or Numeric. For non-Unihan characters, this is denoted by filling out either no column, or (6, 7, and 8), or (7 and 8), or (8), respectively, as implemented by makeunicodedata.py. Unihan characters have only null or Numeric as their Numeric_Type property, never Decimal or Digit, see
> http://www.unicode.org/reports/tr44/#Numeric_Type_Han
> Therefore, it is correct that digit() raises a ValueError for U+4e09.
You're right. I guess this is a bug in the UCD or TR44/TR38 itself.
It looks like the numeric properties are not separated in the Unihan database in the same way they are for the standard UCD.
Unihan separates based on usage context, whereas UCS takes a parsing approach.
> Thanks for the explanation. I wrongly assumed that "make all" is the way to regenerate both unicodedata and the encodings and that the two are interdependent.
Ah. I never use the Makefile.
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
GitHub fields:
```python
assignee = None
closed_at =
created_at =
labels = ['invalid', 'expert-unicode']
title = 'makeunicodedata.py does not support Unihan digit data'
updated_at =
user = 'https://github.com/malemburg'
```
bugs.python.org fields:
```python
activity =
actor = 'loewis'
assignee = 'none'
closed = True
closed_date =
closer = 'loewis'
components = ['Unicode']
creation =
creator = 'lemburg'
dependencies = []
files = []
hgrepos = []
issue_num = 10575
keywords = []
message_count = 13.0
messages = ['122786', '122809', '122811', '122812', '122827', '122839', '122851', '122859', '122862', '122863', '122866', '122867', '122868']
nosy_count = 4.0
nosy_names = ['lemburg', 'loewis', 'belopolsky', 'ezio.melotti']
pr_nums = []
priority = 'normal'
resolution = 'not a bug'
stage = None
status = 'closed'
superseder = None
type = None
url = 'https://bugs.python.org/issue10575'
versions = ['Python 2.7', 'Python 3.2', 'Python 3.3']
```