python / cpython

The Python programming language
https://www.python.org

Add Big5-ETen codec: Python big5 codec cannot decode \xf9\xd8 bytes (U+7881 expected) #52104

Open 6a5168ff-494a-4aaf-b54c-db28be967cf9 opened 14 years ago

6a5168ff-494a-4aaf-b54c-db28be967cf9 commented 14 years ago
BPO 7856
Nosy @loewis, @hyeshik, @vstinner, @batterseapower

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:
```python
assignee = 'https://github.com/hyeshik'
closed_at = None
created_at =
labels = ['type-bug', 'expert-unicode']
title = 'Add Big5-ETen codec: Python big5 codec cannot decode \\xf9\\xd8 bytes (U+7881 expected)'
updated_at =
user = 'https://bugs.python.org/Xueferx'
```

bugs.python.org fields:
```python
activity =
actor = 'vstinner'
assignee = 'hyeshik.chang'
closed = False
closed_date = None
closer = None
components = ['Unicode']
creation =
creator = 'Xuefer.x'
dependencies = []
files = []
hgrepos = []
issue_num = 7856
keywords = []
message_count = 11.0
messages = ['98865', '98866', '98867', '98868', '98869', '98911', '218790', '218801', '218804', '388365', '388380']
nosy_count = 8.0
nosy_names = ['loewis', 'hyeshik.chang', 'vstinner', 'rpetrov', 'Xuefer.x', 'kennyluck', 'inndy', 'batterseapower']
pr_nums = []
priority = 'normal'
resolution = None
stage = None
status = 'open'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue7856'
versions = ['Python 2.7', 'Python 3.2', 'Python 3.3']
```

6a5168ff-494a-4aaf-b54c-db28be967cf9 commented 14 years ago

using iconv:
$ printf "\xf9\xd8" | iconv -f big5 -t utf-8 | xxd
0000000: e8a3 8f ...
$ printf "\xe8\xa3\x8f" | iconv -f utf-8 -t big5 | xxd
0000000: f9d8 ..

using python:
>>> print "\xf9\xd8".decode("big5")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'big5' codec can't decode bytes in position 0-1: illegal multibyte sequence
>>> print "\xe8\xa3\x8f".decode("utf-8").encode("big5")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'big5' codec can't encode character u'\u88cf' in position 0: illegal multibyte sequence
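
The session above uses Python 2 syntax. On Python 3, where byte strings are written as `b"..."` literals, the same check might be sketched as follows. The helper name `try_decode` is made up for illustration, and the exact error text can differ between Python versions.

```python
def try_decode(data: bytes, encoding: str) -> str:
    """Return the decoded text, or a short description of the failure."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        # Report the failure instead of raising, so both codecs can be compared.
        return f"<{type(exc).__name__}: {exc.reason}>"


# Compare the plain big5 codec with big5hkscs on the bytes from this report.
for codec in ("big5", "big5hkscs"):
    print(codec, ascii(try_decode(b"\xf9\xd8", codec)))
```

Printing with `ascii()` keeps the output readable on terminals that cannot display the decoded character.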
61337411-43fc-4a9c-b8d5-4060aede66d0 commented 14 years ago

That iconv supports it is not convincing, IMO. Do you have other sources (like tables in the web somewhere) that support your request?

61337411-43fc-4a9c-b8d5-4060aede66d0 commented 14 years ago

In particular, the Unicode consortium mapping table, now at

http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/OTHER/BIG5.TXT

doesn't map f9d8 to anything; the current version of that table (in unihan.zip) has these mappings for U+88CF:

U+88CF kCCCII 232E61
U+88CF kCNS1986 E-444E
U+88CF kCNS1992 3-444E
U+88CF kEACC 215763
U+88CF kGB1 3279
U+88CF kHKSCS F9D8
U+88CF kJis0 4602
U+88CF kKPS0 D9E0
U+88CF kKSC0 5574
U+88CF kTaiwanTelegraph 5937
U+88CF kXerox 241:102

As you can see, it isn't supported in big5.

6a5168ff-494a-4aaf-b54c-db28be967cf9 commented 14 years ago

Sure. Your URL pointed me in the right direction, but that table is marked OBSOLETE; see http://www.unicode.org/Public/MAPPINGS/EASTASIA/ReadMe.txt. From there I found http://unicode.org/charts/unihan.html, then http://www.unicode.org/Public/UNIDATA/, and then http://www.unicode.org/Public/UNIDATA/Unihan.zip. Inside the zip, open Unihan_OtherMappings.txt. Big5 includes

kBigFive

kHKSCS

both of which are listed in Unihan_OtherMappings.txt. HKSCS is one of the Big5 encodings, and searching for F9D8 gives: U+88CF kHKSCS F9D8

You may also want to update the other encoding map tables to catch up with Unihan_OtherMappings.txt.

thanks for your quick reply btw

61337411-43fc-4a9c-b8d5-4060aede66d0 commented 14 years ago

perky, what do you think?

d8d5aad8-e55b-4500-a3a0-9ea982d771ff commented 14 years ago

> That iconv supports it is not convincing, ...

GNU libc is not convincing? What are you talking about?

608e125a-d232-49fa-b637-ec37af4dee7f commented 10 years ago

I'm Taiwanese. F9D8 in big5 should be mapped to E8A38F in UTF-8 (U+88CF).

61337411-43fc-4a9c-b8d5-4060aede66d0 commented 10 years ago

I'm still looking for an official source of that.

>>> u"\u88cf".encode("big5hkscs")
'\xf9\xd8'

works fine (and always has), and the character clearly is in big5hkscs. According to

http://en.wikipedia.org/wiki/Big5

F9D8 is "Reserved for user-defined characters", so this suggests that the character does *not* have a fixed meaning in BIG-5. However, it is part of the Hong Kong Supplementary Character Set.

61337411-43fc-4a9c-b8d5-4060aede66d0 commented 10 years ago

Inndy, you might also be talking about big5-2003, from

http://www.csie.ntu.edu.tw/~r92030/project/big5/

Python currently does not support big5-2003, but a contribution of such an encoding would surely be welcome.

0d77b226-ca17-4273-83f5-43516d42e91a commented 3 years ago

As of Python 3.7.9 this also affects \xf9\xd6 which should be \u7881 in Unicode. This character is the second character of 宏碁 which is the name of the Taiwanese electronics manufacturer Acer.

You can work around the issue using big5hkscs just like with the original \xf9\xd8 problem.

It looks like the F9D6–F9FE characters all come from the Big5-ETen extension (https://en.wikipedia.org/wiki/Big5#ETEN_extensions, https://moztw.org/docs/big5/table/eten.txt), which is so popular that it is a de facto standard. Big5-2003 (mentioned in an earlier comment) seems to be an extension of Big5-ETen. For what it's worth, WHATWG includes these mappings in its own big5 reference tables: https://encoding.spec.whatwg.org/big5.html.

Unfortunately Big5 is still in common use in Taiwan. It's pretty funny that Python fails to decode Big5 documents containing the name of one of Taiwan's largest multinationals :-)
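
As a rough illustration of the big5hkscs workaround, here is a minimal sketch that tries the strict `big5` codec first and falls back to `big5hkscs` only when that fails. The function name `decode_big5_lenient` is hypothetical. It also assumes that re-decoding the whole input as big5hkscs is acceptable, which may not hold where the two tables disagree on other byte pairs.

```python
def decode_big5_lenient(data: bytes) -> str:
    """Decode as big5; on failure, retry the whole input as big5hkscs."""
    try:
        return data.decode("big5")
    except UnicodeDecodeError:
        # Assumption: the HKSCS table is an acceptable superset for this input.
        return data.decode("big5hkscs")


acer = "宏碁"
raw = acer.encode("big5hkscs")  # byte form of the company name
print(decode_big5_lenient(raw) == acer)
```

The fallback is all-or-nothing by design: it keeps the helper short, at the cost of applying the HKSCS mappings to the entire document rather than only to the bytes that failed.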

vstinner commented 3 years ago

> It looks like the F9D6–F9FE characters all come from the Big5-ETen extension

One option would be to add a new big5eten encoding to Python. Someone has to implement the code.
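
Until such a codec exists, a stopgap could be sketched with a custom error handler registered through `codecs.register_error`. This is not the proposed big5eten codec: the handler name `big5-eten-fallback` and the function `big5_eten_fallback` are made up for illustration, and the sketch borrows the `big5hkscs` table for the F9D6–F9FE range, so individual mappings in that range may differ from the ETen table.

```python
import codecs


def big5_eten_fallback(exc):
    """Error handler: retry byte pairs in F9D6-F9FE with the big5hkscs codec."""
    if isinstance(exc, UnicodeDecodeError):
        # Look at the two bytes starting at the failure position, regardless
        # of how many bytes the codec itself reported as invalid.
        pair = exc.object[exc.start:exc.start + 2]
        if len(pair) == 2 and pair[0] == 0xF9 and 0xD6 <= pair[1] <= 0xFE:
            # Return the replacement text and the position to resume from.
            return pair.decode("big5hkscs"), exc.start + 2
    raise exc


codecs.register_error("big5-eten-fallback", big5_eten_fallback)

data = "中".encode("big5") + b"\xf9\xd6"
text = data.decode("big5", errors="big5-eten-fallback")
print(ascii(text))
```

Unlike switching the whole document to big5hkscs, this keeps the plain `big5` table for everything outside the ETen range. It only helps with decoding; encoding the same characters back to Big5 would need a separate handler or a real codec.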