Open 6a5168ff-494a-4aaf-b54c-db28be967cf9 opened 14 years ago
using iconv:
$ printf "\xf9\xd8" | iconv -f big5 -t utf-8 | xxd
0000000: e8a3 8f ...
$ printf "\xe8\xa3\x8f" | iconv -f utf-8 -t big5 | xxd
0000000: f9d8 ..
using python
>>> print "\xf9\xd8".decode("big5")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'big5' codec can't decode bytes in position 0-1: illegal multibyte sequence
>>> print "\xe8\xa3\x8f".decode("utf-8").encode("big5")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'big5' codec can't encode character u'\u88cf' in position 0: illegal multibyte sequence
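The session above is Python 2. A rough Python 3 equivalent might look like the sketch below. The helper name `check_big5` is made up for illustration, and whether the `big5` calls fail depends on the mapping tables shipped with the interpreter you run it on.

```python
def check_big5(raw=b"\xf9\xd8", text="\u88cf", encoding="big5"):
    """Return (decoded, encoded); a slot is None if the codec rejects the input."""
    try:
        decoded = raw.decode(encoding)
    except UnicodeDecodeError:
        decoded = None  # the codec has no mapping for these bytes
    try:
        encoded = text.encode(encoding)
    except UnicodeEncodeError:
        encoded = None  # the codec has no mapping for this character
    return decoded, encoded


# (None, None) would correspond to the two tracebacks shown above.
print(check_big5())
```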
That iconv supports it is not convincing, IMO. Do you have other sources (like tables on the web somewhere) that support your request?
In particular, the Unicode consortium mapping table, now at
http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/OTHER/BIG5.TXT
doesn't map f9d8 to anything; the current version of that table (in unihan.zip) has these mappings for U+88CF:
U+88CF kCCCII 232E61
U+88CF kCNS1986 E-444E
U+88CF kCNS1992 3-444E
U+88CF kEACC 215763
U+88CF kGB1 3279
U+88CF kHKSCS F9D8
U+88CF kJis0 4602
U+88CF kKPS0 D9E0
U+88CF kKSC0 5574
U+88CF kTaiwanTelegraph 5937
U+88CF kXerox 241:102
As you can see, it isn't supported in big5.
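The lookup described above can be sketched in a few lines. This assumes the tab-separated `code point / field / value` layout of Unihan_OtherMappings.txt and uses a handful of rows copied from the list above as sample data. `load_mappings` is a hypothetical helper name, and `kBigFive` is assumed to be the field that would carry a plain Big5 mapping.

```python
from collections import defaultdict

# Sample rows taken from the comment above, tab-separated like the Unihan file.
SAMPLE = """\
U+88CF\tkCNS1992\t3-444E
U+88CF\tkGB1\t3279
U+88CF\tkHKSCS\tF9D8
U+88CF\tkJis0\t4602
"""


def load_mappings(lines):
    """Build {code point: {field: value}} from Unihan-style rows."""
    table = defaultdict(dict)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        codepoint, field, value = line.split("\t", 2)
        table[codepoint][field] = value
    return table


mappings = load_mappings(SAMPLE.splitlines())
entry = mappings["U+88CF"]
print("kBigFive" in entry)   # is there a plain Big5 mapping in the sample?
print(entry.get("kHKSCS"))   # the HKSCS mapping, if any
```

To run it against the real file, pass an open file object for Unihan_OtherMappings.txt to `load_mappings` instead of the sample rows.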
Sure. After being enlightened by your URL, which is marked OBSOLETE (see http://www.unicode.org/Public/MAPPINGS/EASTASIA/ReadMe.txt), I found http://unicode.org/charts/unihan.html, then http://www.unicode.org/Public/UNIDATA/, then http://www.unicode.org/Public/UNIDATA/Unihan.zip. Inside the zip, open Unihan_OtherMappings.txt. Big5 includes
which are listed in Unihan_OtherMappings.txt. HKSCS is one of the Big5 encodings, and searching for F9D8 I got U+88CF kHKSCS F9D8.
You may also want to update the other encoding map tables to catch up with Unihan_OtherMappings.txt.
Thanks for your quick reply, by the way.
perky, what do you think?
That iconv supports it is not convincing, ...
GNU libc is not convincing? What are you talking about?
I'm Taiwanese, F9D8 in big5 should be mapped to E8A38F in UTF-8.
I'm still looking for an official source of that.
>>> u"\u88cf".encode("big5hkscs")
'\xf9\xd8'
works fine (and always has been working fine), and the character clearly is in big5hkscs. According to
http://en.wikipedia.org/wiki/Big5
F9D8 is "Reserved for user-defined characters", so this suggests that the character does *not* have a fixed meaning in BIG-5. However, it is part of the Hong Kong Supplementary Character Set.
Inndy, you might also be talking about big5-2003, from
http://www.csie.ntu.edu.tw/~r92030/project/big5/
Python currently does not support big5-2003, but a contribution of such an encoding would surely be welcome.
As of Python 3.7.9 this also affects \xf9\xd6 which should be \u7881 in Unicode. This character is the second character of 宏碁 which is the name of the Taiwanese electronics manufacturer Acer.
You can work around the issue using big5hkscs just like with the original \xf9\xd8 problem.
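A minimal sketch of that workaround in Python 3, assuming the document is otherwise plain Big5. Note that `big5hkscs` is a different character set, so byte pairs other than the ones discussed here may decode differently from what a Big5-ETen author intended.

```python
# "宏" is encoded with the plain big5 codec; b"\xf9\xd6" is the byte pair the
# comment above reports for "碁" (U+7881).
raw = "宏".encode("big5") + b"\xf9\xd6"

# Decode with big5hkscs instead of big5 to pick up the extra mapping.
name = raw.decode("big5hkscs")
print(name)
```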
It looks like the F9D6–F9FE characters all come from the Big5-ETen extension (https://en.wikipedia.org/wiki/Big5#ETEN_extensions, https://moztw.org/docs/big5/table/eten.txt) which is so popular that it is a de facto standard. Big5-2003 (mentioned in another comment) seems to be an extension of Big5-ETen. For what it's worth, whatwg includes these mappings in their own big5 reference tables: https://encoding.spec.whatwg.org/big5.html.
Unfortunately Big5 is still in common use in Taiwan. It's pretty funny that Python fails to decode Big5 documents containing the name of one of Taiwan's largest multinationals :-)
It looks like the F9D6–F9FE characters all come from the Big5-ETen extension
One option would be to add a new big5eten encoding to Python. Someone has to implement the code.
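Until such a codec exists, one stopgap is a decode error handler layered on the existing `big5` codec. The sketch below is not a Big5-ETen codec: the handler name `big5_eten_fallback` and the table `ETEN_EXTRA` are made up for illustration, the table holds only the two byte pairs mentioned in this issue, and it only covers decoding.

```python
import codecs

# Only the two byte pairs discussed in this issue. A real Big5-ETen codec
# would need the full F9D6-F9FE range from an authoritative table.
ETEN_EXTRA = {
    b"\xf9\xd6": "\u7881",
    b"\xf9\xd8": "\u88cf",
}


def big5_eten_fallback(exc):
    """Decode-error handler: map known ETen byte pairs, re-raise otherwise."""
    if isinstance(exc, UnicodeDecodeError):
        # Look at the two bytes starting at the failure position, regardless
        # of how many bytes the codec itself reported as invalid.
        pair = bytes(exc.object[exc.start:exc.start + 2])
        if pair in ETEN_EXTRA:
            # Return the replacement text and the position to resume at.
            return ETEN_EXTRA[pair], exc.start + 2
    raise exc


codecs.register_error("big5_eten_fallback", big5_eten_fallback)

text = b"\xf9\xd8".decode("big5", errors="big5_eten_fallback")
print(text == "\u88cf")
```

Any byte pair not in the table still raises the original `UnicodeDecodeError`, so the handler does not hide unrelated decoding problems.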
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
GitHub fields:
```python
assignee = 'https://github.com/hyeshik'
closed_at = None
created_at =
labels = ['type-bug', 'expert-unicode']
title = 'Add Big5-ETen codec: Python big5 codec cannot decode \\xf9\\xd8 bytes (U+7881 expected)'
updated_at =
user = 'https://bugs.python.org/Xueferx'
```
bugs.python.org fields:
```python
activity =
actor = 'vstinner'
assignee = 'hyeshik.chang'
closed = False
closed_date = None
closer = None
components = ['Unicode']
creation =
creator = 'Xuefer.x'
dependencies = []
files = []
hgrepos = []
issue_num = 7856
keywords = []
message_count = 11.0
messages = ['98865', '98866', '98867', '98868', '98869', '98911', '218790', '218801', '218804', '388365', '388380']
nosy_count = 8.0
nosy_names = ['loewis', 'hyeshik.chang', 'vstinner', 'rpetrov', 'Xuefer.x', 'kennyluck', 'inndy', 'batterseapower']
pr_nums = []
priority = 'normal'
resolution = None
stage = None
status = 'open'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue7856'
versions = ['Python 2.7', 'Python 3.2', 'Python 3.3']
```