python / cpython

The Python programming language
https://www.python.org

No EUDC (HKSCS) support in Windows cp950 #72879

Open e89516b6-0232-436d-aee8-86acbe2ef142 opened 8 years ago

e89516b6-0232-436d-aee8-86acbe2ef142 commented 8 years ago
BPO 28693
Nosy @pfmoore, @vstinner, @tjguk, @ezio-melotti, @zware, @zooba, @Artoria2e5

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:
```python
assignee = None
closed_at = None
created_at =
labels = ['type-bug', '3.7', 'expert-unicode', 'OS-windows']
title = 'No EUDC (HKSCS) support in Windows cp950'
updated_at =
user = 'https://github.com/Artoria2e5'
```

bugs.python.org fields:
```python
activity =
actor = 'Artoria2e5'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Unicode', 'Windows']
creation =
creator = 'Artoria2e5'
dependencies = []
files = []
hgrepos = []
issue_num = 28693
keywords = []
message_count = 4.0
messages = ['280811', '280828', '280969', '281654']
nosy_count = 7.0
nosy_names = ['paul.moore', 'vstinner', 'tim.golden', 'ezio.melotti', 'zach.ware', 'steve.dower', 'Artoria2e5']
pr_nums = []
priority = 'normal'
resolution = None
stage = None
status = 'open'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue28693'
versions = ['Python 2.7', 'Python 3.5', 'Python 3.6', 'Python 3.7']
```

e89516b6-0232-436d-aee8-86acbe2ef142 commented 8 years ago

Python's cp950 implementation lacks support for HKSCS ('big5hkscs'). This support, which maps HKSCS Big5-EUDC code points to Unicode PUA code points algorithmically, is found in Windows Vista+ as well as an update for XP.

An experiment session is shown below. I will use '2>>>' to denote a Win32 build of Python 2.7.10 running in a console window set to cp950 (via chcp), and '3>>>' to denote a Python 3.4.3 build running under Cygwin's UTF-8 mintty. The HKSCS-2008 table at http://www.ogcio.gov.hk/en/business/tech_promotion/ccli/terms/doc/hkscs-2008-big5-iso.txt is used as the list of HKSCS characters; note, though, that its non-PUA mappings are not found in Windows.

Let's start with the first character in that list.

3>>> u'\u43F0'
'䏰'
3>>> print(u'\uF266') # provisional PUA

3>>> u'\u43F0'.encode('cp950') # FAIL
3>>> u'\uF266'.encode('cp950') # FAIL
3>>> u'\u43F0'.encode('hkscs')
b'\x87@'
3>>> u'\uF266'.encode('hkscs') # FAIL
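
For anyone who wants to repeat the Python 3 half of this session as a script, here is a minimal sketch. The helper name `try_encode` is made up for illustration; it only wraps `str.encode` and reports a rejected character as `None` instead of raising. What it prints for cp950 and for the PUA code point depends on the Python build, so no output is claimed beyond what the session above shows.

```python
def try_encode(text, encoding):
    """Return the encoded bytes, or None if the codec rejects the text.

    Hypothetical helper for this issue; not part of CPython.
    """
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        return None


if __name__ == "__main__":
    # U+43F0 is the first character in the HKSCS-2008 table;
    # U+F266 is the PUA code point that Windows cp950 uses for b'\x87@'.
    for ch in ("\u43f0", "\uf266"):
        for enc in ("cp950", "big5hkscs"):
            print("U+%04X" % ord(ch), enc, try_encode(ch, enc))
```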

These experiments above show how Python 3 handles HKSCS characters, and how U+43F0 should normally be encoded. Now let's switch to Windows console, which would be using Windows' decode-to-Unicode routine for cp950.

2>>> print b'\x87@'


Let's try to identify this character:

3>>> u''
'\uf266'

So indeed there is some sort of HKSCS support going on. But note that what Windows has is not really any newer version of HKSCS:

| Big5 | ucs93 | ucs00 | ucs03 + 1-6 |
|------|-------|-------|-------------|
| 876B | 9734  | 9734  | 9734        |
| 876C | F292  | F292  | 27BEF       |
| 876D | 5BDB  | 5BDB  | 5BDB        |

2>>> print b'\x87\x6b,\x87\x6c,\x87\x6d'
,,
3>>> u',,'
'\uf291,\uf292,\uf293'

Just as for all other code pages, you can always find Microsoft's mapping at ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit950.txt. If you are uncomfortable with adding a whole new table and wasting space (this is done for hkscs btw), use the algorithmic mapping at https://en.wikipedia.org/wiki/Code_page_950.
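
A rough sketch of what the algorithmic mapping could look like in Python follows. The block boundaries and PUA base values are assumptions taken from the Wikipedia article linked above, not from CPython or from Microsoft documentation, and the names `eudc_to_pua`, `trail_index` and `EUDC_BLOCKS` are hypothetical. The sketch does agree with the data points in this report (0x8740 gives U+F266, and 0x876B to 0x876D give U+F291 to U+F293).

```python
# Sketch of the cp950 EUDC -> PUA mapping. Ranges and bases are assumptions
# based on the Wikipedia "Code page 950" article; names are hypothetical.

TRAILS_PER_LEAD = 157  # trail bytes 0x40-0x7E (63) plus 0xA1-0xFE (94)

# (first lead byte, last lead byte, PUA code point of <first lead> 0x40)
EUDC_BLOCKS = [
    (0xFA, 0xFE, 0xE000),
    (0x8E, 0xA0, 0xE311),
    (0x81, 0x8D, 0xEEB8),
]


def trail_index(trail):
    """Position of a trail byte within its row (0..156)."""
    if 0x40 <= trail <= 0x7E:
        return trail - 0x40
    if 0xA1 <= trail <= 0xFE:
        return trail - 0xA1 + 63
    raise ValueError("invalid Big5 trail byte: 0x%02X" % trail)


def eudc_to_pua(lead, trail):
    """Map a Big5 EUDC byte pair to a PUA code point, or raise ValueError."""
    idx = trail_index(trail)
    for first, last, base in EUDC_BLOCKS:
        if first <= lead <= last:
            return base + (lead - first) * TRAILS_PER_LEAD + idx
    # The last block starts mid-row at 0xC6A1 and runs to 0xC8FE.
    if 0xC6 <= lead <= 0xC8:
        offset = (lead - 0xC6) * TRAILS_PER_LEAD + idx - 63
        if offset >= 0:
            return 0xF6B1 + offset
    raise ValueError("not an EUDC code: 0x%02X%02X" % (lead, trail))


if __name__ == "__main__":
    for code in (0x8740, 0x876B, 0x876C, 0x876D):
        print("%04X -> U+%04X" % (code, eudc_to_pua(code >> 8, code & 0xFF)))
```

Since the mapping is plain row arithmetic, the reverse direction (PUA to Big5) can be derived from the same block list without storing a table.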

vstinner commented 8 years ago

Python supports native Windows code pages through the codecs.code_page_encode() and codecs.code_page_decode() functions. See for example Lib/encodings/cp65001.py: this codec is not implemented in Python, but is a wrapper around native Windows functions (MultiByteToWideChar and WideCharToMultiByte).
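
As a minimal sketch of that approach, the snippet below calls `codecs.code_page_decode()` with code page 950. The function only exists on Windows builds, so the wrapper (the hypothetical name `windows_cp950_decode`) returns `None` elsewhere. The result on Windows depends on the system's own cp950 tables, so no output is claimed here.

```python
import codecs


def windows_cp950_decode(data):
    """Decode bytes with the OS code page 950, or return None off Windows.

    Hypothetical wrapper for illustration; codecs.code_page_decode is only
    available on Windows builds of Python.
    """
    decode = getattr(codecs, "code_page_decode", None)
    if decode is None:
        return None
    # Arguments: code page, data, error handler, final flag.
    text, _consumed = decode(950, data, "strict", True)
    return text


if __name__ == "__main__":
    print(ascii(windows_cp950_decode(b"\x87@")))
```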

e89516b6-0232-436d-aee8-86acbe2ef142 commented 8 years ago

Update: the test script at bpo-28712 can be modified to show this issue too.

e89516b6-0232-436d-aee8-86acbe2ef142 commented 7 years ago

Windows cp950's EUDC<->PUA mapping is not specific to HKSCS.