python / cpython

The Python programming language
https://www.python.org

array.array of UCS2 values #59240

Open ronaldoussoren opened 12 years ago

ronaldoussoren commented 12 years ago
BPO 15035
Nosy @loewis, @ronaldoussoren, @ncoghlan, @tiran, @methane, @skrah

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:
```python
assignee = None
closed_at = None
created_at =
labels = ['extension-modules', 'type-bug']
title = 'array.array of UCS2 values'
updated_at =
user = 'https://github.com/ronaldoussoren'
```

bugs.python.org fields:
```python
activity =
actor = 'methane'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Extension Modules']
creation =
creator = 'ronaldoussoren'
dependencies = []
files = []
hgrepos = []
issue_num = 15035
keywords = []
message_count = 7.0
messages = ['162520', '162521', '162522', '168374', '168376', '168378', '168379']
nosy_count = 7.0
nosy_names = ['loewis', 'ronaldoussoren', 'ncoghlan', 'christian.heimes', 'Arfrever', 'methane', 'skrah']
pr_nums = []
priority = 'high'
resolution = None
stage = None
status = 'open'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue15035'
versions = ['Python 3.4']
```

ronaldoussoren commented 12 years ago

I sometimes use an array.array with format character "u" as a writable backing store for buffers shared with platform APIs that access buffers of UCS2 values. This works fine in Python 3.2 and earlier with a UCS2 build of Python, but no longer works with Python 3.3 because the "u" character explicitly selects a UCS4 representation in that version.

An example of how I use this, with PyObjC on Mac OS X:

b = array.array('u', "hello world")
s = CFStringCreateMutableWithExternalCharactersNoCopy(
        None, b, len(b), len(b), kCFAllocatorNull)

"s" now refers to a mutable Objective-C string that uses "b" as its backing store.

It would be nice if there were a format code that would allow me to do this with Python 3.3, for example b = array.array("U", ...)

(BTW. I'm sorry if this is a duplicate, searching for "array.array" on the tracker results in a lot of hits, most of which have nothing to do with the array module)

5531d0d8-2a9c-46ba-8b8b-ef76132a492c commented 12 years ago

See also bpo-13072 and the discussion starting at:

http://mail.python.org/pipermail/python-dev/2012-March/117390.html

I think the priority should be "high", since the current behavior doesn't preserve the status quo. Also, PEP-3118 suggests 'u' for UCS2 and 'w' for UCS4.

5531d0d8-2a9c-46ba-8b8b-ef76132a492c commented 12 years ago

Hmm, obviously the discussion starts here:

http://mail.python.org/pipermail/python-dev/2012-March/117376.html

5531d0d8-2a9c-46ba-8b8b-ef76132a492c commented 12 years ago

This one should be fixed by bpo-13072. Could you check again?

ncoghlan commented 12 years ago

As Stefan noted, so long as Py_UNICODE is 16 bits in the Mac OS X builds, then this should now be back to the 3.2 behaviour.

61337411-43fc-4a9c-b8d5-4060aede66d0 commented 12 years ago

It's not back to the 3.2 behavior. In 3.3, Py_UNICODE is always equal to wchar_t, which is a 4-byte type on Darwin. However, CFString is based on UniChar, which is a 2-byte type.
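The size mismatch described above can be checked on a given platform. This is a minimal sketch that assumes only the standard `ctypes` and `array` modules; the helper name `code_unit_sizes` is hypothetical and not part of any API discussed in this issue. `ctypes.c_wchar` mirrors the C `wchar_t` type, which is what Py_UNICODE equals in 3.3.

```python
import array
import ctypes


def code_unit_sizes():
    """Return the byte widths relevant to this issue on the current platform."""
    return {
        # Width of the C wchar_t type (platform dependent).
        "wchar_t": ctypes.sizeof(ctypes.c_wchar),
        # Width of one 'H' (unsigned short) array item, normally 2.
        "H": array.array("H").itemsize,
    }


sizes = code_unit_sizes()
# A buffer can back 2-byte UniChar storage only if its items are 2 bytes wide.
usable_for_unichar = {name: size == 2 for name, size in sizes.items()}
print(sizes, usable_for_unichar)
```

On a platform where `wchar_t` reports 4 bytes, a buffer of `wchar_t` items cannot be handed to an API that expects 2-byte units, while an 'H' array can.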

That this worked in 3.2 was by accident - it would work only in "narrow" builds. Python's configure in 3.2 and before wouldn't default to using wchar_t on Darwin since it didn't consider wchar_t "usable", which in turn happened because wchar_t is signed on Darwin, but Py_UNICODE was understood to be unsigned.

Since it's too late to add a 'U' code to 3.3, as a workaround you would have to use an 'H' array and initialize it with map(ord, the_string).
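The workaround can be sketched as follows. The function names (`ucs2_array_from_ord`, `utf16_array`, `array_to_text`) are hypothetical helpers introduced for illustration. The first is the `map(ord, ...)` approach as suggested; the second is an assumed variant, not proposed in this thread, that encodes through a native-endian UTF-16 codec so that characters outside the Basic Multilingual Plane are stored as surrogate pairs instead of overflowing the 16-bit items.

```python
import array
import sys

# UTF-16 codec matching the machine's byte order, without a BOM.
NATIVE_UTF16 = "utf-16-le" if sys.byteorder == "little" else "utf-16-be"


def ucs2_array_from_ord(text):
    """The suggested workaround: one 'H' item per code point (BMP only)."""
    return array.array("H", map(ord, text))


def utf16_array(text):
    """Assumed variant: store UTF-16 code units, including surrogate pairs."""
    buf = array.array("H")
    if buf.itemsize != 2:
        raise RuntimeError("'H' is not 2 bytes wide on this platform")
    buf.frombytes(text.encode(NATIVE_UTF16))
    return buf


def array_to_text(buf):
    """Decode an 'H' array of UTF-16 code units back into a str."""
    return buf.tobytes().decode(NATIVE_UTF16)


b = ucs2_array_from_ord("hello world")
print(len(b), b.itemsize, len(b.tobytes()))
```

Either array exposes a writable buffer of 2-byte items, which is the layout the original report needs for a UniChar backing store. Whether a particular PyObjC call accepts an 'H' array in place of a 'u' array is not confirmed here.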

Chances are good that a proper UCS-2 array code gets added to 3.4.

ronaldoussoren commented 12 years ago

Py_UNICODE is a typedef for wchar_t and that type is 4 bytes long:

>>> import array
>>> a = array.array('u', 'hello world')
>>> a.tobytes()
b'h\x00\x00\x00e\x00\x00\x00l\x00\x00\x00l\x00\x00\x00o\x00\x00\x00 \x00\x00\x00w\x00\x00\x00o\x00\x00\x00r\x00\x00\x00l\x00\x00\x00d\x00\x00\x00'
>>> a = array.array('u', 'bar')
>>> a.tobytes()
b'b\x00\x00\x00a\x00\x00\x00r\x00\x00\x00'
>>> len(a.tobytes())
12

This is with a checkout that was created yesterday.

The issue is not resolved: there is now no easy way to create a UCS2 buffer, while there was in earlier releases of Python (with the default narrow build).