Closed 7 years ago
Using Python 2.7.12
>>> u"\ud83d".encode("utf-16-le")
'=\xd8'
>>> u"\ud83d".encode("utf-16-le").decode("utf-16-le")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/encodings/utf_16_le.py", line 16, in decode
    return codecs.utf_16_le_decode(input, errors, True)
UnicodeDecodeError: 'utf16' codec can't decode bytes in position 0-1: unexpected end of data
>>>
The same problem occurs on 3.3.6, but it works on 3.4.5, so I guess this was fixed but not backported.
With the latest build, even encode will fail:
Python 3.6.0a4+ (default:dad4c42869f6, Sep 6 2016, 21:41:38)
[GCC 5.2.1 20151010] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> u"\ud83d".encode("utf-16-le")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-16-le' codec can't encode character '\ud83d' in position 0: surrogates not allowed
Probably Python 2's UTF-16 decoder should be as broken as the encoder, which will match the broken behavior of the UTF-8 and UTF-32 codecs:
>>> u'\ud83d\uda12'.encode('utf-8').decode('utf-8')
u'\ud83d\uda12'
>>> u'\ud83d\uda12'.encode('utf-32-le').decode('utf-32-le')
u'\ud83d\uda12'
Lone surrogate codes aren't valid Unicode. In Python 3 they get used internally for tricks like the "surrogateescape" error handler. In Python 3.4+, the 'surrogatepass' error handler allows encoding and decoding lone surrogates:
>>> u'\ud83d\uda12'.encode('utf-16le', 'surrogatepass')
b'=\xd8\x12\xda'
>>> _.decode('utf-16le', 'surrogatepass')
'\ud83d\uda12'
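As a rough illustration of the behavior described above, the following Python 3.4+ sketch round-trips a lone surrogate through each UTF codec with 'surrogatepass'. The helper name `roundtrip` is made up for this example and is not part of the standard library.

```python
# Python 3.4+ sketch. `roundtrip` is a hypothetical helper, not a stdlib function.
def roundtrip(text, encoding, errors="strict"):
    """Encode and then decode `text` with the same codec and error handler."""
    return text.encode(encoding, errors).decode(encoding, errors)


lone = "\ud83d"  # a high surrogate with no low surrogate after it

for codec in ("utf-8", "utf-16-le", "utf-32-le"):
    # With 'surrogatepass' the lone surrogate survives the round trip.
    assert roundtrip(lone, codec, "surrogatepass") == lone

# The UTF-16-LE bytes match what the Python 2.7 encoder produced in the report.
assert lone.encode("utf-16-le", "surrogatepass") == b"=\xd8"
```

Passing the handler explicitly to both `encode` and `decode` keeps the two directions symmetric, which is the property the Python 2.7 UTF-16 codec lacks.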
On Tue, Sep 6, 2016 at 3:43 PM, Xiang Zhang <report@bugs.python.org> wrote:
Xiang Zhang added the comment:
With the latest build, even encode will fail:
With Python 3 you have to use the "surrogatepass" error handler. I assumed this was the default in Python 2 since it worked with other codecs.
On Tue, Sep 6, 2016 at 4:10 PM, Eryk Sun <report@bugs.python.org> wrote:
Lone surrogate codes aren't valid Unicode. In Python 3 they get used internally for tricks like the "surrogateescape" error handler. In Python 3.4+, the 'surrogatepass' error handler allows encoding and decoding lone surrogates:
To add some context: I was writing tests for Windows paths containing surrogates (e.g. os.listdir can return them).
UTF codecs must not encode surrogate characters: http://unicodebook.readthedocs.io/issues.html#non-strict-utf-8-decoder-overlong-byte-sequences-and-surrogates
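For comparison, here is a small sketch of the strict behavior in Python 3.4+, where all three UTF encoders refuse a lone surrogate by default while a complete surrogate pair (one astral code point) encodes normally. The function name `rejects_lone_surrogate` is invented for this example.

```python
# Python 3.4+ sketch. `rejects_lone_surrogate` is a hypothetical helper.
def rejects_lone_surrogate(encoding):
    """Return True if the codec refuses a lone surrogate under the default 'strict' handler."""
    try:
        "\ud83d".encode(encoding)
    except UnicodeEncodeError:
        return True
    return False


results = {
    codec: rejects_lone_surrogate(codec)
    for codec in ("utf-8", "utf-16-le", "utf-32-le")
}
assert all(results.values())

# U+1F600 is encoded in UTF-16 as the valid pair D83D DE00, so no error is raised.
pair_bytes = "\U0001F600".encode("utf-16-le")
assert pair_bytes == b"=\xd8\x00\xde"
```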
Python 3 is right; sadly, it's too late to fix Python 2.
Victor, it seems the only option here (other than closing this as won't fix) is to modify the UTF-16 decoder in 2.7 to allow lone surrogates, which would be consistent with the UTF-8 and UTF-32 decoders. While it's too late to enforce strict compliance in 2.7, it shouldn't hurt to expand the domain of acceptable encodings. Then if surrogates are always passed in 2.7, a silently ignored "surrogatepass" handler could be added for compatibility with 3.x code.
I dislike the idea of changing the behaviour in a minor release :-/
Unless the 2.7 docs specify that the utf codecs should violate the standard with respect to lone surrogates, I think this should definitely be closed (as 'not a bug').
Considering the UTF-16 codec isn't self-consistent, it's a stretch to say it's not a bug. It's misbehavior, and it either will be or won't be fixed. From Victor's response it's looking like the latter.
Considering the UTF-16 codec isn't self-consistent, it's a stretch to say it's not a bug.
I didn't say that it's not a bug. I said that it's not possible to modify a codec at this point in Python 2.7 without taking the risk of breaking applications that rely on the current behaviour. Even in Python 3, we don't make such changes in minor releases, but only in major releases.
I wasn't trying to put words in your mouth, Victor. I was replying to Terry (msg275406).
Closing as wontfix if there are concerns regarding compatibility seems fine to me.
Thanks for looking into this.
I've also found a workaround for my use case in the meantime: https://github.com/lazka/senf/commit/b7dadb05a29db5f0d74f659971b0a86d5e579028
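For readers who want a feel for what such a workaround can look like, below is a minimal sketch of a lenient UTF-16-LE decoder that combines valid surrogate pairs and passes lone surrogates through unchanged. It is not the code from the linked commit; the function name `decode_utf16le_lenient` is an assumption for illustration, and the sketch is written for Python 3 (on 2.7 it would at least need `unichr` instead of `chr`).

```python
import struct


def decode_utf16le_lenient(data):
    """Decode UTF-16-LE bytes, keeping lone surrogates instead of raising.

    Hypothetical helper for illustration; written for Python 3.
    """
    if len(data) % 2:
        raise ValueError("UTF-16 data must have an even number of bytes")
    # Split the input into 16-bit little-endian code units.
    units = struct.unpack("<%dH" % (len(data) // 2), data)
    chars = []
    i = 0
    while i < len(units):
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            # Valid pair: combine into one code point above U+FFFF.
            code_point = 0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            chars.append(chr(code_point))
            i += 2
        else:
            # BMP character or lone surrogate: pass the code unit through.
            chars.append(chr(unit))
            i += 1
    return "".join(chars)
```

On Python 3.4+ the result should agree with `data.decode("utf-16-le", "surrogatepass")` for even-length input, so there the built-in handler is the simpler choice.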
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
GitHub fields:
```python
assignee = None
closed_at =
created_at =
labels = ['expert-unicode']
title = "utf-16 decoding can't handle lone surrogates"
updated_at =
user = 'https://github.com/lazka'
```
bugs.python.org fields:
```python
activity =
actor = 'lazka'
assignee = 'none'
closed = True
closed_date =
closer = 'lazka'
components = ['Unicode']
creation =
creator = 'lazka'
dependencies = []
files = []
hgrepos = []
issue_num = 27971
keywords = []
message_count = 14.0
messages = ['274546', '274548', '274555', '274556', '274558', '274560', '274565', '274593', '274620', '275406', '275483', '275495', '275522', '275590']
nosy_count = 6.0
nosy_names = ['terry.reedy', 'vstinner', 'ezio.melotti', 'eryksun', 'xiang.zhang', 'lazka']
pr_nums = []
priority = 'normal'
resolution = 'wont fix'
stage = None
status = 'closed'
superseder = None
type = None
url = 'https://bugs.python.org/issue27971'
versions = ['Python 2.7']
```