utf-16 decoding can't handle lone surrogates

fc233844-b7d8-4db9-bc24-cd51bb8e9bda commented 8 years ago

BPO	27971
Nosy	@terryjreedy, @vstinner, @ezio-melotti, @eryksun, @zhangyangyu, @lazka

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:
```python
assignee = None
closed_at =
created_at =
labels = ['expert-unicode']
title = "utf-16 decoding can't handle lone surrogates"
updated_at =
user = 'https://github.com/lazka'
```

bugs.python.org fields:
```python
activity =
actor = 'lazka'
assignee = 'none'
closed = True
closed_date =
closer = 'lazka'
components = ['Unicode']
creation =
creator = 'lazka'
dependencies = []
files = []
hgrepos = []
issue_num = 27971
keywords = []
message_count = 14.0
messages = ['274546', '274548', '274555', '274556', '274558', '274560', '274565', '274593', '274620', '275406', '275483', '275495', '275522', '275590']
nosy_count = 6.0
nosy_names = ['terry.reedy', 'vstinner', 'ezio.melotti', 'eryksun', 'xiang.zhang', 'lazka']
pr_nums = []
priority = 'normal'
resolution = 'wont fix'
stage = None
status = 'closed'
superseder = None
type = None
url = 'https://bugs.python.org/issue27971'
versions = ['Python 2.7']
```

fc233844-b7d8-4db9-bc24-cd51bb8e9bda commented 8 years ago

Using Python 2.7.12

>>> u"\ud83d".encode("utf-16-le")
'=\xd8'
>>> u"\ud83d".encode("utf-16-le").decode("utf-16-le")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/encodings/utf_16_le.py", line 16, in decode
    return codecs.utf_16_le_decode(input, errors, True)
UnicodeDecodeError: 'utf16' codec can't decode bytes in position 0-1: unexpected end of data
>>>

fc233844-b7d8-4db9-bc24-cd51bb8e9bda commented 8 years ago

Same problem on 3.3.6. But works on 3.4.5. So I guess this was fixed but not backported.

zhangyangyu commented 8 years ago

With the latest build, even encode will fail:

Python 3.6.0a4+ (default:dad4c42869f6, Sep  6 2016, 21:41:38) 
[GCC 5.2.1 20151010] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> u"\ud83d".encode("utf-16-le")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-16-le' codec can't encode character '\ud83d' in position 0: surrogates not allowed

eryksun commented 8 years ago

Probably Python 2's UTF-16 decoder should be as broken as the encoder, which will match the broken behavior of the UTF-8 and UTF-32 codecs:

    >>> u'\ud83d\uda12'.encode('utf-8').decode('utf-8')
    u'\ud83d\uda12'
    >>> u'\ud83d\uda12'.encode('utf-32-le').decode('utf-32-le')
    u'\ud83d\uda12'

Lone surrogate codes aren't valid Unicode. In Python 3 they get used internally for tricks like the "surrogateescape" error handler. In Python 3.4+, the 'surrogatepass' error handler allows encoding and decoding lone surrogates:

    >>> u'\ud83d\uda12'.encode('utf-16le', 'surrogatepass')
    b'=\xd8\x12\xda'
    >>> _.decode('utf-16le', 'surrogatepass')
    '\ud83d\uda12'
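For readers unfamiliar with the "surrogateescape" trick mentioned above, here is a minimal Python 3 sketch of what that handler does. The helper name `roundtrip_undecodable` and the sample bytes are made up for illustration; only the `surrogateescape` error handler itself is part of Python.

```python
def roundtrip_undecodable(raw):
    """Decode bytes that are not valid UTF-8, then restore them unchanged.

    Hypothetical helper for illustration only.
    """
    # Each undecodable byte 0xNN is smuggled into the string as the lone
    # low surrogate U+DCNN instead of raising UnicodeDecodeError.
    text = raw.decode("utf-8", "surrogateescape")
    # Encoding with the same handler turns U+DCNN back into the byte 0xNN.
    restored = text.encode("utf-8", "surrogateescape")
    return text, restored


text, restored = roundtrip_undecodable(b"caf\xff")
print(ascii(text))      # the 0xff byte shows up as an escaped lone surrogate
print(restored)         # the original bytes come back unchanged
```

This is why lone surrogates can legitimately appear inside Python 3 strings even though they are not valid Unicode scalar values.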

fc233844-b7d8-4db9-bc24-cd51bb8e9bda commented 8 years ago

On Tue, Sep 6, 2016 at 3:43 PM, Xiang Zhang <report@bugs.python.org> wrote:

> Xiang Zhang added the comment:
>
> With the latest build, even encode will fail:

With Python 3 you have to use the "surrogatepass" error handler. I assumed this was the default in Python 2 since it worked with other codecs.
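A minimal Python 3 sketch of that difference, assuming Python 3.4 or later (where "surrogatepass" is supported by the UTF-16 codecs). The function names are hypothetical and only serve the illustration.

```python
def encode_lone_surrogate(text, errors="strict"):
    """Encode text as UTF-16-LE with the given error handler."""
    return text.encode("utf-16-le", errors)


def roundtrip_with_surrogatepass(text):
    """Encode and decode again, letting lone surrogates through."""
    raw = text.encode("utf-16-le", "surrogatepass")
    return raw, raw.decode("utf-16-le", "surrogatepass")


lone = "\ud83d"  # a high surrogate without its low partner

# The default "strict" handler rejects the lone surrogate.
try:
    encode_lone_surrogate(lone)
except UnicodeEncodeError as exc:
    print("strict encode failed:", exc.reason)

# "surrogatepass" must be given on both the encode and the decode side.
raw, back = roundtrip_with_surrogatepass(lone)
print(raw, back == lone)
```

Note that the handler has to be passed explicitly on every call; there is no way to make it the default for a codec.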

fc233844-b7d8-4db9-bc24-cd51bb8e9bda commented 8 years ago

On Tue, Sep 6, 2016 at 4:10 PM, Eryk Sun <report@bugs.python.org> wrote:

> Lone surrogate codes aren't valid Unicode. In Python 3 they get used internally for tricks like the "surrogateescape" error handler. In Python 3.4+, the 'surrogatepass' error handler allows encoding and decoding lone surrogates:

To add some context: I was writing tests for windows paths containing surrogates (e.g. os.listdir can return them)

vstinner commented 8 years ago

UTF codecs must not encode surrogate characters: http://unicodebook.readthedocs.io/issues.html#non-strict-utf-8-decoder-overlong-byte-sequences-and-surrogates

Python 3 is right, sadly it's too late to fix Python 2.

eryksun commented 8 years ago

Victor, it seems the only option here (other than closing this as won't fix) is to modify the UTF-16 decoder in 2.7 to allow lone surrogates, which would be consistent with the UTF-8 and UTF-32 decoders. While it's too late to enforce strict compliance in 2.7, it shouldn't hurt to expand the domain of acceptable encodings. Then if surrogates are always passed in 2.7, a silently ignored "surrogatepass" handler could be added for compatibility with 3.x code.
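To make concrete what "allowing lone surrogates" would mean for a UTF-16 decoder, here is a rough pure-Python sketch written for Python 3. It is not CPython's codec and not a proposed patch; the name `decode_utf16le_lenient` is hypothetical. It combines well-formed surrogate pairs and passes unpaired surrogate code units through unchanged.

```python
import struct


def decode_utf16le_lenient(data):
    """Decode UTF-16-LE bytes, keeping lone surrogates instead of failing.

    Illustration only: a hypothetical helper, not the behaviour of any
    built-in codec.
    """
    if len(data) % 2:
        raise ValueError("UTF-16 data must have an even number of bytes")
    # Split the input into 16-bit little-endian code units.
    units = struct.unpack("<%dH" % (len(data) // 2), data)
    chars = []
    i = 0
    while i < len(units):
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            # Well-formed pair: combine into one supplementary code point.
            code = 0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            chars.append(chr(code))
            i += 2
        else:
            # BMP character or unpaired surrogate: pass through as-is.
            chars.append(chr(unit))
            i += 1
    return "".join(chars)


print(ascii(decode_utf16le_lenient(b"=\xd8")))
```

On Python 3.4+ the same result is available directly via `data.decode("utf-16-le", "surrogatepass")`, so a helper like this would only matter where that handler is unavailable.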

vstinner commented 8 years ago

I dislike the idea of changing the behaviour in a minor release :-/

terryjreedy commented 8 years ago

Unless the 2.7 docs specify that the utf codecs should violate the standard with respect to lone surrogates, I think this should definitely be closed (as 'not a bug').

eryksun commented 8 years ago

Considering the UTF-16 codec isn't self-consistent, it's a stretch to say it's not a bug. It's misbehavior, and it either will be or won't be fixed. From Victor's response it's looking like the latter.

vstinner commented 8 years ago

> Considering the UTF-16 codec isn't self-consistent, it's a stretch to say it's not a bug.

I didn't say that it's not a bug. I said that it's not possible to modify a codec at this point in Python 2.7 without risking breaking applications that rely on the current behaviour. Even in Python 3, we don't make such changes in minor releases, only in major releases.

eryksun commented 8 years ago

I wasn't trying to put words in your mouth, Victor. I was replying to Terry (msg275406).

fc233844-b7d8-4db9-bc24-cd51bb8e9bda commented 8 years ago

Closing as wontfix if there are concerns regarding compatibility seems fine to me.

Thanks for looking into this.

I've also found a workaround for my use case in the meantime: https://github.com/lazka/senf/commit/b7dadb05a29db5f0d74f659971b0a86d5e579028

python / cpython

utf-16 decoding can't handle lone surrogates #72158