python / cpython

The Python programming language
https://www.python.org

Possibly wrong JSON encoding and decoding of surrogate pairs #94527

Open zommiommy opened 2 years ago

zommiommy commented 2 years ago

Bug report: Hi, I found a small, non-urgent bug, so feel free to ignore it.

import json
assert '\udb0a\udfdf' == json.loads(json.dumps('\udb0a\udfdf'))

results in:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AssertionError

Your environment: I managed to reproduce it in the manylinux2010 Docker image for versions 3.6, 3.7, 3.8, 3.9, and 3.10.

sudo docker run -it manylinux2010
zommiommy commented 2 years ago

I found this using:

import sys
import atheris

with atheris.instrument_imports():
  import json

def harness(data):
    fdp = atheris.FuzzedDataProvider(data)
    obj = fdp.ConsumeString(fdp.ConsumeIntInRange(0, 8))
    new = json.loads(json.dumps(obj))
    assert obj == new, "DIFFERENT!: {} {}".format(repr(obj), repr(new))

atheris.Setup(sys.argv, harness)
atheris.Fuzz()
python fuzz.py
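atheris is a third-party coverage-guided fuzzer; the same search can be sketched with the stdlib alone. Drawing code points from the surrogate block makes a high surrogate followed by a low surrogate likely, which is exactly the input that breaks the round trip (find_mismatch is a hypothetical helper, not part of any library):

```python
import json
import random

def find_mismatch(trials=1000, seed=0):
    """Return the first random string whose dumps/loads round trip differs."""
    rng = random.Random(seed)
    for _ in range(trials):
        # short strings drawn entirely from the surrogate block U+D800..U+DFFF
        s = ''.join(chr(rng.randrange(0xD800, 0xE000))
                    for _ in range(rng.randrange(9)))
        if json.loads(json.dumps(s)) != s:
            return s
    return None

bad = find_mismatch()
print(ascii(bad))  # a sequence containing a high+low pair the decoder joined
```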
zommiommy commented 2 years ago

Those are surrogate pairs, so I guess that json is doing the right thing here. Still, it's weird behavior.

jakejack13 commented 2 years ago

First-time contributor here. I started triaging this issue, and it seems that the following passes:

import json
assert '\udb0a\udfdf' == json.loads(json.dumps('\udb0a\udfdf', ensure_ascii=False))

According to the json module documentation, setting ensure_ascii to False lets json.dumps emit non-ASCII characters from the input as-is instead of escaping them. Because the surrogates are not escaped, the decoder in json.loads passes the string through unchanged rather than recombining what looks like a \uXXXX-escaped surrogate pair. I'm not well versed in Unicode encoding, so I'm unsure whether this is intended behavior, but I hope this helps.
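Side by side, as a small check of the explanation above:

```python
import json

s = '\udb0a\udfdf'

# Default ensure_ascii=True: both surrogates become \uXXXX escapes,
# and the decoder recombines the adjacent escapes into one code point.
assert json.loads(json.dumps(s)) == '\U000d2bdf'

# ensure_ascii=False: the lone surrogates are written into the JSON
# string literally, the decoder never sees any escapes, and the round
# trip preserves the original string.
assert json.loads(json.dumps(s, ensure_ascii=False)) == s
```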

zommiommy commented 2 years ago

Thank you! Yeah with ensure_ascii=False it works.

From my understanding, Python stores each string in the narrowest of three internal representations: Latin-1 (one byte per character), UCS-2, and UCS-4 (PEP 393). So '\udb0a\udfdf' is stored as two UCS-2 characters, because both code points are at most 0xFFFF, but after the round trip it comes back as '\U000d2bdf', which needs UCS-4.

Since '\U000d2bdf' is a valid code point while '\udb0a' and '\udfdf' are two meaningless lone surrogates, the library "correctly" combines them into the astral character. Proof:

>>> len('\U000d2bdf')
1
>>> '\U000d2bdf'.encode()
b'\xf3\x92\xaf\x9f'
>>> len('\udb0a\udfdf')
2
>>> '\udb0a\udfdf'.encode()
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 0-1: surrogates not allowed
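The pairing can also be checked with the stdlib UTF-16 codec: encoding the two surrogates with the surrogatepass error handler and decoding the bytes back yields the single astral character (just an illustrative check):

```python
# Encode the lone surrogates as raw UTF-16 code units (surrogatepass
# permits this), then decode as UTF-16, which joins valid pairs.
pair = '\udb0a\udfdf'
units = pair.encode('utf-16-le', 'surrogatepass')   # b'\x0a\xdb\xdf\xdf'
joined = units.decode('utf-16-le')
print(ascii(joined), len(joined))                   # '\U000d2bdf' 1
```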

The cause is this part of the py_scanstring function:

...
uni = _decode_uXXXX(s, end)
end += 5
if 0xd800 <= uni <= 0xdbff and s[end:end + 2] == '\\u':
    uni2 = _decode_uXXXX(s, end + 1)
    if 0xdc00 <= uni2 <= 0xdfff:
        uni = 0x10000 + (((uni - 0xd800) << 10) | (uni2 - 0xdc00))
        end += 6
...

Indeed:

>>> hex(0x10000 + (((0xdb0a - 0xd800) << 10) | (0xdfdf - 0xdc00)))
'0xd2bdf'
>>> json.loads('"\\udb0a\\udfdf"')
'\U000d2bdf'
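The combination step and its inverse can be isolated into two small helpers (hypothetical names, not part of the json module) to see that the arithmetic round-trips:

```python
def combine_surrogates(hi, lo):
    # same formula as in py_scanstring above
    assert 0xD800 <= hi <= 0xDBFF and 0xDC00 <= lo <= 0xDFFF
    return 0x10000 + (((hi - 0xD800) << 10) | (lo - 0xDC00))

def split_surrogates(cp):
    # the inverse: recover the two UTF-16 code units of an astral code point
    assert 0x10000 <= cp <= 0x10FFFF
    cp -= 0x10000
    return 0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)

assert combine_surrogates(0xDB0A, 0xDFDF) == 0xD2BDF
assert split_surrogates(0xD2BDF) == (0xDB0A, 0xDFDF)
```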

In this case, it's not clear to me what the correct behavior should be.