python / cpython

The Python programming language
https://www.python.org

Possibly wrong JSON encoding and decoding of surrogate pairs #94527

Open zommiommy opened 2 years ago

zommiommy commented 2 years ago

Bug report: Hi, I found a small, non-urgent bug, so feel free to ignore it.

import json
assert '\udb0a\udfdf' == json.loads(json.dumps('\udb0a\udfdf'))

results in:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AssertionError

Your environment: I managed to reproduce it in the manylinux2010 Docker image for versions 3.6, 3.7, 3.8, 3.9, and 3.10.

sudo docker run -it manylinux2010
zommiommy commented 2 years ago

I found this using:

import sys
import atheris

with atheris.instrument_imports():
  import json

def harness(data):
    fdp = atheris.FuzzedDataProvider(data)
    obj = fdp.ConsumeString(fdp.ConsumeIntInRange(0, 8))
    new = json.loads(json.dumps(obj))
    assert obj == new, "DIFFERENT!: {} {}".format(repr(obj), repr(new))

atheris.Setup(sys.argv, harness)
atheris.Fuzz()
python fuzz.py
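atheris is a third-party coverage-guided fuzzer; the same search can be sketched with the stdlib alone. Drawing code points from the surrogate block makes a high surrogate followed by a low surrogate likely, which is exactly the input that breaks the round trip (find_mismatch is a hypothetical helper, not part of any library):

```python
import json
import random

def find_mismatch(trials=1000, seed=0):
    """Return the first random string whose dumps/loads round trip differs."""
    rng = random.Random(seed)
    for _ in range(trials):
        # short strings drawn entirely from the surrogate block U+D800..U+DFFF
        s = ''.join(chr(rng.randrange(0xD800, 0xE000))
                    for _ in range(rng.randrange(9)))
        if json.loads(json.dumps(s)) != s:
            return s
    return None

bad = find_mismatch()
print(ascii(bad))  # a sequence containing a high+low pair the decoder joined
```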
zommiommy commented 2 years ago

Those are surrogate pairs, so I guess that json is doing the right thing here. Still, it's weird behavior.

jakejack13 commented 2 years ago

First-time contributor here. I started triaging this issue, and it seems that the following passes:

import json
assert '\udb0a\udfdf' == json.loads(json.dumps('\udb0a\udfdf', ensure_ascii=False))

According to the json module documentation, setting ensure_ascii to False lets json.dumps emit non-ASCII characters from the input as-is instead of escaping them. Because the surrogates are not escaped, the decoder in json.loads passes the string through unchanged rather than recombining what looks like a \uXXXX-escaped surrogate pair. I'm not well versed in Unicode encoding, so I'm unsure whether this is intended behavior, but I hope this helps.
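Side by side, as a small check of the explanation above:

```python
import json

s = '\udb0a\udfdf'

# Default ensure_ascii=True: both surrogates become \uXXXX escapes,
# and the decoder recombines the adjacent escapes into one code point.
assert json.loads(json.dumps(s)) == '\U000d2bdf'

# ensure_ascii=False: the lone surrogates are written into the JSON
# string literally, the decoder never sees any escapes, and the round
# trip preserves the original string.
assert json.loads(json.dumps(s, ensure_ascii=False)) == s
```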

zommiommy commented 2 years ago

Thank you! Yeah with ensure_ascii=False it works.

From my understanding, Python stores each string in the narrowest of three internal representations: Latin-1 (one byte per character), UCS-2, and UCS-4 (PEP 393). So '\udb0a\udfdf' is stored as two UCS-2 characters, because both code points are at most 0xFFFF, but after the round trip it comes back as '\U000d2bdf', which needs UCS-4.

Since '\U000d2bdf' is a valid code point while '\udb0a' and '\udfdf' are two meaningless lone surrogates, the library "correctly" combines them into the astral character. Proof:

>>> len('\U000d2bdf')
1
>>> '\U000d2bdf'.encode()
b'\xf3\x92\xaf\x9f'
>>> len('\udb0a\udfdf')
2
>>> '\udb0a\udfdf'.encode()
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 0-1: surrogates not allowed
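The pairing can also be checked with the stdlib UTF-16 codec: encoding the two surrogates with the surrogatepass error handler and decoding the bytes back yields the single astral character (just an illustrative check):

```python
# Encode the lone surrogates as raw UTF-16 code units (surrogatepass
# permits this), then decode as UTF-16, which joins valid pairs.
pair = '\udb0a\udfdf'
units = pair.encode('utf-16-le', 'surrogatepass')   # b'\x0a\xdb\xdf\xdf'
joined = units.decode('utf-16-le')
print(ascii(joined), len(joined))                   # '\U000d2bdf' 1
```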

The cause is this part of the py_scanstring function:

...
uni = _decode_uXXXX(s, end)
end += 5
if 0xd800 <= uni <= 0xdbff and s[end:end + 2] == '\\u':
    uni2 = _decode_uXXXX(s, end + 1)
    if 0xdc00 <= uni2 <= 0xdfff:
        uni = 0x10000 + (((uni - 0xd800) << 10) | (uni2 - 0xdc00))
        end += 6
...

Indeed:

>>> hex(0x10000 + (((0xdb0a - 0xd800) << 10) | (0xdfdf - 0xdc00)))
'0xd2bdf'
>>> json.loads('"\\udb0a\\udfdf"')
'\U000d2bdf'
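The combination step and its inverse can be isolated into two small helpers (hypothetical names, not part of the json module) to see that the arithmetic round-trips:

```python
def combine_surrogates(hi, lo):
    # same formula as in py_scanstring above
    assert 0xD800 <= hi <= 0xDBFF and 0xDC00 <= lo <= 0xDFFF
    return 0x10000 + (((hi - 0xD800) << 10) | (lo - 0xDC00))

def split_surrogates(cp):
    # the inverse: recover the two UTF-16 code units of an astral code point
    assert 0x10000 <= cp <= 0x10FFFF
    cp -= 0x10000
    return 0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)

assert combine_surrogates(0xDB0A, 0xDFDF) == 0xD2BDF
assert split_surrogates(0xD2BDF) == (0xDB0A, 0xDFDF)
```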

In this case, it's not clear to me what the correct behavior should be.