Open zommiommy opened 2 years ago
I found this using:
import sys
import atheris

with atheris.instrument_imports():
    import json

def harness(data):
    fdp = atheris.FuzzedDataProvider(data)
    obj = fdp.ConsumeString(fdp.ConsumeIntInRange(0, 8))
    new = json.loads(json.dumps(obj))
    assert obj == new, "DIFFERENT!: {} {}".format(repr(obj), repr(new))

atheris.Setup(sys.argv, harness)
atheris.Fuzz()
python fuzz.py
Those are surrogate pairs, so I guess that json is doing the right thing here. Anyway, it's weird behavior.
First time contributor here. I started triaging this issue and it seems like the following passes.
import json
assert '\udb0a\udfdf' == json.loads(json.dumps('\udb0a\udfdf', ensure_ascii=False))
According to the json module documentation, setting ensure_ascii to False lets the string returned by json.dumps contain non-ASCII characters from the input object as-is instead of escaping them. Apparently, because the characters are not escaped, the decoder in json.loads treats the string as is instead of processing it as an actual Unicode surrogate pair. I'm not well versed in Unicode encoding, so I am unsure whether this is intended behavior, but I hope this helps.
Thank you! Yeah, with ensure_ascii=False it works.
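To make the difference concrete, here is a minimal sketch comparing the two settings on the string from this thread. It only uses json.dumps and json.loads; the variable names are illustrative.

```python
import json

s = '\udb0a\udfdf'  # the two surrogate code points from the failing input

# Default (ensure_ascii=True): each code point is written as a \uXXXX
# escape, and the decoder later combines the escaped pair.
escaped = json.dumps(s)
print(escaped)                    # "\udb0a\udfdf"
print(json.loads(escaped) == s)   # False

# ensure_ascii=False: the surrogates are emitted verbatim, so the decoder
# sees no escape sequences to combine. (The raw text is not printed here
# because lone surrogates cannot be encoded as UTF-8.)
raw = json.dumps(s, ensure_ascii=False)
print(json.loads(raw) == s)       # True
```

So the round trip only changes the string when the pair passes through the escaped form.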
From my understanding, Python stores each string using whichever of UCS-1, UCS-2, and UCS-4 uses the least memory.
So '\udb0a\udfdf' is stored as two UCS-2 chars because both are under 0xFFFF, but they are decoded as '\U000d2bdf', which is stored as UCS-4.
Since '\U000d2bdf' is a valid UCS-4 char while '\udb0a\udfdf' are two meaningless UCS-2 chars, the library "correctly" guesses that it's a UCS-4 string.
Proof:
> len('\U000d2bdf')
1
> '\U000d2bdf'.encode()
b'\xf3\x92\xaf\x9f'
> len('\udb0a\udfdf')
2
> '\udb0a\udfdf'.encode()
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 0-1: surrogates not allowed
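As a side check of the relationship between the two spellings, the sketch below routes the two-character string through UTF-16 with the standard surrogatepass error handler. This is only an illustration of how the pair and the single code point relate, not something the json module itself does.

```python
pair = '\udb0a\udfdf'    # two surrogate code points
single = '\U000d2bdf'    # one code point outside the BMP

# 'surrogatepass' lets the codec write the surrogates as raw UTF-16 code
# units; decoding those units as UTF-16 then reads them as a valid pair.
units = pair.encode('utf-16-le', 'surrogatepass')
decoded = units.decode('utf-16-le')

print(units == single.encode('utf-16-le'))  # True
print(decoded == single)                    # True
print(len(pair), len(decoded))              # 2 1
```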
The cause is this part of the py_scanstring function:
...
uni = _decode_uXXXX(s, end)
end += 5
if 0xd800 <= uni <= 0xdbff and s[end:end + 2] == '\\u':
    uni2 = _decode_uXXXX(s, end + 1)
    if 0xdc00 <= uni2 <= 0xdfff:
        uni = 0x10000 + (((uni - 0xd800) << 10) | (uni2 - 0xdc00))
        end += 6
...
Indeed:
> hex(0x10000 + (((0xdb0a - 0xd800) << 10) | (0xdfdf - 0xdc00)))
'0xd2bdf'
> json.loads('"\\udb0a\\udfdf"')
'\U000d2bdf'
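The arithmetic can be isolated into a pair of small functions. This is a sketch of the standard UTF-16 surrogate rule that the snippet applies; combine_surrogates and split_code_point are hypothetical names, not part of the json module.

```python
def combine_surrogates(high, low):
    """Return the code point encoded by a high/low surrogate pair."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a high/low surrogate pair")
    # The high surrogate carries the top 10 bits, the low one the bottom 10.
    return 0x10000 + (((high - 0xD800) << 10) | (low - 0xDC00))


def split_code_point(cp):
    """Inverse of combine_surrogates: split a code point above U+FFFF."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    cp -= 0x10000
    return 0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)


print(hex(combine_surrogates(0xDB0A, 0xDFDF)))       # 0xd2bdf
print([hex(u) for u in split_code_point(0xD2BDF)])   # ['0xdb0a', '0xdfdf']
```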
In this case, it's not clear to me what the correct behavior is.
Converting the two UCS-2 chars '\udb0a\udfdf' to the meaningful single UCS-4 char '\U000d2bdf' can be a way to help the user. However, '\U000d2bdf' and '\udb0a\udfdf' have different meanings, so I don't know if it's "fair" to change what the user wrote, mainly because I can't find a reason why the user would use the UCS-2 encoding if they meant the UCS-4 one.
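Whichever behavior is considered correct, a caller who depends on exact round trips can at least detect the affected inputs up front. The sketch below assumes that, with default settings, the only strings that change are those containing a high surrogate immediately followed by a low one; has_surrogate_pair and roundtrips are hypothetical helpers, not json APIs.

```python
import json


def has_surrogate_pair(s):
    """True if s contains a high surrogate directly followed by a low one."""
    return any(
        0xD800 <= ord(a) <= 0xDBFF and 0xDC00 <= ord(b) <= 0xDFFF
        for a, b in zip(s, s[1:])
    )


def roundtrips(s):
    """True if s survives json.dumps followed by json.loads unchanged."""
    return json.loads(json.dumps(s)) == s


# A few inputs: the pair from this thread, a lone high surrogate,
# a low/high pair in the wrong order, and ordinary text.
for text in ['\udb0a\udfdf', '\udb0a', '\udfdf\udb0a', 'plain', '\U000d2bdf']:
    # Under the assumption above, the two checks are exact opposites.
    assert has_surrogate_pair(text) != roundtrips(text)
```

A fuzz harness like the one in this report could use such a check to skip, or separately assert on, inputs that contain an adjacent pair.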
Bug report
Hi, I found a small non-urgent bug, so feel free to ignore it.
results in:
Your environment
I managed to reproduce it in your manylinux2010 Docker image for versions 3.6, 3.7, 3.8, 3.9, and 3.10.