ijl / orjson

Fast, correct Python JSON library supporting dataclasses, datetimes, and numpy
Apache License 2.0

Newlines in strings cause unaligned writes #498

Closed matoro closed 4 months ago

matoro commented 4 months ago

Hi, I've been attempting to run the orjson test suite on hardware which does not support unaligned accesses, and discovered that attempting to serialize any string value containing a newline triggers an unaligned write. This is undefined behavior in C/C++. My understanding of what it means for Rust is a little fuzzier, but I believe that since the write takes place in unsafe code, it is also undefined behavior.

My testing was done against the latest tag at the time, 3.10.4, with Python 3.12.4, compiled with Rust 1.77.1, on kernel 6.9/glibc 2.39. I've minimized the reproducer to the following:

$ PYTHONPATH=../orjson-3.10.4-python3_12/install/usr/lib/python3.12/site-packages python3.12 -c 'import orjson ; orjson.dumps("\n")'
Bus error (core dumped)

Here is the stack trace:

#0  core::ptr::write<u64> (dst=0x10000306731, src=6660260898927542274) at /rustc/7cf61ebde7b22796c69757901dd346d0fe70bd97/library/core/src/ptr/mod.rs:1415
#1  0xfff800010186507c in orjson::serialize::writer::json::format_escaped_str<&mut orjson::serialize::writer::byteswriter::BytesWriter> (writer=0x7feffd827b8, value=...) at src/serialize/writer/json.rs:633
#2  orjson::serialize::writer::json::{impl#3}::serialize_str<&mut orjson::serialize::writer::byteswriter::BytesWriter, orjson::serialize::writer::formatter::CompactFormatter> (self=0x7feffd827b8, value=...) at src/serialize/writer/json.rs:154
#3  orjson::serialize::per_type::unicode::{impl#1}::serialize<&mut orjson::serialize::writer::json::Serializer<&mut orjson::serialize::writer::byteswriter::BytesWriter, orjson::serialize::writer::formatter::CompactFormatter>> (self=0x7feffd82380, serializer=0x7feffd827b8) at src/serialize/per_type/unicode.rs:29
#4  0xfff800010186830c in orjson::serialize::serializer::{impl#1}::serialize<&mut orjson::serialize::writer::json::Serializer<&mut orjson::serialize::writer::byteswriter::BytesWriter, orjson::serialize::writer::formatter::CompactFormatter>> (self=0x7feffd828e8, serializer=0x7feffd827b8) at src/serialize/serializer.rs:69
#5  0xfff80001018778ac in orjson::serialize::writer::json::to_writer<&mut orjson::serialize::writer::byteswriter::BytesWriter, orjson::serialize::serializer::PyObjectSerializer> (writer=0x7feffd828d0, value=0x7feffd828e8) at src/serialize/writer/json.rs:649
#6  0xfff8000101866974 in orjson::serialize::serializer::serialize (ptr=0xfff8000100e07b88 <_PyRuntime+62328>, default=..., opts=0) at src/serialize/serializer.rs:25
#7  0xfff800010187d7b4 in orjson::dumps (_self=0x0, args=0xfff800010002a078, nargs=1, kwnames=0x0) at src/lib.rs:382
#8  0xfff8000100684f2c in cfunction_vectorcall_FASTCALL_KEYWORDS (func=<built-in function dumps>, args=0xfff800010002a078, nargsf=9223372036854775809, kwnames=0x0) at Objects/methodobject.c:438
#9  0xfff80001005cb668 in _PyObject_VectorcallTstate (tstate=0xfff8000100e68b58 <_PyRuntime+459592>, callable=<built-in function dumps>, args=0xfff800010002a078, nargsf=9223372036854775809, kwnames=0x0) at ./Include/internal/pycore_call.h:92
#10 0xfff80001005ccda8 in PyObject_Vectorcall (callable=<built-in function dumps>, args=0xfff800010002a078, nargsf=9223372036854775809, kwnames=0x0) at Objects/call.c:325
#11 0xfff80001008446f4 in _PyEval_EvalFrameDefault (tstate=0xfff8000100e68b58 <_PyRuntime+459592>, frame=0xfff800010002a020, throwflag=0) at Python/bytecodes.c:2706
#12 0xfff8000100810200 in _PyEval_EvalFrame (tstate=0xfff8000100e68b58 <_PyRuntime+459592>, frame=0xfff800010002a020, throwflag=0) at ./Include/internal/pycore_ceval.h:89
#13 0xfff800010085acb4 in _PyEval_Vector (tstate=0xfff8000100e68b58 <_PyRuntime+459592>, func=0xfff800010011a160, locals={'__name__': '__main__', '__doc__': None, '__package__': None, '__loader__': <type at remote 0x10000246f00>, '__spec__': None, '__annotations__': {}, '__builtins__': <module at remote 0xfff80001000d5940>, 'orjson': <module at remote 0xfff800010020e070>}, args=0x0, argcount=0, kwnames=0x0) at Python/ceval.c:1683
#14 0xfff8000100812820 in PyEval_EvalCode (co=<code at remote 0xfff800010007d020>, globals={'__name__': '__main__', '__doc__': None, '__package__': None, '__loader__': <type at remote 0x10000246f00>, '__spec__': None, '__annotations__': {}, '__builtins__': <module at remote 0xfff80001000d5940>, 'orjson': <module at remote 0xfff800010020e070>}, locals={'__name__': '__main__', '__doc__': None, '__package__': None, '__loader__': <type at remote 0x10000246f00>, '__spec__': None, '__annotations__': {}, '__builtins__': <module at remote 0xfff80001000d5940>, 'orjson': <module at remote 0xfff800010020e070>}) at Python/ceval.c:578
#15 0xfff800010092aff8 in run_eval_code_obj (tstate=0xfff8000100e68b58 <_PyRuntime+459592>, co=0xfff800010007d020, globals={'__name__': '__main__', '__doc__': None, '__package__': None, '__loader__': <type at remote 0x10000246f00>, '__spec__': None, '__annotations__': {}, '__builtins__': <module at remote 0xfff80001000d5940>, 'orjson': <module at remote 0xfff800010020e070>}, locals={'__name__': '__main__', '__doc__': None, '__package__': None, '__loader__': <type at remote 0x10000246f00>, '__spec__': None, '__annotations__': {}, '__builtins__': <module at remote 0xfff80001000d5940>, 'orjson': <module at remote 0xfff800010020e070>}) at Python/pythonrun.c:1722
#16 0xfff800010092b1a4 in run_mod (mod=0x1000029bc38, filename='<string>', globals={'__name__': '__main__', '__doc__': None, '__package__': None, '__loader__': <type at remote 0x10000246f00>, '__spec__': None, '__annotations__': {}, '__builtins__': <module at remote 0xfff80001000d5940>, 'orjson': <module at remote 0xfff800010020e070>}, locals={'__name__': '__main__', '__doc__': None, '__package__': None, '__loader__': <type at remote 0x10000246f00>, '__spec__': None, '__annotations__': {}, '__builtins__': <module at remote 0xfff80001000d5940>, 'orjson': <module at remote 0xfff800010020e070>}, flags=0x7feffd867e0, arena=0xfff80001000b5cb0) at Python/pythonrun.c:1743
#17 0xfff800010092ab2c in PyRun_StringFlags (str=0xfff80001000e88c0 "import orjson ; orjson.dumps(\"\\n\")\n", start=257, globals={'__name__': '__main__', '__doc__': None, '__package__': None, '__loader__': <type at remote 0x10000246f00>, '__spec__': None, '__annotations__': {}, '__builtins__': <module at remote 0xfff80001000d5940>, 'orjson': <module at remote 0xfff800010020e070>}, locals={'__name__': '__main__', '__doc__': None, '__package__': None, '__loader__': <type at remote 0x10000246f00>, '__spec__': None, '__annotations__': {}, '__builtins__': <module at remote 0xfff80001000d5940>, 'orjson': <module at remote 0xfff800010020e070>}, flags=0x7feffd867e0) at Python/pythonrun.c:1618
#18 0xfff8000100925df8 in PyRun_SimpleStringFlags (command=0xfff80001000e88c0 "import orjson ; orjson.dumps(\"\\n\")\n", flags=0x7feffd867e0) at Python/pythonrun.c:480
#19 0xfff800010097be04 in pymain_run_command (command=0x10000203960 L"import orjson ; orjson.dumps(\"\\n\")\n") at Modules/main.c:255
#20 0xfff800010097d8dc in pymain_run_python (exitcode=0x7feffd86a34) at Modules/main.c:620
#21 0xfff800010097dbf8 in Py_RunMain () at Modules/main.c:709
#22 0xfff800010097dd30 in pymain_main (args=0x7feffd86bf8) at Modules/main.c:739
#23 0xfff800010097de28 in Py_BytesMain (argc=3, argv=0x7feffd87138) at Modules/main.c:763
#24 0x0000010000000914 in main (argc=3, argv=0x7feffd87138) at ./Programs/python.c:15

And here's a trace with snippets and full locals in case it helps: gdb.txt

Normally, when reporting these in C/C++, I provide a UBSAN run: unaligned access is UB on all platforms, so UBSAN detects it even on hardware that tolerates it. Unfortunately, I don't know what the equivalent is in Rust. I offer free shell access to the hardware on which I reproduced this, available here. Alternatively, you may be able to reproduce it in QEMU sparc64. Let me know if there is any additional information I can provide.

ijl commented 4 months ago

Thanks for the report. I changed it to a normal memcpy/core::ptr::copy_nonoverlapping in 3.10.5.
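
For readers unfamiliar with the distinction, here is a minimal sketch of the kind of change described above. It is not the actual orjson patch: the function names `write_u64_bytes` and `write_u64_unaligned` and the buffer layout are hypothetical and only illustrate the technique. `core::ptr::write::<u64>` requires the destination to be 8-byte aligned (the `dst=0x10000306731` in frame #0 is an odd address), whereas a byte-wise `core::ptr::copy_nonoverlapping` or `core::ptr::write_unaligned` has no such requirement.

```rust
use core::ptr;

/// Hypothetical helper: copies the 8 bytes of `value` into
/// `buf[offset..offset + 8]` byte by byte (a memcpy), so the destination
/// address does not need to be 8-byte aligned.
fn write_u64_bytes(buf: &mut [u8], offset: usize, value: u64) {
    let src = value.to_ne_bytes();
    assert!(offset <= buf.len() && buf.len() - offset >= src.len());
    // SAFETY: the range was bounds-checked above, `src` is a local array that
    // cannot overlap `buf`, and `u8` has an alignment of 1.
    unsafe {
        ptr::copy_nonoverlapping(src.as_ptr(), buf.as_mut_ptr().add(offset), src.len());
    }
}

/// Hypothetical alternative with the same effect, using `write_unaligned`,
/// which is documented to accept a misaligned destination pointer.
fn write_u64_unaligned(buf: &mut [u8], offset: usize, value: u64) {
    assert!(offset <= buf.len() && buf.len() - offset >= 8);
    // SAFETY: the range was bounds-checked above; `write_unaligned` does not
    // require `dst` to be aligned for `u64`.
    unsafe {
        ptr::write_unaligned(buf.as_mut_ptr().add(offset) as *mut u64, value);
    }
}

fn main() {
    let value = 0x0102_0304_0506_0708u64;
    // Offsets 0 and 1 cannot both be 8-byte aligned, so at least one of
    // these iterations writes to a misaligned address.
    for offset in 0..2 {
        let mut a = [0u8; 16];
        let mut b = [0u8; 16];
        write_u64_bytes(&mut a, offset, value);
        write_u64_unaligned(&mut b, offset, value);
        assert_eq!(a, b);
        assert_eq!(&a[offset..offset + 8], &value.to_ne_bytes());
    }
    println!("both helpers wrote identical bytes");
}
```

Replacing either helper's body with a plain `ptr::write(dst as *mut u64, value)` would reintroduce the alignment requirement, which is what traps as a bus error on hardware such as sparc64. Running a test like this under Miri is one way to flag misaligned accesses on platforms where the hardware tolerates them.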