Closed lejmr closed 1 year ago
+1 We suffered from the same symptoms in our cluster (orjson 3.9.4) and downgrading to 3.9.2 solved this issue for us. All application freezes we analyzed were caused by orjson.loads.
Thanks for the report. If you have time to produce a test case that would help. It could modify or fork from integration/wsgi.py or integration/thread.
What environment is this?
3.9.3 but not 3.9.2 suggests f94318a, e9b745e.
Hi @ijl,
I'm in the same team as @adlaube. First of all, thank you for orjson! We saw a great speedup from using it, and it is now the main JSON implementation we use in our applications.
Before the downgrade to orjson 3.9.2 (orjson-3.9.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
sha256:275b5a18fd9ed60b2720543d3ddac170051c43d680e47d04ff5203d2c6d8ebf1
) we used orjson 3.9.4 (orjson-3.9.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
sha256:1e4b20164809b21966b63e063f894927bc85391e60d0a96fa0bb552090f1319c
). In both cases we run it on a CPython 3.11.3 runtime. Our application is a combination of Flask 2.3.2 + Cheroot 8.4.1, where HTTP requests and other async tasks are processed via Python multithreading and asyncio. We will try to provide a proper reproducer.
Besides the already-mentioned application freezes, we also saw sporadic segfaults that appear to come from orjson, but since our application also uses other C and Rust extensions, we can't confirm that with 100% certainty yet. In the freeze cases we can show that the application hung because the thread calling orjson never released the GIL; it was stuck somewhere inside orjson.
Can confirm: I experienced hanging on Python 3.11 and segfaults on Python 3.9. Downgrading to 3.9.2 as others mentioned resolved the issue.
3.9.5 removes the futex. I think going further requires a test exercising what you expect to work.
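Along the lines of the requested test case, here is a hedged sketch of a concurrent deserialization stress test; the thread count, payload shape, and iteration count are arbitrary choices of mine, not from this thread, and the sketch falls back to the stdlib json module so it runs even where orjson is absent:

```python
# Hedged sketch: hammer loads() from many threads at once to look for
# hangs or crashes. Not an official orjson test case.
import concurrent.futures

try:
    import orjson as jsonlib  # the library under discussion
except ImportError:
    import json as jsonlib    # stdlib fallback so the sketch always runs

payload = jsonlib.dumps({"key": ["value"] * 1000})

def hammer(_):
    # Deserialize the same payload repeatedly; any hang or segfault
    # under contention would show up here.
    for _ in range(1000):
        assert jsonlib.loads(payload)["key"][0] == "value"
    return True

with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(hammer, range(8)))

assert results == [True] * 8  # all workers completed
```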
Hi @ijl,
unfortunately I have not been able to reproduce this issue locally yet. Interestingly, the issue did not occur in our test landscape, which suggests that a consistently high load is required. I have rebuilt 3.9.4 with debug symbols to produce some traces from our core dumps [0].
For future issues it might be helpful to include a minimal set of symbols: debug settings
Let me know if there is any further useful information I could retrieve from the dumps for you. We will assess whether the issue is still present in production with 3.9.5 and give you feedback.
[0]
Case 0: hanging thread which holds the GIL
[Switching to thread 104 (Thread 0x7effcf7fe700 (LWP 118))]
#0 0x00007f01897c9876 in pool_malloc (size=8, ctx_ptr=<optimized out>) at include/yyjson/yyjson.c:1027
1027 include/yyjson/yyjson.c: No such file or directory.
#0 0x00007f01897c9876 in pool_malloc (size=8, ctx_ptr=<optimized out>) at include/yyjson/yyjson.c:1027
#1 pool_realloc (ctx_ptr=<optimized out>, ptr=0x7eff39072740, old_size=<optimized out>, size=8) at include/yyjson/yyjson.c:1115
#2 0x00007f01896e8700 in ?? ()
#3 0x00007f018e630e80 in ?? ()
#4 0x0000000000000000 in ?? ()
prev = <optimized out>
cur = 0x7effce7ff710
next = 0x7effce7fd050
Case 1: Segmentation fault
Sadly for us, the bug is still present. I wrote a minimal test, but it didn't trigger the issue... I will give it another shot.
I can reproduce this crash in the Zulip development environment by reloading the page a few dozen times. I’m seeing similar symptoms: various segfaults and 100% CPU freezes. The environment is Python 3.8.10 on Ubuntu 20.04.6 LTS x86-64.
The first bad commit seems to be e9b745ecd39e2d353f9815367ee66909eec507e6. Moreover, I can reproduce the crash by cherry-picking only the buffer refactoring changes in deserialize_yyjson, YYJSON_ALLOC, and yyjson_init onto 3.9.2 (here’s the exact code for that). Adding caae033ee1c1a0bb704643153f271a29d48fc28e doesn’t help.
Obviously Zulip is a pretty complicated application and far from an ideal minimal test case, but at least it’s open source and you can run it yourself; alternatively, I’m happy to try out any debugging suggestions you might have.
We ended up downgrading to 3.9.2 in Home Assistant since we kept getting reports of crashes. Sadly, I haven't been able to replicate it myself.
Running debug builds with warnings enabled, I’ve noticed that the crash is always preceded by a bunch of warnings like this:
/srv/zulip/zerver/lib/message.py:307: ResourceWarning: unclosed <socket.socket fd=86, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6, laddr=('127.0.0.1', 51262), raddr=('127.0.0.1', 5672)>
return orjson.loads(zlib.decompress(message_bytes))
ResourceWarning: Enable tracemalloc to get the object allocation traceback
/srv/zulip/zerver/lib/message.py:307: ResourceWarning: unclosed <socket.socket fd=85, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=0, laddr=('127.0.0.1', 53479), raddr=('127.0.0.1', 33606)>
return orjson.loads(zlib.decompress(message_bytes))
ResourceWarning: Enable tracemalloc to get the object allocation traceback
/srv/zulip/zerver/lib/message.py:307: ResourceWarning: unclosed <socket.socket fd=84, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=0, laddr=('127.0.0.1', 33606), raddr=('127.0.0.1', 53479)>
return orjson.loads(zlib.decompress(message_bytes))
ResourceWarning: Enable tracemalloc to get the object allocation traceback
These warnings are not caused by orjson, of course—I see these same warnings on other random lines of code during normal execution. But they do suggest that the crash may be triggered when the Python garbage collector happens to run from inside orjson’s code.
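The mechanism is easy to demonstrate in isolation: destructors of cyclic garbage run whenever the collector happens to fire, which in CPython can be in the middle of almost any allocation. A minimal, orjson-free illustration:

```python
# Minimal demonstration that __del__ of cyclic garbage runs from inside
# the garbage collector, i.e. at an arbitrary point in other code.
import gc

ran = []

class Noisy:
    def __del__(self):
        ran.append(True)  # arbitrary Python code executes here, mid-collection

a = Noisy()
a.self = a   # reference cycle: plain refcounting cannot reclaim it
del a        # the object survives until the cyclic collector runs

gc.collect() # destructor fires from inside the collector, just as it
             # could from inside an allocation made by an extension module
assert ran == [True]
```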
This makes me suspect that orjson might be playing fast and loose with reference counting? Just from looking around at the Py_DECREF calls, patterns like this where a reference is decremented before being used seem very suspicious:
The entire concept of py_decref_without_destroy is suspicious as well, since it breaks the invariant that live objects have a nonzero ob_refcnt.
I haven’t pinned this crash on any particular reference counting bug, but this genre of bug is the kind of thing that’s likely to produce rare crashes depending on when Python happens to invoke the garbage collector.
Those are CPython implementation details. PyObject_GetAttr() increfs when returning on an object that must already have a non-zero refcnt. insertdict() increfs, but the deserialized object was only created to populate the container. a52679acd34eef715c1932281170edb405dccca9 was a refcounting error.
3.9.6 was tagged and could be run against your test suite, possibly with debug-assertions = true, but I don't think anything relevant was changed.
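For reference, enabling debug assertions in an optimized build is a generic Cargo profile setting (this exact snippet is my sketch, not something specified in this thread):

```toml
# In Cargo.toml: keep release optimizations, but turn on
# debug assertions and debug symbols for better diagnostics.
[profile.release]
debug-assertions = true
debug = true
```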
Yeah, 3.9.6 still crashes. Here’s an interesting panic:
/srv/zulip/zerver/lib/message.py:307: ResourceWarning: unclosed <socket.socket fd=60, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=0, laddr=('127.0.0.1', 41958), raddr=('127.0.0.1', 54115)>
return orjson.loads(zlib.decompress(message_bytes))
ResourceWarning: Enable tracemalloc to get the object allocation traceback
/srv/zulip/zerver/lib/message.py:307: ResourceWarning: unclosed <socket.socket fd=61, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=0, laddr=('127.0.0.1', 54115), raddr=('127.0.0.1', 41958)>
return orjson.loads(zlib.decompress(message_bytes))
ResourceWarning: Enable tracemalloc to get the object allocation traceback
/srv/zulip/zerver/lib/message.py:307: ResourceWarning: unclosed <socket.socket fd=62, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6, laddr=('127.0.0.1', 50170), raddr=('127.0.0.1', 5672)>
return orjson.loads(zlib.decompress(message_bytes))
ResourceWarning: Enable tracemalloc to get the object allocation traceback
thread '<unnamed>' panicked at 'range start index 14 out of range for slice of length 13', /root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/encoding_rs-0.8.33/src/mem.rs:419:30
stack backtrace:
0: rust_begin_unwind
at /rustc/eb26296b556cef10fb713a38f3d16b9886080f26/library/std/src/panicking.rs:593:5
1: core::panicking::panic_fmt
at /rustc/eb26296b556cef10fb713a38f3d16b9886080f26/library/core/src/panicking.rs:67:14
2: core::slice::index::slice_start_index_len_fail_rt
at /rustc/eb26296b556cef10fb713a38f3d16b9886080f26/library/core/src/slice/index.rs:52:5
3: core::slice::index::slice_start_index_len_fail
at /rustc/eb26296b556cef10fb713a38f3d16b9886080f26/library/core/src/slice/index.rs:40:9
4: orjson::str::create::unicode_from_str
5: orjson::deserialize::yyjson::parse_yy_object
6: orjson::deserialize::yyjson::parse_yy_array
7: orjson::deserialize::yyjson::parse_yy_object
8: orjson::deserialize::yyjson::deserialize_yyjson
9: loads
10: cfunction_vectorcall_O
at ./build-debug/../Objects/methodobject.c:482:24
Here’s a question. What happens if orjson.loads allocates a Python object that triggers a garbage collection that invokes a destructor that calls orjson.loads again? (Or the destructor releases the GIL and a different thread calls orjson.loads?) That reentrancy will corrupt the YYJSON_ALLOC buffer, right?
import orjson

class C:
    def __del__(self):
        orjson.loads('"' + "a" * 10000 + '"')

c = C()
c.c = c
del c

orjson.loads("[" + "[]," * 1000 + "[]]")
$ python reentrant.py
Illegal instruction (core dumped)
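A pure-Python model of the suspected hazard (my sketch, not orjson's actual code): a parser that reuses one shared scratch buffer returns wrong data as soon as it re-enters itself, which is the benign analogue of the memory corruption above.

```python
# Model: `scratch` stands in for a shared, YYJSON_ALLOC-style scratch buffer.
scratch = []

def parse(depth=0):
    scratch.clear()        # assumes exclusive use of the buffer
    scratch.append(depth)  # write this call's intermediate state
    if depth == 0:
        parse(depth=1)     # models a destructor re-entering the parser
    return scratch[0]      # outer call reads data the inner call clobbered

result = parse()
assert result == 1  # the outer call wrote 0 but reads the inner call's 1
```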
Proposed fix:
This is fixed in 3.9.7. Thank you everyone for the logs, investigation, and fix.
Thank you ijl and andersk and everyone else! :-)
After upgrading from 3.9.2 to 3.9.4, I started to experience freezes in my application. After downgrading, I can confirm the same issue happens with 3.9.3 but not with 3.9.2!
My application uses threads to receive messages from multiple topics. When I receive two messages almost simultaneously (less than 8 ms apart), the application freezes on an orjson.loads call. By "freezes" I mean: the thread stuck in orjson.loads consumes 100% CPU and all other threads stop. Nothing happens at all; even prometheus_client stops providing metrics, i.e., the HTTP interface times out.
For troubleshooting, I used strace, which shows only these repeating lines:
full strace: strace.txt
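As a complement to strace (which only shows syscalls), the Python-level stacks of every thread can be captured from a seemingly frozen process with the stdlib faulthandler module. This is a generic diagnostic sketch, not a step anyone in this thread reported running:

```python
# Sketch: dump the Python stack of every thread. In a real deployment you
# would register this on a signal (faulthandler.register) and send that
# signal to the frozen process instead of calling it inline.
import faulthandler
import tempfile

with tempfile.TemporaryFile(mode="w+") as f:
    faulthandler.dump_traceback(file=f, all_threads=True)
    f.seek(0)
    dump = f.read()

# The dump lists each thread with its current Python frames.
assert "most recent call first" in dump
```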