Closed lejmr closed 1 year ago
+1 We suffered from the same symptoms in our cluster (orjson 3.9.4) and downgrading to 3.9.2 solved this issue for us. All application freezes we analyzed were caused by orjson.loads.
Thanks for the report. If you have time to produce a test case that would help. It could modify or fork from integration/wsgi.py or integration/thread.
What environment is this?
3.9.3 but not 3.9.2 suggests f94318a, e9b745e.
Hi @ijl,
I'm in the same team as @adlaube. First of all, thank you for orjson! We saw a great speedup from using it, and it is now the main JSON implementation we use in our applications.
Before the downgrade to orjson 3.9.2 (orjson-3.9.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
sha256:275b5a18fd9ed60b2720543d3ddac170051c43d680e47d04ff5203d2c6d8ebf1
) we used orjson 3.9.4 (orjson-3.9.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
sha256:1e4b20164809b21966b63e063f894927bc85391e60d0a96fa0bb552090f1319c
). In both cases we run it on a CPython 3.11.3 runtime. Our application is a combination of Flask 2.3.2 + Cheroot 8.4.1, where HTTP requests and other async tasks are processed via Python multithreading and asyncio. We will try to provide a proper reproducer.
Besides the already-mentioned application freezes, we also saw sporadic segfaults that appear to come from orjson, but since our application also uses other C and Rust extensions, we can't confirm that with 100% certainty yet. In the freeze cases we can show that the application hung because the thread calling orjson never released the GIL; it was stuck somewhere inside orjson.
Can confirm: I experienced hanging on Python 3.11 and segfaults on Python 3.9. Downgrading to 3.9.2 as others mentioned resolved the issue.
3.9.5 removes the futex. I think going further requires a test exercising what you expect to work.
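Along the lines of the requested test case, here is a hedged sketch of a concurrent deserialization stress test; the thread count, payload shape, and iteration count are arbitrary choices of mine, not from this thread, and the sketch falls back to the stdlib json module so it runs even where orjson is absent:

```python
# Hedged sketch: hammer loads() from many threads at once to look for
# hangs or crashes. Not an official orjson test case.
import concurrent.futures

try:
    import orjson as jsonlib  # the library under discussion
except ImportError:
    import json as jsonlib    # stdlib fallback so the sketch always runs

payload = jsonlib.dumps({"key": ["value"] * 1000})

def hammer(_):
    # Deserialize the same payload repeatedly; any hang or segfault
    # under contention would show up here.
    for _ in range(1000):
        assert jsonlib.loads(payload)["key"][0] == "value"
    return True

with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(hammer, range(8)))

assert results == [True] * 8  # all workers completed
```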
Hi @ijl,
unfortunately I have not been able to reproduce this issue locally yet. Interestingly, the issue did not occur in our test landscape, which suggests that a consistently high load is required. I have rebuilt 3.9.4 with debug symbols to produce some traces from our core dumps [0].
For future issues it might be helpful to include a minimal set of symbols: debug settings
Let me know if there is any further useful information I could retrieve from the dumps for you. We will assess whether the issue is still present in production with 3.9.5 and give you feedback.
[0]
Case 0: hanging thread which holds the GIL
[Switching to thread 104 (Thread 0x7effcf7fe700 (LWP 118))]
#0 0x00007f01897c9876 in pool_malloc (size=8, ctx_ptr=<optimized out>) at include/yyjson/yyjson.c:1027
1027 include/yyjson/yyjson.c: No such file or directory.
#0 0x00007f01897c9876 in pool_malloc (size=8, ctx_ptr=<optimized out>) at include/yyjson/yyjson.c:1027
#1 pool_realloc (ctx_ptr=<optimized out>, ptr=0x7eff39072740, old_size=<optimized out>, size=8) at include/yyjson/yyjson.c:1115
#2 0x00007f01896e8700 in ?? ()
#3 0x00007f018e630e80 in ?? ()
#4 0x0000000000000000 in ?? ()
prev = <optimized out>
cur = 0x7effce7ff710
next = 0x7effce7fd050
Case 1: Segmentation fault
Sadly for us, the bug is still present. I wrote a minimal test, but it didn't trigger the issue... I will give it another shot.
I can reproduce this crash in the Zulip development environment by reloading the page a few dozen times. I’m seeing similar symptoms: various segfaults and 100% CPU freezes. The environment is Python 3.8.10 on Ubuntu 20.04.6 LTS x86-64.
The first bad commit seems to be e9b745ecd39e2d353f9815367ee66909eec507e6. Moreover, I can reproduce the crash by cherry-picking only the buffer refactoring changes in deserialize_yyjson, YYJSON_ALLOC, and yyjson_init onto 3.9.2 (here’s the exact code for that). Adding caae033ee1c1a0bb704643153f271a29d48fc28e doesn’t help.
Obviously Zulip is a pretty complicated application and far from an ideal minimal test case, but at least it’s open source and you can run it yourself; alternatively, I’m happy to try out any debugging suggestions you might have.
We ended up downgrading to 3.9.2 in Home Assistant since we kept getting reports of crashes. Sadly, I haven't been able to replicate it myself.
Running debug builds with warnings enabled, I’ve noticed that the crash is always preceded by a bunch of warnings like this:
/srv/zulip/zerver/lib/message.py:307: ResourceWarning: unclosed <socket.socket fd=86, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6, laddr=('127.0.0.1', 51262), raddr=('127.0.0.1', 5672)>
return orjson.loads(zlib.decompress(message_bytes))
ResourceWarning: Enable tracemalloc to get the object allocation traceback
/srv/zulip/zerver/lib/message.py:307: ResourceWarning: unclosed <socket.socket fd=85, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=0, laddr=('127.0.0.1', 53479), raddr=('127.0.0.1', 33606)>
return orjson.loads(zlib.decompress(message_bytes))
ResourceWarning: Enable tracemalloc to get the object allocation traceback
/srv/zulip/zerver/lib/message.py:307: ResourceWarning: unclosed <socket.socket fd=84, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=0, laddr=('127.0.0.1', 33606), raddr=('127.0.0.1', 53479)>
return orjson.loads(zlib.decompress(message_bytes))
ResourceWarning: Enable tracemalloc to get the object allocation traceback
These warnings are not caused by orjson, of course—I see these same warnings on other random lines of code during normal execution. But they do suggest that the crash may be triggered when the Python garbage collector happens to run from inside orjson’s code.
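The mechanism is easy to demonstrate in isolation: destructors of cyclic garbage run whenever the collector happens to fire, which in CPython can be in the middle of almost any allocation. A minimal, orjson-free illustration:

```python
# Minimal demonstration that __del__ of cyclic garbage runs from inside
# the garbage collector, i.e. at an arbitrary point in other code.
import gc

ran = []

class Noisy:
    def __del__(self):
        ran.append(True)  # arbitrary Python code executes here, mid-collection

a = Noisy()
a.self = a   # reference cycle: plain refcounting cannot reclaim it
del a        # the object survives until the cyclic collector runs

gc.collect() # destructor fires from inside the collector, just as it
             # could from inside an allocation made by an extension module
assert ran == [True]
```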
This makes me suspect that orjson might be playing fast and loose with reference counting? Just from looking around at the Py_DECREF calls, patterns like this where a reference is decremented before being used seem very suspicious:
The entire concept of py_decref_without_destroy is suspicious as well, since it breaks the invariant that live objects have a nonzero ob_refcnt.
I haven’t pinned this crash on any particular reference counting bug, but this genre of bug is the kind of thing that’s likely to produce rare crashes depending on when Python happens to invoke the garbage collector.
Those are CPython implementation details. PyObject_GetAttr() increfs when returning on an object that must already have a non-zero refcnt. insertdict() increfs, but the deserialized object was only created to populate the container. a52679acd34eef715c1932281170edb405dccca9 was a refcounting error.
3.9.6 was tagged and could be run against your test suite, possibly with debug-assertions = true, but I don't think anything relevant was changed.
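For reference, enabling debug assertions in an optimized build is a generic Cargo profile setting (this exact snippet is my sketch, not something specified in this thread):

```toml
# In Cargo.toml: keep release optimizations, but turn on
# debug assertions and debug symbols for better diagnostics.
[profile.release]
debug-assertions = true
debug = true
```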
Yeah, 3.9.6 still crashes. Here’s an interesting panic:
/srv/zulip/zerver/lib/message.py:307: ResourceWarning: unclosed <socket.socket fd=60, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=0, laddr=('127.0.0.1', 41958), raddr=('127.0.0.1', 54115)>
return orjson.loads(zlib.decompress(message_bytes))
ResourceWarning: Enable tracemalloc to get the object allocation traceback
/srv/zulip/zerver/lib/message.py:307: ResourceWarning: unclosed <socket.socket fd=61, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=0, laddr=('127.0.0.1', 54115), raddr=('127.0.0.1', 41958)>
return orjson.loads(zlib.decompress(message_bytes))
ResourceWarning: Enable tracemalloc to get the object allocation traceback
/srv/zulip/zerver/lib/message.py:307: ResourceWarning: unclosed <socket.socket fd=62, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6, laddr=('127.0.0.1', 50170), raddr=('127.0.0.1', 5672)>
return orjson.loads(zlib.decompress(message_bytes))
ResourceWarning: Enable tracemalloc to get the object allocation traceback
thread '<unnamed>' panicked at 'range start index 14 out of range for slice of length 13', /root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/encoding_rs-0.8.33/src/mem.rs:419:30
stack backtrace:
0: rust_begin_unwind
at /rustc/eb26296b556cef10fb713a38f3d16b9886080f26/library/std/src/panicking.rs:593:5
1: core::panicking::panic_fmt
at /rustc/eb26296b556cef10fb713a38f3d16b9886080f26/library/core/src/panicking.rs:67:14
2: core::slice::index::slice_start_index_len_fail_rt
at /rustc/eb26296b556cef10fb713a38f3d16b9886080f26/library/core/src/slice/index.rs:52:5
3: core::slice::index::slice_start_index_len_fail
at /rustc/eb26296b556cef10fb713a38f3d16b9886080f26/library/core/src/slice/index.rs:40:9
4: orjson::str::create::unicode_from_str
5: orjson::deserialize::yyjson::parse_yy_object
6: orjson::deserialize::yyjson::parse_yy_array
7: orjson::deserialize::yyjson::parse_yy_object
8: orjson::deserialize::yyjson::deserialize_yyjson
9: loads
10: cfunction_vectorcall_O
at ./build-debug/../Objects/methodobject.c:482:24
Here’s a question. What happens if orjson.loads allocates a Python object that triggers a garbage collection that invokes a destructor that calls orjson.loads again? (Or the destructor releases the GIL and a different thread calls orjson.loads?) That reentrancy will corrupt the YYJSON_ALLOC buffer, right?
import orjson

class C:
    def __del__(self):
        orjson.loads('"' + "a" * 10000 + '"')

c = C()
c.c = c
del c

orjson.loads("[" + "[]," * 1000 + "[]]")
$ python reentrant.py
Illegal instruction (core dumped)
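A pure-Python model of the suspected hazard (my sketch, not orjson's actual code): a parser that reuses one shared scratch buffer returns wrong data as soon as it re-enters itself, which is the benign analogue of the memory corruption above.

```python
# Model: `scratch` stands in for a shared, YYJSON_ALLOC-style scratch buffer.
scratch = []

def parse(depth=0):
    scratch.clear()        # assumes exclusive use of the buffer
    scratch.append(depth)  # write this call's intermediate state
    if depth == 0:
        parse(depth=1)     # models a destructor re-entering the parser
    return scratch[0]      # outer call reads data the inner call clobbered

result = parse()
assert result == 1  # the outer call wrote 0 but reads the inner call's 1
```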
Proposed fix:
This is fixed in 3.9.7. Thank you everyone for the logs, investigation, and fix.
Thank you ijl and andersk and everyone else! :-)
After upgrading from 3.9.2 to 3.9.4, I started to experience freezes in my application. After downgrading, I can confirm the same issue happens with 3.9.3 but not with 3.9.2!
My application uses threads to receive messages from multiple topics. When I receive two messages almost simultaneously (less than 8 ms apart), the application freezes on an orjson.loads call. By "freezes" I mean: the thread stuck in orjson.loads consumes 100% CPU and all other threads stop. Nothing happens at all; even prometheus_client stops providing metrics, i.e., the HTTP interface times out.
For troubleshooting, I used strace, which shows only these repeating lines:
full strace: strace.txt
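As a complement to strace (which only shows syscalls), the Python-level stacks of every thread can be captured from a seemingly frozen process with the stdlib faulthandler module. This is a generic diagnostic sketch, not a step anyone in this thread reported running:

```python
# Sketch: dump the Python stack of every thread. In a real deployment you
# would register this on a signal (faulthandler.register) and send that
# signal to the frozen process instead of calling it inline.
import faulthandler
import tempfile

with tempfile.TemporaryFile(mode="w+") as f:
    faulthandler.dump_traceback(file=f, all_threads=True)
    f.seek(0)
    dump = f.read()

# The dump lists each thread with its current Python frames.
assert "most recent call first" in dump
```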