emmettbutler commented 3 months ago

Bug report

Bug description:

A segmentation fault sometimes occurs in an application using Datadog's ddtrace-run 2.9.0, CPython 3.9, and sqlalchemy on ARM64. Our (@sanchda and my) analysis suggests that this is the result of a bug in CPython 3.9, which may have been fixed in CPython 3.12. We are unable to replicate this issue directly because it happens sporadically on infrastructure operated by a Datadog customer.

One instance of the segfault for which we managed to capture a core dump looks like this:

Current thread 0x0000ffff9a4915a0 (most recent call first):,1719935856047
"File ""/usr/local/lib/python3.9/site-packages/ddtrace/_trace/provider.py"", line 126 in activate",1719935856047
"File ""/usr/local/lib/python3.9/site-packages/ddtrace/_trace/tracer.py"", line 807 in _start_span",1719935856047
"File ""/usr/local/lib/python3.9/site-packages/ddtrace/_trace/tracer.py"", line 898 in trace",1719935856047
"File ""/usr/local/lib/python3.9/site-packages/ddtrace/contrib/dbapi/__init__.py"", line 311 in _trace_method",1719935856047
"File ""/usr/local/lib/python3.9/site-packages/ddtrace/contrib/dbapi/__init__.py"", line 335 in rollback",1719935856047
"File ""/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/default.py"", line 683 in do_rollback",1719935856047
"File ""/usr/local/lib/python3.9/site-packages/sqlalchemy/pool/base.py"", line 1038 in _reset",1719935856047
"File ""/usr/local/lib/python3.9/site-packages/sqlalchemy/pool/base.py"", line 763 in _finalize_fairy",1719935856047
"File ""/usr/local/lib/python3.9/site-packages/sqlalchemy/pool/base.py"", line 1008 in _checkin",1719935856047
"File ""/usr/local/lib/python3.9/site-packages/sqlalchemy/pool/base.py"", line 1166 in close",1719935856047
"File ""/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/base.py"", line 1251 in close",1719935856047
"File ""/usr/local/lib/python3.9/site-packages/sqlalchemy/orm/session.py"", line 928 in close",1719935856047
"File ""/usr/local/lib/python3.9/site-packages/sqlalchemy/orm/session.py"", line 846 in commit",1719935856047
"File ""/usr/local/lib/python3.9/site-packages/sqlalchemy/orm/session.py"", line 1454 in commit",1719935856047
# application stack frames....

The leaf node of this Python stacktrace is this call to contextvars.ContextVar().set(), which operates on a contextvar instance internal to Datadog code that's initialized here.

The core dump also includes this native backtrace:

*** Segmentation fault,1719935856047
Backtrace:,1719935856047
/lib/aarch64-linux-gnu/libc.so.6(+0x80a10)[0xffff99fb0a10],1719935856048
/lib/aarch64-linux-gnu/libc.so.6(gsignal+0x1c)[0xffff99f6a76c],1719935856048
linux-vdso.so.1(__kernel_rt_sigreturn+0x0)[0xffff9a49d850],1719935856048
/usr/local/bin/../lib/libpython3.9.so.1.0(+0xd2538)[0xffff9a1b2538],1719935856048
/usr/local/bin/../lib/libpython3.9.so.1.0(+0xd2ce0)[0xffff9a1b2ce0],1719935856048
/usr/local/bin/../lib/libpython3.9.so.1.0(+0xd3414)[0xffff9a1b3414],1719935856048
/usr/local/bin/../lib/libpython3.9.so.1.0(+0xcd82c)[0xffff9a1ad82c],1719935856048
/usr/local/bin/../lib/libpython3.9.so.1.0(PyContextVar_Set+0x10c)[0xffff9a1add0c],1719935856048
/usr/local/bin/../lib/libpython3.9.so.1.0(+0x13a4c8)[0xffff9a21a4c8],1719935856048
/usr/local/bin/../lib/libpython3.9.so.1.0(_PyEval_EvalFrameDefault+0x548)[0xffff9a249e28],1719935856048
/usr/local/bin/../lib/libpython3.9.so.1.0(+0x168c24)[0xffff9a248c24],1719935856048
/usr/local/bin/../lib/libpython3.9.so.1.0(_PyFunction_Vectorcall+0xb0)[0xffff9a215bd0],1719935856048
/usr/local/bin/../lib/libpython3.9.so.1.0(_PyEval_EvalFrameDefault+0x548)[0xffff9a249e28],1719935856048
/usr/local/bin/../lib/libpython3.9.so.1.0(+0x168c24)[0xffff9a248c24],1719935856048
/usr/local/bin/../lib/libpython3.9.so.1.0(_PyFunction_Vectorcall+0xb0)[0xffff9a215bd0],1719935856048
/usr/local/bin/../lib/libpython3.9.so.1.0(+0x137df0)[0xffff9a217df0],1719935856048
/usr/local/bin/../lib/libpython3.9.so.1.0(_PyEval_EvalFrameDefault+0x1540)[0xffff9a24ae20],1719935856048
/usr/local/bin/../lib/libpython3.9.so.1.0(+0x168c24)[0xffff9a248c24],1719935856048
/usr/local/bin/../lib/libpython3.9.so.1.0(_PyFunction_Vectorcall+0xb0)[0xffff9a215bd0],1719935856048
/usr/local/bin/../lib/libpython3.9.so.1.0(+0x137df0)[0xffff9a217df0],1719935856048
/usr/local/bin/../lib/libpython3.9.so.1.0(_PyEval_EvalFrameDefault+0x1540)[0xffff9a24ae20],1719935856048
/usr/local/bin/../lib/libpython3.9.so.1.0(+0x168c24)[0xffff9a248c24],1719935856048
/usr/local/bin/../lib/libpython3.9.so.1.0(_PyFunction_Vectorcall+0xb0)[0xffff9a215bd0],1719935856048
/usr/local/bin/../lib/libpython3.9.so.1.0(+0x137eec)[0xffff9a217eec],1719935856048
/usr/local/bin/../lib/libpython3.9.so.1.0(_PyEval_EvalFrameDefault+0x1be8)[0xffff9a24b4c8],1719935856048
/usr/local/bin/../lib/libpython3.9.so.1.0(+0x168c24)[0xffff9a248c24],1719935856048
/usr/local/bin/../lib/libpython3.9.so.1.0(_PyFunction_Vectorcall+0xb0)[0xffff9a215bd0],1719935856049
/usr/local/bin/../lib/libpython3.9.so.1.0(+0x137df0)[0xffff9a217df0],1719935856049
/usr/local/bin/../lib/libpython3.9.so.1.0(_PyEval_EvalFrameDefault+0x3134)[0xffff9a24ca14],1719935856049
/usr/local/bin/../lib/libpython3.9.so.1.0(+0x135da0)[0xffff9a215da0],1719935856049
/usr/local/bin/../lib/libpython3.9.so.1.0(_PyEval_EvalFrameDefault+0x548)[0xffff9a249e28],1719935856049
/usr/local/bin/../lib/libpython3.9.so.1.0(+0x168c24)[0xffff9a248c24],1719935856049
/usr/local/bin/../lib/libpython3.9.so.1.0(_PyFunction_Vectorcall+0xb0)[0xffff9a215bd0],1719935856049
/usr/local/bin/../lib/libpython3.9.so.1.0(+0x137df0)[0xffff9a217df0],1719935856049
/usr/local/bin/../lib/libpython3.9.so.1.0(_PyEval_EvalFrameDefault+0x3134)[0xffff9a24ca14],1719935856049
/usr/local/bin/../lib/libpython3.9.so.1.0(+0x168c24)[0xffff9a248c24],1719935856049
/usr/local/bin/../lib/libpython3.9.so.1.0(_PyFunction_Vectorcall+0xb0)[0xffff9a215bd0],1719935856049
/usr/local/bin/../lib/libpython3.9.so.1.0(_PyEval_EvalFrameDefault+0x1540)[0xffff9a24ae20],1719935856049
/usr/local/bin/../lib/libpython3.9.so.1.0(+0x168c24)[0xffff9a248c24],1719935856049
/usr/local/bin/../lib/libpython3.9.so.1.0(_PyFunction_Vectorcall+0xb0)[0xffff9a215bd0],1719935856049
/usr/local/bin/../lib/libpython3.9.so.1.0(+0x137df0)[0xffff9a217df0],1719935856049
/usr/local/bin/../lib/libpython3.9.so.1.0(_PyEval_EvalFrameDefault+0x3134)[0xffff9a24ca14],1719935856049
/usr/local/bin/../lib/libpython3.9.so.1.0(+0x135da0)[0xffff9a215da0],1719935856049
/usr/local/bin/../lib/libpython3.9.so.1.0(+0x137df0)[0xffff9a217df0],1719935856049
/usr/local/bin/../lib/libpython3.9.so.1.0(_PyEval_EvalFrameDefault+0x3134)[0xffff9a24ca14],1719935856049
/usr/local/bin/../lib/libpython3.9.so.1.0(+0x135da0)[0xffff9a215da0],1719935856049
/usr/local/bin/../lib/libpython3.9.so.1.0(_PyEval_EvalFrameDefault+0x548)[0xffff9a249e28],1719935856049
/usr/local/bin/../lib/libpython3.9.so.1.0(+0x168c24)[0xffff9a248c24],1719935856049
/usr/local/bin/../lib/libpython3.9.so.1.0(_PyFunction_Vectorcall+0xb0)[0xffff9a215bd0],1719935856049
/usr/local/bin/../lib/libpython3.9.so.1.0(_PyEval_EvalFrameDefault+0x548)[0xffff9a249e28],1719935856049
/usr/local/bin/../lib/libpython3.9.so.1.0(+0x168c24)[0xffff9a248c24],1719935856049
/usr/local/bin/../lib/libpython3.9.so.1.0(_PyFunction_Vectorcall+0xb0)[0xffff9a215bd0],1719935856049
/usr/local/bin/../lib/libpython3.9.so.1.0(+0x137df0)[0xffff9a217df0],1719935856049
/usr/local/bin/../lib/libpython3.9.so.1.0(_PyEval_EvalFrameDefault+0x1540)[0xffff9a24ae20],1719935856049
/usr/local/bin/../lib/libpython3.9.so.1.0(+0x135da0)[0xffff9a215da0],1719935856049
/usr/local/bin/../lib/libpython3.9.so.1.0(_PyEval_EvalFrameDefault+0x548)[0xffff9a249e28],1719935856049
/usr/local/bin/../lib/libpython3.9.so.1.0(+0x135da0)[0xffff9a215da0],1719935856049
/usr/local/bin/../lib/libpython3.9.so.1.0(_PyEval_EvalFrameDefault+0x6d4)[0xffff9a249fb4],1719935856049
/usr/local/bin/../lib/libpython3.9.so.1.0(+0x135da0)[0xffff9a215da0],1719935856049
/usr/local/bin/../lib/libpython3.9.so.1.0(_PyEval_EvalFrameDefault+0x3134)[0xffff9a24ca14],1719935856049
/usr/local/bin/../lib/libpython3.9.so.1.0(+0x135da0)[0xffff9a215da0],1719935856049
/usr/local/bin/../lib/libpython3.9.so.1.0(_PyEval_EvalFrameDefault+0x548)[0xffff9a249e28],1719935856049
/usr/local/bin/../lib/libpython3.9.so.1.0(+0x135da0)[0xffff9a215da0],1719935856049
/usr/local/bin/../lib/libpython3.9.so.1.0(_PyEval_EvalFrameDefault+0x548)[0xffff9a249e28],1719935856049
/usr/local/bin/../lib/libpython3.9.so.1.0(+0x168c24)[0xffff9a248c24],1719935856049
/usr/local/bin/../lib/libpython3.9.so.1.0(_PyFunction_Vectorcall+0xb0)[0xffff9a215bd0],1719935856049
/usr/local/bin/../lib/libpython3.9.so.1.0(_PyEval_EvalFrameDefault+0x548)[0xffff9a249e28],1719935856049
/usr/local/bin/../lib/libpython3.9.so.1.0(+0x135da0)[0xffff9a215da0],1719935856049
/usr/local/bin/../lib/libpython3.9.so.1.0(_PyEval_EvalFrameDefault+0x548)[0xffff9a249e28],1719935856049
/usr/local/bin/../lib/libpython3.9.so.1.0(+0x135da0)[0xffff9a215da0],1719935856049
/usr/local/bin/../lib/libpython3.9.so.1.0(_PyEval_EvalFrameDefault+0x548)[0xffff9a249e28],1719935856049
/usr/local/bin/../lib/libpython3.9.so.1.0(+0x168c24)[0xffff9a248c24],1719935856049
/usr/local/bin/../lib/libpython3.9.so.1.0(_PyEval_EvalCodeWithName+0x64)[0xffff9a2aeb74],1719935856049
/usr/local/bin/../lib/libpython3.9.so.1.0(PyEval_EvalCodeEx+0x40)[0xffff9a2aeb00],1719935856049
/usr/local/bin/../lib/libpython3.9.so.1.0(PyEval_EvalCode+0x2c)[0xffff9a2aeaac],1719935856049
/usr/local/bin/../lib/libpython3.9.so.1.0(+0x1e08fc)[0xffff9a2c08fc],1719935856049
/usr/local/bin/../lib/libpython3.9.so.1.0(+0x1e0868)[0xffff9a2c0868],1719935856049
/usr/local/bin/../lib/libpython3.9.so.1.0(+0x1e0798)[0xffff9a2c0798],1719935856050
/usr/local/bin/../lib/libpython3.9.so.1.0(PyRun_SimpleFileExFlags+0x190)[0xffff9a2c03e4],1719935856050
/usr/local/bin/../lib/libpython3.9.so.1.0(Py_RunMain+0x3c4)[0xffff9a2c9624],1719935856050
/usr/local/bin/../lib/libpython3.9.so.1.0(Py_BytesMain+0x38)[0xffff9a2c9078],1719935856050
/lib/aarch64-linux-gnu/libc.so.6(+0x27780)[0xffff99f57780],1719935856050
/lib/aarch64-linux-gnu/libc.so.6(__libc_start_main+0x98)[0xffff99f57858],1719935856050
/usr/local/bin/python3(_start+0x30)[0xaaaaad2308b0],1719935856050

These lines in particular warrant scrutiny:

/usr/local/bin/../lib/libpython3.9.so.1.0(+0xd2538)[0xffff9a1b2538],1719935856048
/usr/local/bin/../lib/libpython3.9.so.1.0(+0xd2ce0)[0xffff9a1b2ce0],1719935856048
/usr/local/bin/../lib/libpython3.9.so.1.0(+0xd3414)[0xffff9a1b3414],1719935856048
/usr/local/bin/../lib/libpython3.9.so.1.0(+0xcd82c)[0xffff9a1ad82c],1719935856048
/usr/local/bin/../lib/libpython3.9.so.1.0(PyContextVar_Set+0x10c)[0xffff9a1add0c],1719935856048

We grabbed the libpython3.9.so.1.0 from Dockerhub's python:3.9.19-bookworm, (arm64) image, version sha256:37823bd8e7266bac0399e3b55e67cfe00297686f09fe5cf73411ffe1d75fd93e. This .so file is bitwise-identical to the artifact used in the production environment that the core dump came from.

We used objdump to symbolize the containing functions within libpython3.9.so.1.0 and found the following:

0xd2538 -> hamt_node_bitmap_clone
0xd2ce0 -> hamt_node_bitmap_assoc
0xd3414 -> _PyHamt_Assoc
0xcd82c -> contextvar_set
0x10c   -> PyContextVar_set

The leaf of this call stack is defined in Py3.9 here.

ghidra gives us the following decompilation of the hamt_node_bitmap_clone function:

void hamt_node_bitmap_clone(long param_1)

{
  long lVar1;
  long lVar2;
  long *plVar3;
  long lVar4;

  lVar1 = hamt_node_bitmap_new(*(undefined8 *)(param_1 + 0x10));
  if (lVar1 != 0) {
    lVar4 = *(long *)(param_1 + 0x10);
    for (lVar2 = 0; lVar2 < lVar4; lVar2 = lVar2 + 1) {
      plVar3 = *(long **)(param_1 + 0x20 + lVar2 * 8);                     // 0xd2530
      if (plVar3 != (long *)0x0) {
        *plVar3 = *plVar3 + 1;
      }
      *(long **)(lVar1 + 0x20 + lVar2 * 8) = plVar3;
    }
    *(undefined4 *)(lVar1 + 0x18) = *(undefined4 *)(param_1 + 0x18);
  }
  return;
}

objdump tells us that the specific fault site is 0xd2538, which ghidra tells us is on the line

plVar3 = *(long **)(param_1 + 0x20 + lVar2 * 8);

The source code counterpart of this decompiled line is here, a call to Py_XINCREF.

ghidra gives the failing instructions from this line as

001d2530 a2 78 61 f8     ldr        x2,[x5, x1, LSL #0x3]
001d2534 82 00 00 b4     cbz        x2,LAB_001d2544
001d2538 43 00 40 f9     ldr        x3,[x2]               # this is the site of the fault
001d253c 63 04 00 91     add        x3,x3,#0x1

ldr: load the given value into the register x2.
cbz: conditional jump if zero. This is looking at x2 as a value, not as a pointer. It’s probably part of Py_XINCREF where it checks non-null. Since the next instruction is executed, x2 is NOT null.
ldr: de-references x2 and puts it in x3. x2 is not NULL (we know that because the computation on the ldr line is an iteration) So, x2 is an invalid address
add: this is just the increment operation from the Py_XINCREF

In terms of the source code, this suggests that node->b_array[i] is an invalid nonzero pointer here.

Some speculative reasons why this could happen:

The given item was cleaned up and is no longer valid, although sometimes it is valid because it is freed but not unmapped This might explain why the problem occurs on arm but not x86, since the mapping behavior at the level of the allocator may just so happen to be different for system reasons
The size of the array is incorrect
The item is uninitialized

In this case, the "item" is a HAMT node underlying a ContextVar instance being set() during a sqlalchemy rollback operation.

Aside from this analysis, another reason we're opening this issue on CPython is that the Py_XINCREF line in question was changed since Python 3.9. #99317 replaced it with a call to Py_XNewRef here. While we haven't tested this against Python 3.12 (the first to include the change), the presence of this change suggests that a bug may have been fixed in newer versions that could be worth backporting. Even without a backport, confirmation that this is or was a real issue in CPython itself would be helpful.

CPython versions tested on:

3.9

Operating systems tested on:

Linux

ZeroIntensity commented 3 months ago

Py_XNewRef and Py_XINCREF are very similar, so I doubt that's the problem. Py_XINCREF simply increments the reference count on the object if the pointer was non-NULL, and Py_XNewRef does the exact same, but returns the passed pointer.

At a very speculative glance, this seems like it could be a problem related to an object getting freed, and then trying to increment its reference count (which would likely be a problem outside of CPython, not in ContextVars). Though, it's very difficult to tell without a reproducer. Could you try reproducing the segfault with PYTHONMALLOC=malloc enabled? That should give a little more insight into what's going on with the memory.

picnixz commented 3 months ago

One thing that is certain is that we will not backport anything that is not related to security bugs beyond 3.12. So, unless it's a security issue (which I doubt it is according to the report), the (apparent) fix won't be backported (I doubt the patch would have changed anything and I suspect it's something else.

cc @1st1 as the author of PEP 567

vstinner commented 3 months ago

Our (@sanchda and my) analysis suggests that this is the result of a bug in CPython 3.9, which may have been fixed in CPython 3.12.

If you cannot reproduce the issue on Python 3.12, I suggest to close the issue, since Python 3.9 no longer accept bugfixes: https://devguide.python.org/versions/ (as @picnixz wrote).

picnixz commented 3 months ago

(Closing as wontfix since it affects a security-only branch)

emmettbutler commented 2 months ago

The customer reports that setting PYTHONMALLOC=malloc has caused the issue to stop occurring.

ZeroIntensity commented 2 months ago

Well... that's a little bit odd. Things shouldn't segfault on pymalloc but work on malloc

sanchda commented 2 months ago

Things shouldn't segfault on pymalloc but work on malloc

I think the underlying defect is present regardless of the allocator, but the segmentation fault is more likely to be observed with malloc than pymalloc because the presentation of the issue is very sensitive to the precise behavior of the libc allocator.

I believe your original intuition:

object getting freed, and then trying to increment its reference count

holds in this case. Fundamentally speaking, the increment-after-free is an invalid operation, but it doesn't necessarily convert into a segfault unless the de-referenced address isn't mapped. With PYTHONMALLOC=malloc, we get small objects crammed into the malloc arenas, which is probably making it harder for the allocator to munmap() those arenas. There's wrong behavior happening all over the place, probably, but it's not turning into segfaults.

Anyway, I think all signs point to this being a refcount issue in the user code.

python / cpython

segmentation fault in PyContextVar_set under py3.9 ARM64 #122529

Bug report

Bug description:

CPython versions tested on:

Operating systems tested on: