Closed emmettbutler closed 3 months ago
Py_XNewRef
and Py_XINCREF
are very similar, so I doubt that's the problem. Py_XINCREF
simply increments the reference count on the object if the pointer was non-NULL
, and Py_XNewRef
does the exact same, but returns the passed pointer.
At a very speculative glance, this seems like it could be a problem related to an object getting freed, and then trying to increment its reference count (which would likely be a problem outside of CPython, not in ContextVars). Though, it's very difficult to tell without a reproducer. Could you try reproducing the segfault with PYTHONMALLOC=malloc
enabled? That should give a little more insight into what's going on with the memory.
One thing that is certain is that we will not backport anything that is not related to security bugs beyond 3.12. So, unless it's a security issue (which I doubt it is according to the report), the (apparent) fix won't be backported (I doubt the patch would have changed anything and I suspect it's something else.
cc @1st1 as the author of PEP 567
Our (@sanchda and my) analysis suggests that this is the result of a bug in CPython 3.9, which may have been fixed in CPython 3.12.
If you cannot reproduce the issue on Python 3.12, I suggest to close the issue, since Python 3.9 no longer accept bugfixes: https://devguide.python.org/versions/ (as @picnixz wrote).
(Closing as wontfix
since it affects a security-only branch)
The customer reports that setting PYTHONMALLOC=malloc
has caused the issue to stop occurring.
Well... that's a little bit odd. Things shouldn't segfault on pymalloc but work on malloc
Things shouldn't segfault on pymalloc but work on malloc
I think the underlying defect is present regardless of the allocator, but the segmentation fault is more likely to be observed with malloc
than pymalloc
because the presentation of the issue is very sensitive to the precise behavior of the libc allocator.
I believe your original intuition:
object getting freed, and then trying to increment its reference count
holds in this case. Fundamentally speaking, the increment-after-free is an invalid operation, but it doesn't necessarily convert into a segfault unless the de-referenced address isn't mapped. With PYTHONMALLOC=malloc
, we get small objects crammed into the malloc arenas, which is probably making it harder for the allocator to munmap()
those arenas. There's wrong behavior happening all over the place, probably, but it's not turning into segfaults.
Anyway, I think all signs point to this being a refcount issue in the user code.
Bug report
Bug description:
A segmentation fault sometimes occurs in an application using Datadog's
ddtrace-run
2.9.0, CPython 3.9, and sqlalchemy on ARM64. Our (@sanchda and my) analysis suggests that this is the result of a bug in CPython 3.9, which may have been fixed in CPython 3.12. We are unable to replicate this issue directly because it happens sporadically on infrastructure operated by a Datadog customer.One instance of the segfault for which we managed to capture a core dump looks like this:
The leaf node of this Python stacktrace is this call to
contextvars.ContextVar().set()
, which operates on a contextvar instance internal to Datadog code that's initialized here.The core dump also includes this native backtrace:
These lines in particular warrant scrutiny:
We grabbed the
libpython3.9.so.1.0
from Dockerhub'spython:3.9.19-bookworm, (arm64)
image, versionsha256:37823bd8e7266bac0399e3b55e67cfe00297686f09fe5cf73411ffe1d75fd93e
. This .so file is bitwise-identical to the artifact used in the production environment that the core dump came from.We used
objdump
to symbolize the containing functions withinlibpython3.9.so.1.0
and found the following:The leaf of this call stack is defined in Py3.9 here.
ghidra
gives us the following decompilation of thehamt_node_bitmap_clone
function:objdump
tells us that the specific fault site is0xd2538
, which ghidra tells us is on the lineThe source code counterpart of this decompiled line is here, a call to
Py_XINCREF
.ghidra
gives the failing instructions from this line asldr
: load the given value into the register x2.cbz
: conditional jump if zero. This is looking at x2 as a value, not as a pointer. It’s probably part ofPy_XINCREF
where it checks non-null. Since the next instruction is executed, x2 is NOT null.ldr
: de-references x2 and puts it in x3. x2 is not NULL (we know that because the computation on the ldr line is an iteration) So, x2 is an invalid addressadd
: this is just the increment operation from thePy_XINCREF
In terms of the source code, this suggests that
node->b_array[i]
is an invalid nonzero pointer here.Some speculative reasons why this could happen:
In this case, the "item" is a HAMT node underlying a
ContextVar
instance beingset()
during a sqlalchemy rollback operation.Aside from this analysis, another reason we're opening this issue on CPython is that the
Py_XINCREF
line in question was changed since Python 3.9. #99317 replaced it with a call toPy_XNewRef
here. While we haven't tested this against Python 3.12 (the first to include the change), the presence of this change suggests that a bug may have been fixed in newer versions that could be worth backporting. Even without a backport, confirmation that this is or was a real issue in CPython itself would be helpful.CPython versions tested on:
3.9
Operating systems tested on:
Linux