python / cpython

The Python programming language
https://www.python.org
Other
63.85k stars 30.56k forks source link

Failed assertion in `Python/legacy_tracing.c:431` on a free-threading build #127235

Open devdanzin opened 1 week ago

devdanzin commented 1 week ago

Crash report

What happened?

It's possible to cause an abort on !_PyMem_IsPtrFreed(tstate) while running with PYTHON_GIL=0 by calling the following code:

import threading

def errback(*args, **kw):
    raise ValueError('error')

for x in range(200):
    threading._start_joinable_thread(errback)
    try:
        threading.setprofile_all_threads("")
    except:
        pass

Abort message:

python: Python/legacy_tracing.c:431: is_tstate_valid: Assertion `!_PyMem_IsPtrFreed(tstate)' failed.
Aborted

Not sure this isn't an expected failure mode. Found using fusil by @vstinner.

CPython versions tested on:

3.13, 3.14, CPython main branch

Operating systems tested on:

Linux

Output from running 'python -VV' on the command line:

Python 3.14.0a2+ experimental free-threading build (heads/main-dirty:a13e94d84bf, Nov 23 2024, 07:16:19) [GCC 11.4.0]

ZeroIntensity commented 1 week ago

Does this occur without use of the private _start_joinable_thread?

colesbury commented 1 week ago

I'm pretty sure that this is not limited to free threading. It's not safe to access other PyThreadStates without holding the runtime lock because those threads may exit. I think that we release the lock here because of reentrancy/deadlock concerns, but it's still not safe.

The reproducers involving the GIL are slightly more complicated because PyThreadState_DeleteCurrent is called with the GIL, which prevents most of these crashes. However, PyThreadState_Delete is called without the GIL and can lead to the crash.

https://github.com/python/cpython/blob/3e7ce6e9aed8616c8ce53eaef4279402d3ee38ec/Python/ceval.c#L2402-L2421

I think we have various reports of the same underlying issue. This issue is a nice reproducer for free threading, though.

For example:

ZeroIntensity commented 5 days ago

Ok, I've pondered this issue for a day or so and I think that the simplest solution for this specific case is to just wait for the per-interpreter thread state lock (GH-127115) to get merged. HEAD_LOCK tends to be annoyingly reentrant, so by just switching to that, we should be able to fix any unsafe usage because it shouldn't be nearly as reentrancy prone. (FWIW, I'm somewhat in favor of making the runtime lock recursive with how much of a PITA it is to use, but whatever.)

The much bigger issue at hand here is that PyThreadState_Next and friends are public APIs, so they aren't able to HEAD_LOCK. Somehow, we need to lock inside of those functions rather than relying on the caller to lock the runtime. I don't think there's anything we can do about PyThreadState_Next, because even if we somehow got thread states in a thread-safe way internally, and then magically made it safe to use the thread state without worrying about concurrent modifications (which would be impossible, unless we did something silly like copying the thread state and returning that), there's nothing stopping the list of thread states from being changed in between calls, so it could possibly skip over entries or return duplicates. Our best shot is probably to just deprecate PyThreadState_Next and invent something new and safer for users.

Some other functions we need to worry about in terms of thread safety:

For the future, I think we should probably steer clear of any support for using a detached thread state. Perhaps we could invent some new API (PyThreadStateView?) that has more clear thread safety requirements and push that for users.

TL;DR we probably need to document and deprecate some non-thread-safe thread state APIs.