bruxisma opened this issue 3 months ago
Can you reproduce this behaviour on Linux? or MacOS? Or what happens if you sleep a bit before spawning the sub-interpreters? I'm not an expert in threads but I'd say that interpreters are spawning too fast for the OS to follow up =/ Can you check whether the same occurs with HEAD instead of 3.12.5? (maybe there are some bugfixes that are not yet available)
At a glance, this could possibly be related to a memory allocation failure or something similar.
auto globals = PyDict_New();
auto code = Py_CompileString(text.data(), __FILE__, Py_eval_input);
auto result = PyEval_EvalCode(code, globals, globals);
These lines need a NULL check to detect whether an error occurred; otherwise there will be a NULL pointer dereference in the call to Py_DecRef.
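For illustration, a minimal sketch of what those checks could look like in the MRE's thread body (assuming the `text` and `state` variables from the original example are in scope; the error handling shown is only a placeholder):

```cpp
// Sketch only: the original lines with the suggested NULL checks added.
auto globals = PyDict_New();
if (globals == nullptr) {                        // PyDict_New failed (e.g. MemoryError)
    PyErr_Print();
    Py_EndInterpreter(state);
    return;
}
auto code = Py_CompileString(text.data(), __FILE__, Py_eval_input);
if (code == nullptr) {                           // compile error: an exception is set
    PyErr_Print();
    Py_DecRef(globals);
    Py_EndInterpreter(state);
    return;
}
auto result = PyEval_EvalCode(code, globals, globals);
if (result == nullptr) {                         // execution raised: an exception is set
    PyErr_Print();
}
Py_DecRef(result);                               // Py_DecRef tolerates nullptr
Py_DecRef(code);
Py_DecRef(globals);
Py_EndInterpreter(state);
```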
> At a glance, this could possibly be related to a memory allocation failure or something similar.
>
> auto globals = PyDict_New(); auto code = Py_CompileString(text.data(), __FILE__, Py_eval_input); auto result = PyEval_EvalCode(code, globals, globals);
>
> These lines need a NULL check to detect whether an error occurred; otherwise there will be a NULL pointer dereference in the call to Py_DecRef.
Py_DecRef is the function version of Py_XDECREF. Though I can see the confusion between Py_DecRef and Py_DECREF. If there was a nullptr, we'd have a different set of errors, and the debug memory assertions wouldn't be complaining that all memory was not written with the FORBIDDENBYTE value.
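For anyone following along, the behavioural difference being discussed is roughly this (a summary, not the actual CPython definitions):

```cpp
// Behavioural summary only, not the real macro/function bodies.
Py_DECREF(op);   // macro: op must not be NULL; a NULL here is dereferenced and crashes
Py_XDECREF(op);  // macro: silently does nothing if op is NULL
Py_DecRef(op);   // function: documented as the equivalent of Py_XDECREF, so a NULL
                 // here would not explain the FORBIDDENBYTE assertion failures
```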
> Can you reproduce this behaviour on Linux? or MacOS?

I will work on getting the Linux/macOS behaviors checked when I can. My WSL2 environment is woefully out of date 😅. macOS might be more difficult, as I'm unaware of how the _d version of the Python.framework might be linked against.

> Or what happens if you sleep a bit before spawning the sub-interpreters?

Sleeping might fix it, but might also hide the issue. The goal here was just to run a bunch of interpreters, not sleep for the sake of working around possible synchronization issues.

> I'm not an expert in threads but I'd say that interpreters are spawning too fast for the OS to follow up =/

This is not how threads or operating systems work. There is a synchronization issue or an ABA problem with the debug memory assertions in the C API's internals.

> Can you check whether the same occurs with HEAD instead of 3.12.5? (maybe there are some bugfixes that are not yet available)

I will also work on giving this a go.
> Py_DecRef is the function version of Py_XDECREF

Oh! TIL. I still would suggest adding a NULL check -- an error indicator could be set, which would break subsequent calls to PyErr_Occurred.
The error is outside of the Python C API as one might usually interact with it. i.e., only on subinterpreter initialization (i.e., creating a PyThreadState) or on subinterpreter destruction (Py_EndInterpreter) does the error occur (typically the latter). PyErr_Occurred is not being called anywhere in the MRE, nor in the actual traceback for the failure.
Right, but it could cause memory corruption somewhere, and cause the error somewhere else down the line. Regardless, I wasn't able to reproduce this on Linux, so this might be Windows-specific.
@picnixz I can confirm the behavior still exists on Windows on HEAD. The same errors can occur.
> Right, but it could cause memory corruption somewhere, and cause the error somewhere else down the line. Regardless, I wasn't able to reproduce this on Linux, so this might be Windows-specific.

PyErr_Occurred only performs a read on the current thread state object. If an error is present when it shouldn't be, that still points to a memory corruption error occurring between the destruction and initialization of a PyThreadState.
Pleased to back up @ZeroIntensity and say that this appears to be a Windows-specific issue. I was unable to replicate this memory assertion error on Linux.
However, when executing this code under TSan, I got the following data race error that seems to indicate the same kind of memory corruption COULD be taking place.
WARNING: ThreadSanitizer: data race (pid=11259)
Write of size 8 at 0x7f24e35e16c0 by thread T10:
#0 qsort_r <null> (463-interpreters+0x9bc2e) (BuildId: 4e94833395fedac62b3c4caa6921ff713e25cff1)
#1 qsort <null> (463-interpreters+0x9bec7) (BuildId: 4e94833395fedac62b3c4caa6921ff713e25cff1)
#2 setup_confname_table /usr/src/python3.12-3.12.3-1ubuntu0.1/build-shared/../Modules/posixmodule.c:13543:5 (libpython3.12.so.1.0+0x3b6946) (BuildId: e2e0fbc52bf8cb7daf99595e16454e44217040f4)
#3 setup_confname_tables /usr/src/python3.12-3.12.3-1ubuntu0.1/build-shared/../Modules/posixmodule.c:13572:9 (libpython3.12.so.1.0+0x3b6946)
#4 posixmodule_exec /usr/src/python3.12-3.12.3-1ubuntu0.1/build-shared/../Modules/posixmodule.c:16905:9 (libpython3.12.so.1.0+0x3b6946)
...
Previous write of size 8 at 0x7f24e35e16c0 by thread T18:
#0 qsort_r <null> (463-interpreters+0x9bc2e) (BuildId: 4e94833395fedac62b3c4caa6921ff713e25cff1)
#1 qsort <null> (463-interpreters+0x9bec7) (BuildId: 4e94833395fedac62b3c4caa6921ff713e25cff1)
#2 setup_confname_table /usr/src/python3.12-3.12.3-1ubuntu0.1/build-shared/../Modules/posixmodule.c:13543:5 (libpython3.12.so.1.0+0x3b6946) (BuildId: e2e0fbc52bf8cb7daf99595e16454e44217040f4)
#3 setup_confname_tables /usr/src/python3.12-3.12.3-1ubuntu0.1/build-shared/../Modules/posixmodule.c:13572:9 (libpython3.12.so.1.0+0x3b6946)
#4 posixmodule_exec /usr/src/python3.12-3.12.3-1ubuntu0.1/build-shared/../Modules/posixmodule.c:16905:9 (libpython3.12.so.1.0+0x3b6946)
Location is global '??' at 0x7f24e2e85000 (libpython3.12.so.1.0+0x75c880)
This only happened 3 times, and around the 42nd thread-generation step on average.
> However, when executing this code under TSan, I got the following data race error that seems to indicate the same kind of memory corruption COULD be taking place.

A quick look suggests this TSan failure is a data race local to posixmodule.c: there appears to be no protection against multiple threads importing posixmodule simultaneously, and hence multiple threads trying to simultaneously qsort in-place one of the three static tables.

This module is used on Windows, but I'm not sure if any of the macros enabling those qsort calls (HAVE_FPATHCONF, HAVE_PATHCONF, HAVE_CONFSTR, or HAVE_SYSCONF) are enabled in the Windows build. My guess would be no...

I don't know if overlapping qsorts on the same array will actually corrupt memory, but I doubt there's any guarantee it won't. (qsort_r doesn't help; according to the manpage it makes the comparison function thread-safe, not the array being sorted.)

So probably a separate issue, and a more-focused/reliable MRE could be made for it.
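As a quick experiment (not a proposed CPython fix), the overlap could be removed by making the one-time sort thread-safe, which should also make the TSan report disappear. A sketch, with assumed/illustrative names:

```cpp
// Hypothetical guard around the one-time table sort done during module init;
// the function and flag names here are illustrative, not CPython's.
#include <cstdlib>
#include <mutex>

static std::once_flag confname_table_sorted;

static void sort_confname_table_once(void *table, std::size_t count,
                                     std::size_t elem_size,
                                     int (*cmp)(const void *, const void *)) {
    // Concurrent qsort calls on the same static array are the reported race;
    // running the sort exactly once removes the overlap entirely.
    std::call_once(confname_table_sorted, [&] {
        std::qsort(table, count, elem_size, cmp);
    });
}
```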
Come to think of it, I'm not even sure subinterpreters are thread-safe in the first place; you might need to acquire the GIL. Though, I'm not all too familiar with them.
cc @ericsnowcurrently for insight
I think the point of PEP-684 (in Python 3.12) is that you can have a per-subinterpreter GIL, as is being done in this test case. posixmodule.c claims compatibility with the per-interpreter GIL, so in that case I think it's a straight-up bug.

The check_multi_interp_extensions flag has also been set, so any modules known not to work with a per-subinterpreter GIL should refuse to import before running any init code anyway.

Edit: Oops, perhaps I misunderstood your comment, and it wasn't directed at the posixmodule issue.
Indeed, the documentation for Py_NewInterpreterFromConfig says that the GIL must be held for that call, and the own-GIL case does interesting things with the GIL-holding state of both the new and current interpreter:

> Like all other Python/C API functions, the global interpreter lock must be held before calling this function and is still held when it returns. Likewise a current thread state must be set on entry. On success, the returned thread state will be set as current. If the sub-interpreter is created with its own GIL then the GIL of the calling interpreter will be released. When the function returns, the new interpreter's GIL will be held by the current thread and the previously interpreter's GIL will remain released here.

So I'm not clear if the given example is doing the right thing here or not, but I suspect the first thing the tasks should do is take the main interpreter's GIL if not already holding it. (And the call to Py_Finalize should do that too, in that case.) Poking around: because it's a fresh thread, maybe it doesn't need the main GIL to call Py_NewInterpreterFromConfig, and so this code is GIL-safe as is. I'm not clear if the new per-subinterpreter-GIL support changed things around here, and if so, how it changed.
I'm guessing that's the problem. Judging by the comment, it looks like new_interpreter doesn't support being called without the GIL: https://github.com/python/cpython/blob/bffed80230f2617de2ee02bd4bdded1024234dab/Python/pylifecycle.c#L2261-L2263
I could see this being a bug, though. I'll wait for Eric to clarify.
Well this definitely brings me some concern then, as the GIL can be made optional in 3.13 😅
Does this occur on the free threaded build?
Sorry for the delay. Had a busy work week. I'll gather this info this weekend, though the code seems to indicate "yes" from a glance.
@ZeroIntensity I can confirm the issue still happens when the GIL is disabled in 3.13 and 3.14 HEAD
For now, I think I'm going to write this off as user error, as Py_NewInterpreterFromConfig (and friends) doesn't seem to support being called without the GIL. Though, it's a bit confusing as to why this only fails on Windows. Does it still occur if you call PyGILState_Ensure (and subsequently PyGILState_Release) before initializing the subinterpreter?
How can those work if the GIL is disabled (and thus can't be engaged) on 3.13 and later, and also each new interpreter is having its own GIL created?
You cannot call PyGILState_Release on these threads, because when you call Py_NewInterpreterFromConfig it sets the new interpreter as the current one for said thread. This would require swapping to the old state to unlock it and then swapping back to the new state.
> How can those work if the GIL is disabled (and thus can't be engaged) on 3.13 and later, and also each new interpreter is having its own GIL created?

AFAIK, PyGILState_* calls don't actually acquire the GIL on free-threaded builds; they just initialize the thread state structure to allow calling the API.
At a glance, the drop-in code would be:
void execute () {
    std::vector<std::jthread> tasks { };
    tasks.reserve(MAX_STATES);
    for (auto count = 0zu; count < tasks.capacity(); count++) {
        std::println("Generating thread state {}", count);
        tasks.emplace_back([count] {
            PyGILState_STATE gil = PyGILState_Ensure(); // Make sure we can call the API
            if (auto status = Py_NewInterpreterFromConfig(&state, &config); PyStatus_IsError(status)) {
                std::println("Failed to initialize thread state {}", count);
                PyGILState_Release(gil);
                return;
            }
            auto text = std::format(R"(print("Hello, world! From Thread {}"))", count);
            auto globals = PyDict_New();
            auto code = Py_CompileString(text.data(), __FILE__, Py_eval_input);
            auto result = PyEval_EvalCode(code, globals, globals);
            Py_DecRef(result);
            Py_DecRef(code);
            Py_DecRef(globals);
            Py_EndInterpreter(state);
            state = nullptr;
            PyGILState_Release(gil); // Release the GIL (only on non-free-threaded builds)
        });
    }
}
There needs to be a Py_BEGIN_ALLOW_THREADS in there, though. Normally, that would happen over the calls to join the threads, but I'm assuming that happens in a destructor here -- I'm not too familiar with C++ threading, you'll have to figure out where to put that.
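For illustration, one placement that matches that suggestion (a sketch only, assuming the MRE's main() initialises Python, calls execute(), and then finalises):

```cpp
int main() {
    Py_Initialize();            // however the real MRE initialises Python
    Py_BEGIN_ALLOW_THREADS;     // release the main thread state while the workers run
    execute();                  // the std::jthread destructors join the workers in here
    Py_END_ALLOW_THREADS;       // reacquire it before finalisation
    Py_Finalize();
    return 0;
}
```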
It shouldn't require any of that. The interpreter variable is local to the thread. The Py_BEGIN_ALLOW_THREADS macro is required when manipulating the GIL. There is no GIL manipulation needed, because each of these is an isolated thread running on its own, and even when the GIL is disabled there shouldn't be any manipulation of the GIL occurring regardless, because it's completely turned off.

The need for PyGILState_STATE goes away when using isolated subinterpreters. The state is only needed if you're expecting these subinterpreters to talk to each other. But these interpreters are running entirely on their own with no interaction between each other. Were I trying to register additional threads under a specific interpreter instance, then your code would be warranted. However, the documentation for these functions clearly says

> Note that the PyGILState_* functions assume there is only one global interpreter (created automatically by Py_Initialize()). Python supports the creation of additional interpreters (using Py_NewInterpreter()), but **mixing multiple interpreters and the PyGILState_* API is unsupported**.

(emphasis mine)

So, either the documentation is wrong, or Python's C API is wrong. Either way, there is a bug here.
> It shouldn't require any of that. The interpreter variable is local to the thread.

Yes, but how do you initialize that interpreter variable? I don't think new_interpreter does it (unless I'm missing something -- please clarify!)

I think you're misinterpreting what I'm saying: you're right, you don't need to mess with the GIL when using the subinterpreter, but it seems (based on the code from new_interpreter) that you need to hold it in order to create a subinterpreter. But that could be wrong, I'm just going off a comment. I'm still not sure why it only happens after a certain number of threads -- it should happen on the first thread, and it certainly shouldn't be Windows-specific.
Yes, it is initialized; that's what auto status = Py_NewInterpreterFromConfig(&state, &config); does. If it didn't work, status would be an error.
I'm very certain this is to do with memory corruption and memory blocks within the debug assertions not being locked or cleared out properly, which is leading to this invalid state occurring. Why it only happens on Windows most likely has to do with behavior of the MSVC runtime in some capacity, and some behavioral difference between Windows memory pages and the virtual allocator versus POSIX platforms surfacing.

That's the best I've been able to intuit, as the crash is indeterminate in nature, though as I stated in an earlier comment, ThreadSanitizer found some race conditions brought on by qsort on Linux.
> If it didn't work, status would be an error.

It would probably segfault instead, FWIW -- C API calls don't have an extra check for the interpreter or thread state. I don't see where in new_interpreter it initializes the thread state information, so I would think that the failure would be in _PyThreadState_GET. However, that doesn't seem to be the case, somehow.
Speculatively, the race could be caused here: https://github.com/python/cpython/blob/bffed80230f2617de2ee02bd4bdded1024234dab/Python/pylifecycle.c#L2251
Trying to set this across multiple threads could be problematic, since there's no lock.
If any of those failures were happening there, this same behavior would be observable in a release build. This situation only happens when the _d Python library is used.
The whole issue overall screams data race to me; it's quite odd that this only occurs on Windows debug builds. I would guess that this is present in release builds as well, just maybe not with this reproducer. Regardless, I'm like 70% sure that a lock should be acquired when writing to the runtime state, through HEAD_LOCK. It might not be the cause of this issue, but that looks like a bug to me nonetheless. Does adding HEAD_LOCK and HEAD_UNLOCK to new_interpreter fix the problem?
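Something like this, as the experiment (a sketch of a test patch against CPython's internals; the exact statement being guarded is the unsynchronised runtime-state write linked above, shown here with an illustrative field name, not the real one):

```cpp
// Hypothetical test patch inside new_interpreter() in Python/pylifecycle.c,
// not a proposed fix: take the runtime "head" lock around the shared write.
HEAD_LOCK(runtime);
runtime->gilstate.check_enabled = 0;   // illustrative stand-in for the real write
HEAD_UNLOCK(runtime);
```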
Having a quick poke around (in 3.13 on GitHub): new_interpreter initialises the PyThreadState** passed in, or nulls it on failure. It appears fully prepared to handle a null current-thread state and the "new interpreter has its own GIL" state (although the comments for this logic appear to still assume that this only happens in the first-time "main interpreter" case), and HEAD_LOCK is used around the logic that adds the new interpreter to the global list of interpreters. I don't think that single write to disable GIL checking needs a lock, even if this was going wrong somehow.

Basically, the things that PyGILState_Ensure does to make a thread Python-enabled are also done by new_interpreter. There's a call to bind_gilstate_tstate in both paths which stores the tstate into the TLS variable which the PyGILState_* API relies upon.

So I don't think the original code breaks the actual preconditions, at least that I noticed in browsing, and hence the docs that suggest a GIL must exist for calls to Py_NewInterpreterFromConfig probably need to be clarified to note that this only applies to Python-enabled threads, and that on a "non-Python created" thread, you don't need a current thread state, and actually can't have one, as the non-existence of a thread state is what defines a "non-Python created" thread.
The comment "but mixing multiple interpreters and the PyGILState_*
API is unsupported." is also a "here be dragons" coarse version of the clearer comment in the per-interpreter GIL docs:
Also note that combining this functionality with
PyGILState_*
APIs is delicate, because these APIs assume a bijection between Python thread states and OS-level threads, an assumption broken by the presence of sub-interpreters. It is highly recommended that you don’t switch sub-interpreters between a pair of matchingPyGILState_Ensure()
andPyGILState_Release()
calls. Furthermore, extensions (such asctypes
) using these APIs to allow calling of Python code from non-Python created threads will probably be broken when using sub-interpreters.
The converse of that last paragraph is that PyGILState_*
should work fine when you are tying sub-interpreters to OS-level threads, as seen in this example. So we're not even in conflict with the docs, there's just some parts of the documentation that don't explicitly take account of the "one-(sub)interpreter-per-thread" case.
https://github.com/python/cpython/issues/59956#issuecomment-1749152745 seems to agree with the above, as it notes that PyGILState_Ensure() should no longer crash or do the wrong thing with sub-interpreters, but does not yet support controlling which interpreter is used when it creates a new thread state; that always uses the main interpreter. That wouldn't be a problem in the structure used here, as Py_NewInterpreterFromConfig is creating the new thread state, and then any further theoretical PyGILState_* calls are safe and will use that interpreter, from the current thread state which points to it. So I don't expect problems from extension modules manipulating the GIL in this structure either, unless they are spinning up their own threads and calling PyGILState_* APIs. (That's #59956's decade-old unresolved use-case, so it's not new with any of the recent changes.)
That said, adding PyGILState_Ensure() before Py_NewInterpreterFromConfig would not be incorrect (just wasteful...), and if that fixed things, it would help point towards the source of the issue. (Conversely, if it doesn't help, that probably rules out a class of issues.) Py_NewInterpreterFromConfig is documented as releasing the GIL held when it was entered, so no matching PyGILState_Release() or Py_<VERB>_ALLOW_THREADS calls should be needed. I couldn't immediately see where this was done from new_interpreter though, so I'm trusting the docs on that one.
After some debugging, it seems I was wrong! HEAD_LOCK and PyGILState_Ensure don't seem to fix the data race on Linux.
Looking at the memory dump data from p=00000267272D6030, the pre-buffer bytes appear to be E8 03 00 00 00 00 00 00 70 a0 00 00 fd fd fd fd. That should be a big-endian size_t "bytes-allocated", then one of the chars 'r', 'm', or 'o', followed by seven 0xfd.

Instead it looks like we got overwritten with a 32-bit little-endian 0x00000000000003e8 (1000) followed by 16-bit 0x0000a70 (2672). (That first one could also be 16-bit 1000 followed by 16-bit 0, no way to tell. We don't see the preceding bytes either. They don't look like ASCII to me, anyway.) The 1000 is nice and round, but neither one jumps out with any idea of what might be overwriting that data.
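For reference, the expected layout around a pointer p returned by the debug allocator, paraphrasing the block map described in Objects/obmalloc.c (S = sizeof(size_t); the exact wording there may differ):

```cpp
// Debug-allocator block layout (paraphrased, not the literal CPython comment):
//
//   p[-2*S : -S]    requested size N, stored as a big-endian size_t
//   p[-S]           API id character: 'r' (raw), 'm' (mem) or 'o' (object)
//   p[-S+1 : 0]     S-1 bytes of 0xfd (PYMEM_FORBIDDENBYTE)
//   p[0 : N]        the caller's data
//   p[N : N+S]      S bytes of 0xfd
//   p[N+S : N+2*S]  allocation serial number, as a big-endian size_t
```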
So no obvious clue as to whether this only happens in the debug build, or is only detected in the debug build. In release builds there's no bookkeeping in live blocks or release-time validity checking, so if the overwrite is happening in non-debug builds we may never see evidence of it. The first thing in a PyObject is the ssize_t ref count, so in release builds an overwrite like this would prevent the overwritten object from being deallocated, but not otherwise mess with anything off-hand. (GIL-disabled builds look like they would be more damaged by overwriting the first uintptr_t of the object, but it may not be visible depending on the overwrite and the structure of that uintptr_t.)
Since TSan isn't flagging this, I suspect these are overlapping objects from the same thread: an object is either being allocated from the wrong pool, or is writing past its end bounds and hitting the next object in the pool. (Perhaps the damaged object is raw-alloc'd? I don't know if I'd expect it to be 16-byte aligned, but maybe I'm wrong to assume a malloc implementation would generally provide size_t-aligned blocks. The object allocator does return 16-byte aligned blocks though.)
Anyway, I expect if you can catch the failure in a debugger, you could walk backwards a bit in memory and find the actual object causing the problem, if that is a PyObject, and it's not itself corrupted. Since this appears to be triggered by the sub-interpreters, I wonder if it's something in the "non-main obmalloc" support, e.g. one of the on-heap obmalloc support structures, which in the main obmalloc live inside a static object.

The preceding object probably isn't a PyObject, since that should end with a size_t of 0xfd and we can only see 4 0xfd in the dump (if the two paddings overlapped). Since the corrupted object is being deallocated, its ref count wasn't overwritten with another four 0xfd bytes. (Or both objects are live and are busy overwriting each other's padding in turn in a nightmare of undebuggability... indebuggability... one of those must be a real word? Anyway, I think this shouldn't happen in the same thread, so barring more evidence, not worrying too much about that case.)
Which leans me harder towards "one of the obmalloc tracking structures on the heap blows out its bounds and corrupts the first-allocated PyObject from the first-created pool, which immediately follows it". That points the finger at pool_header, I think, and it helpfully ends with three uints. But I have only skimmed this code, and those are always on the heap anyway, which fails an earlier assumption I made to get to this point. And the actual values observed don't really fit those last three uints anyway. And 0x00000267272D6030 (minus two size_t's) doesn't seem right for the first block in the pool, which is a 16 KiB-aligned block with 4 pointers and then 4 uints in the header structure. So for this to be the result means there's something badly wrong with obmalloc already, and this has just revealed it.

So maybe a dead-end after all.
Is it possible to reproduce this failure with use_main_obmalloc set to true, in the free-threaded build? Or is the "own GIL requires non-main obmalloc" rule still enforced in that case?
> After some debugging, it seems I was wrong! HEAD_LOCK and PyGILState_Ensure don't seem to fix the data race on Linux.

Wait, do you have a repro case on Linux? Or were you looking at the TSan failure? (If the latter, put a critical section of some kind around the qsort calls and it should be fixed. Not the best fix, but at least it proves or disproves the point.)
I reproduced using the MSVC++ 17.11 Community edition compiler (under VS Code), and the clearest failure is

HEAP[463-interpreters.exe]: Invalid address specified to RtlValidateHeap( 000001F0B27D0000, 000001F0BA20AE30 )

from a PyObject_Free call; so far I've noticed it inside dict-resize and list-resize operations during the initial "site" import during new_interpreter. The first parameter is the heap address, the second is the address being checked. I am not sure, but I believe it's failing because that address was not allocated from the heap in the first place. (The pointer passed to free is actually 0x30 later than the above address, so I assume that's MSVC heap-management metadata before the allocation; if this wasn't a heap allocation, then it's nonsense.)
Anyway, what I guess is going on is that the object is being free'd in the wrong thread, so when _PyObject_Free asks pymalloc_free to free the object, the latter comes back with "not my allocation", and so it's passed to PyMem_RawFree, which, if not stopped by RtlValidateHeap as seen here, will presumably end up back on the RtlHeap free-list and (in debug builds) have its management data blatted, and possibly be given back out later for someone else to stomp on as a new heap allocation. (That blatting is why I suspect this only reproduces in debug builds, as this program may not run long enough to get that block back as a new allocation.)

The free'd object is generally on-the-way-out; in the list and dict resize operations it was the "oldXXX" pointer.
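Paraphrasing that suspected path as code (this is the shape of the logic described above, not the literal CPython source):

```cpp
// Sketch of the suspected deallocation path, paraphrased from the description above.
void object_free_sketch(void *ctx, void *p) {
    if (!pymalloc_free(ctx, p)) {   // pool allocator says "not my allocation"
        PyMem_RawFree(p);           // so the pointer falls through to the raw heap
                                    // free, which is where RtlValidateHeap objects
    }
}
```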
I expect that the reason RtlValidateHeap fails is that this address is inside a heap allocation being used for an arena, not at the start of the allocation. This repro seems to occur reliably with only 50 threads; I saw it with 20 threads but not reliably, and was unable to reproduce it with 15 threads on my 16-core NUC, but I didn't try really hard. Sometimes it misses the RtlValidateHeap check but fails _PyMem_DebugCheckAddress, suggesting that RtlValidateHeap isn't perfect.
I just now had an RtlValidateHeap failure happen in PyThread_free_lock twice, which is actually supposed to be a direct raw free, not a pool object being accidentally passed to _PyMem_RawFree, so I'm not sure why that failed. Even if that was freed from the wrong thread or wrong interpreter, it shouldn't have been an issue.
Edit: Thinking about this, I realised (and tested) that the repro does not require any calls between Py_NewInterpreterFromConfig and Py_EndInterpreter.
Edit again (last one tonight):
I think the following is the most-isolated example of a failure case:
marshal.c:1830 (Python 3.12)
if ((rf.refs = PyList_New(0)) == NULL)
    return NULL;
result = read_object(&rf);
Py_DECREF(rf.refs);
The file itself is 44kb, in case it turns out to be something about a particular file. Although I hit it again with a smaller buffer (and smaller object being freed) so probably not.
Inside a block of C code, we create a new list, stash it somewhere not visible from anywhere else, pass it into a function, and then free it.
During the free, I'm seeing the list had 465 items in it, and the failure came when decrefing item 128 (going from the back). I don't know what type that item had, as its data was already blatted by the raw-free logic after passing _PyMem_DebugCheckAddress (so it wasn't itself blatantly corrupted); however I could see that it used PyObject_Free in its type's tp_free member, and it was an 1183-byte object, so I assume it's a var-size object that went through a realloc itself somewhere, but is not itself a container of PyObjects. (Maybe a string or buffer? It's importing from codecs or io-related modules in the marshal.c repro case instance, AFAICT.)
Given the size, it would not have been in the pool allocator but should have fallen through to heap allocation, so I'm now pretty doubtful about my earlier "freeing in wrong obmalloc" theory, and unsure what exactly RtlValidateHeap is complaining about.
That said, since this is an unmarshal, I believe it can reference things that already exist, so maybe it's making itself visible to another thread by this mechanism?
I'm wondering if the Linux and Windows parts are separate issues. The data race occurs on just two threads and seems to happen in Py_NewInterpreterFromConfig, not a deallocator.
@ZeroIntensity If you're talking about the TSan failure seen in posixmodule on Linux, then yes, that's a separate issue revealed by the same code. It's trivially a static array being passed to qsort from multiple threads at once, with no synchronisation. The fix used in similar other cases (search for qsort in the bug tracker) was to make the lists correctly sorted in the source code rather than sorting during init.
I'll create a separate issue later today then, thanks!
Okay, I have a more-isolated example which points to maybe there just being a nasty bug in the debug allocator support somewhere.
I just got the RtlValidateHeap failure in _extensions_cache_get on the PyMem_RawFree(key); call:
static PyModuleDef *
_extensions_cache_get(PyObject *filename, PyObject *name)
{
    PyModuleDef *def = NULL;
    void *key = NULL;
    extensions_lock_acquire();

    if (EXTENSIONS.hashtable == NULL) {
        goto finally;
    }

    key = hashtable_key_from_2_strings(filename, name, HTSEP);
    if (key == NULL) {
        goto finally;
    }
    _Py_hashtable_entry_t *entry = _Py_hashtable_get_entry(
            EXTENSIONS.hashtable, key);
    if (entry == NULL) {
        goto finally;
    }
    def = (PyModuleDef *)entry->value;

finally:
    extensions_lock_release();
    if (key != NULL) {
        PyMem_RawFree(key);
    }
    return def;
}
That key value comes from hashtable_key_from_2_strings, which allocated the key with char *key = PyMem_RawMalloc(size);. Everything except the PyMem_RawFree call is inside the lock, and the binary data at the address specified by key is as below: the _codecs:_codecs line is the data at key, the other lines show the surrounding hex data:
28 00 00 00 00 00 00 00 9d 6e 00 00 fd fd fd fd
00 00 00 00 00 00 00 10 72 fd fd fd fd fd fd fd
_codecs:_codecs
fd fd fd fd fd fd fd fd fd fd fd fd 2e 00 2a 00
Everything except the PyMem_RawFree happened inside the extension lock, which is

PyThread_acquire_lock(_PyRuntime.imports.extensions.mutex, WAIT_LOCK);

and

PyThread_release_lock(_PyRuntime.imports.extensions.mutex);

i.e., it's global, not per-interpreter.
Oh, erk. Looking at the stack trace, it ends with:
<crt stuff...>
python312_d.dll!_PyMem_RawFree(void * _unused_ctx, void * ptr) Line 73 (c:\Users\paulh\OneDrive\Documents\Python\cpython\Objects\obmalloc.c:73)
python312_d.dll!PyMem_RawFree(void * ptr) Line 685 (c:\Users\paulh\OneDrive\Documents\Python\cpython\Objects\obmalloc.c:685)
python312_d.dll!_extensions_cache_get(_object * filename, _object * name) Line 988 (c:\Users\paulh\OneDrive\Documents\Python\cpython\Python\import.c:988)
but that's wrong. PyMem_RawFree is supposed to call _PyMem_DebugRawFree, not _PyMem_RawFree. _PyMem_RawFree is literally just free, but _PyMem_DebugRawFree first walks backwards two words, calls _PyMem_DebugCheckAddress, and passes that new address on to the underlying raw free (_PyMem_RawFree).
And we can see in the memory dump that the _PyMem_DebugMalloc tracking data is correctly present (prefix: big-endian size_t 0x10, then r, then 3 0xfd; suffix: 8 0xfd's). You can then see the MSVC debug heap guards around that (4 0xfd's either side, and I assume that 0x28 at the start is the MSVC-visible allocation size: 0x10 for the Python data plus 3 size_t's for the _PyMem_DebugMalloc tracking data).

So in short, I think the underlying issue is that something is causing PyMem_RawFree to call _PyMem_RawFree instead of _PyMem_DebugRawFree, and that's corrupting the heap.
#define _PyMem_Raw (_PyRuntime.allocators.standard.raw)
//...
void PyMem_RawFree(void *ptr)
{
    _PyMem_Raw.free(_PyMem_Raw.ctx, ptr);
}
I confirmed with a watch in VS Code that _PyRuntime.allocators.standard.raw.free is 0x00007fff01e32ca7 {python312_d.dll!_PyMem_DebugRawFree}, so the right version of _pymem_allocators_standard_INIT appears to be being executed (and clearly at allocation time we got the right allocator), but I'm wondering if a bug in another thread temporarily installed the non-debug allocator.

A quick test shows alloc_for_runtime does indeed install the non-debug allocators temporarily, but that's before our threads exist. Same for _clear_preinit_entries and _Py_SetArgcArgv.
I wasn't able to catch anything changing allocators during runtime, but I did reproduce the other version of the failure, where _PyMem_DebugRawFree sees bad data. And sure enough, right before the pointer being free'd there's just four 0xfd, which is the MSVC guard-rail, meaning this block was allocated by _PyMem_RawMalloc (i.e. a direct call to malloc) and not by _PyMem_DebugRawMalloc, which would have put two words of tracking data immediately before the pointer.

So yeah, I reckon that's the issue, and my best guess is that something in new_interpreter is temporarily switching allocators.

(This might have been easier to spot earlier if Python and MSVC weren't using the same byte value (0xfd) for their guard value: what I earlier thought was an overwrite of the Python memory tracking data was actually the MSVC tracking data: LE size_t of the allocation length, a 16-bit request number, and four 0xfd's.)
tl;dr: It seems that something called under new_interpreter is installing the (global) non-debug allocators temporarily while other threads are running, and this causes some allocations to be allocated or free'd with the non-debug allocator, mismatching with the other operation in the debug allocator.

So this only affects debug builds (although using a custom allocator would also reproduce this in release builds, I guess), but should also happen on Linux AFAICT; it would just need a debug allocator like MSVC's to actually catch it. (There might be a confounding factor, like the faulty code being Windows-only; I haven't tracked that down.)
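As a timeline sketch of that mismatch (the thread labels and exact calls are illustrative; the key point is that the allocator table being swapped is process-global, not per-interpreter):

```cpp
// Illustrative interleaving only, not actual CPython code paths.
//
//   Thread A (inside new_interpreter)         Thread B (another sub-interpreter)
//   ---------------------------------         ----------------------------------
//   _PyMem_SetDefaultAllocator(
//       PYMEM_DOMAIN_RAW, &old);              // debug hooks globally removed
//                                             p = PyMem_RawMalloc(n);  // plain malloc
//   PyMem_SetAllocator(
//       PYMEM_DOMAIN_RAW, &old);              // debug hooks restored
//                                             PyMem_RawFree(p);  // debug free on a
//                                             // plain-malloc block: fails
//                                             // _PyMem_DebugCheckAddress (the reverse
//                                             // pairing instead trips RtlValidateHeap)
```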
Edit: Got it. _Py_ClearStandardStreamEncoding, called from new_interpreter -> init_interp_main -> init_sys_streams, is the culprit in 3.12. However, that was removed in 3.13 by #105154, so I guess something else must be calling _PyMem_SetDefaultAllocator inside new_interpreter to reproduce this in 3.13 onwards.
Last edit for the night: I had a quick go with 3.13.0rc1, and the same code and config hangs, because the import machinery now tries to switch to the main thread in import_run_extension, and everyone's waiting for the GIL. The main thread is of course waiting in std::thread::join in the vector destructor.

Making this change in main() fixes the hang:
Py_BEGIN_ALLOW_THREADS;
execute();
Py_END_ALLOW_THREADS;
And now it doesn't seem to reproduce.
Since the problematic code was removed in 3.13, I suspect the 3.13/HEAD repro case seen in https://github.com/python/cpython/issues/123134#issuecomment-2308993008 is only when the GIL is disabled, and is either some GIL-disabled-only codepath triggering the same allocator/mismatch in other threads, or some other GIL-disabled bug with the same/similar symptom.
I tried adding set(Python_FIND_ABI "ANY" "ANY" "ANY" "ON") before the find_package call in CMake, and target_compile_definitions(${PROJECT_NAME} PRIVATE Py_GIL_DISABLED=1) at the end of the CMakeLists.txt, and now all the threads have hung in _PyEval_EnableGILTransient, the call before import_run_extension which is only in the non-GIL builds, waiting on a semaphore. So I think I'd need to better understand how GIL-disabled works before I can track down the 3.13 repro case mentioned above.
Unrelated, but I cannot reproduce the data race in qsort under TSan. Instead, TSan reports a segfault, but doesn't say where -- running under a debugger says the segfault is somewhere inside TSan itself, so I suspect there's something fishy happening on that front.
I may actually just have been too impatient last night with the free-threading test.
If I wait, it seems to eventually complete. There's a bit of a bottleneck in importing: create_builtin -> _PyEval_EnableGILTransient -> _PyEval_StopTheWorldAll needs to block every thread for every builtin module import in any thread, and this seems to scale badly as more threads complete their initialisation, but does eventually complete AFAICT. (Debug MSVC concurrency primitives are probably not helping performance here either...)
This world-stoppage has the effect of serialising all the new_interpreter calls immediately after their GIL is created, even with a private GIL, since their first action under the new GIL is to import _imp via pycore_interp_init -> _PyImport_InitCore, which eventually hits create_builtin as above.
There's a note in _PyEval_EnableGILTransient (only used by the import mechanism):

// This could be an interpreter-local stop-the-world in situations where we
// know that this interpreter's GIL is not shared, and that it won't become
// shared before the stop-the-world begins. For now, we always stop all
// interpreters for simplicity.
_PyEval_StopTheWorldAll(&_PyRuntime);

so maybe this will be improved in future, since at least during new_interpreter, I think the above conditions hold?
In the real world, this would be a thread-startup cost, and new interpreters already have to call _PyEval_StopTheWorldAll at startup and finalisation to manage the "list of all threads", so it's not awful.
So at this point, I think this issue's original bug is actually "resolved-by-accident in 3.13", pending a new repro case. I'm not sure if it's interesting to fix in 3.12; it only affects debug builds but it is a potential memory-corruption issue. Since the functions were removed in 3.13 and previously deprecated, simply gutting them in 3.12 and complaining loudly if they are called would be viable, perhaps.
The qsort issue should be reproducible in 3.13, but it's quite possible that it has no actual effect despite being detectable by TSan, beyond wasting time with repeated qsort calls on already-sorted data. It depends on how well or badly qsort handles being run on the same array in multiple threads.
@TBBle Small note, btw, per your comment and the find_package support for Py_GIL_DISABLED=1: CMake 3.30.3 was released in the last few days and has this as a fix now. 🫡
PyGILState* in isolated subinterpreters

@TBBle, it might be useful to create a new issue for the allocator problem to get some fresh eyes on it. It's buried in the comments of this issue right now :(

@bruxisma, could you try to run the following code on your Windows machine?
I think protecting the creation of a new subinterpreter with an ordinary POSIX mutex (or whatever C++ code uses, I don't know C++) should work to resolve this. If it does indeed work, it's perhaps worth documenting, and it would be a small ask of the users of the API (unlike, for example, asking them to hold the global GIL and then, after the subinterpreter swaps it in for its own GIL, somehow swapping it back to the original GIL to release -- it's a bit of a mess).

I would be curious to know what the results of the test are if you uncomment the mutex locking/unlocking.

A more interesting test case (I think), which I included in the code below, is concurrently hammering each subinterpreter with a lot of small C calls from lots of threads, to check whether there really is good isolation of subinterpreters after creation.

I'm on macOS and the code works with or without the mutex:
#include <Python.h>
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

// Define the number of threads to create
#define NUM_THREADS 500
// Define the upper limit for the range in our Python calculation
#define RANGE_LIMIT 10000000

// Configure the Python interpreter
PyInterpreterConfig config = {
    .check_multi_interp_extensions = 1, // Check for multi-interpreter compatible extensions
    .gil = PyInterpreterConfig_OWN_GIL, // Each interpreter gets its own GIL
};

// Define a struct to hold the result of each thread's execution
typedef struct {
    int error_code;
    char error_message[256];
} ThreadResult;

// Declare a global mutex to protect interpreter creation
pthread_mutex_t interpreter_mutex = PTHREAD_MUTEX_INITIALIZER;

// Function that each thread will run
void *run_python_code(void *arg) {
    // Cast the argument to ThreadResult pointer
    ThreadResult *result = (ThreadResult *)arg;
    // Initialize the result
    result->error_code = 0;
    result->error_message[0] = '\0';

    PyThreadState *tstate = NULL;

    // Lock the mutex before creating a new interpreter
    pthread_mutex_lock(&interpreter_mutex);
    // Create a new Python interpreter
    PyStatus status = Py_NewInterpreterFromConfig(&tstate, &config);
    // Unlock the mutex after creating the interpreter
    pthread_mutex_unlock(&interpreter_mutex);

    // Check if interpreter creation was successful
    if (PyStatus_Exception(status)) {
        result->error_code = 1;
        snprintf(result->error_message, sizeof(result->error_message),
                 "Couldn't spawn new interpreter");
        return NULL;
    }

    // Declare Python objects we'll use
    PyObject *pLocal, *pSum, *pRange, *pIter, *pItem;

    // Create a new dictionary for local variables
    pLocal = PyDict_New();
    // Create a Python long object initialized to 0
    pSum = PyLong_FromLong(0);
    // Add 'sum' to the local dictionary
    PyDict_SetItemString(pLocal, "sum", pSum);

    // Create a range object
    pRange = PyObject_CallFunction(PyObject_GetAttrString(PyImport_AddModule("builtins"), "range"), "l", RANGE_LIMIT);
    // Get an iterator from the range
    pIter = PyObject_GetIter(pRange);

    // Iterate over the range
    while ((pItem = PyIter_Next(pIter))) {
        // Add the current item to the sum
        PyObject *pNewSum = PyNumber_Add(pSum, pItem);
        Py_DECREF(pSum);
        pSum = pNewSum;
        Py_DECREF(pItem);
    }

    // Print the final sum
    PyObject_Print(pSum, stdout, 0);
    printf("\n");

    // Clean up Python objects
    Py_DECREF(pLocal);
    Py_DECREF(pSum);
    Py_DECREF(pRange);
    Py_DECREF(pIter);

    // End the interpreter
    Py_EndInterpreter(tstate);

    return NULL;
}

int main() {
    // Initialize the Python interpreter
    Py_Initialize();

    // Declare arrays to hold thread IDs and results
    pthread_t threads[NUM_THREADS];
    ThreadResult results[NUM_THREADS];
    int num_threads_created = 0;

    // Create threads
    for (int i = 0; i < NUM_THREADS; i++) {
        int ret = pthread_create(&threads[i], NULL, run_python_code, &results[i]);
        if (ret) {
            fprintf(stderr, "Error creating thread %d: %d\n", i, ret);
            break;
        } else {
            printf("Successfully created thread %d:\n", i);
        }
        num_threads_created++;
    }

    // Wait for all created threads to finish
    for (int i = 0; i < num_threads_created; i++) {
        pthread_join(threads[i], NULL);
        // Print any errors that occurred in the thread
        if (results[i].error_code) {
            fprintf(stderr, "Thread %d error: %s\n", i, results[i].error_message);
        }
        printf("Thread %d completed with error code: %d\n", i, results[i].error_code);
    }

    // Destroy the mutex
    pthread_mutex_destroy(&interpreter_mutex);

    // Finalize the Python interpreter
    Py_Finalize();

    return 0;
}
@picrin Which version(s) of Python did you try this code with? Also, you did test with a debug Python build, correct?
The mutex approach only protects the interpreter creation, which before Python 3.13 always ended up calling _PyMem_SetDefaultAllocator, but since Python 3.13 no longer appears to always do so.

However, other APIs call _PyMem_SetDefaultAllocator without a global lock, and so a repro case that calls one of those APIs in the multi-interpreter stage should be able to reproduce the issue in Python 3.13 or later.

I don't have pthreads set up for Windows so can't test your code myself, but I will note that in Python 3.13, the free-threaded build already implicitly mutexes interpreter creation across all threads, because it takes a runtime-wide lock in the import machinery.
It's Python 3.12; this is how I compile the code on my installation of macOS:
gcc -L /usr/local/Frameworks/Python.framework/Versions/3.12/lib/ -I /usr/local/Frameworks/Python.framework/Versions/3.12/Headers/ -l python3.12 subinterpreters.c
I don't have a Python debug build, so it's just a regular Python build. I just read the thread more carefully and realised this only affects the debug build -- sorry!

I think, as per your explanation, this likely doesn't work as a fix -- especially if the underlying C API call is already doing this.

Thanks for explaining!
@TBBle After quite a bit of investigating, it looks like the allocation issue is actually a duplicate! See #116510 (the fix was #113412).

I was finally able to reproduce this issue, but it seems to occur only under specific conditions relating to allocation -- calling append on a list did it in my case.

That's why this doesn't occur on 3.13 -- it's actually fixed there. Unfortunately, there seems to be hesitation to backport the fix because it's quite complicated.
After a quick look on my phone, I don't think that's the same issue, since it happens in release builds, and seems to involve confusion about shared obmalloc state between legacy subinterpreters.

In this issue's case, each interpreter has its own obmalloc state, but in debug builds they resolve down into the runtime's raw allocator, and in this repro, allocation and deallocation are mispaired due to the raw allocator being switched by another thread.

I have to look more closely, but I suspect we avoided that other issue in 3.12 by not importing any native modules. Or maybe dying too quickly.
It might be a separate cause, but from what I can see, isolated subinterpreters are pretty broken on 3.12 -- debug builds possibly just make it easier to occur. I haven't read this entire thread; does this occur on debug builds for 3.13?
I don't have a repro for 3.13, but I expect it's possible. It only affects debug builds (or embedded Python with a custom raw allocator) due to the nature of the bug.
Here's the summary of this bug as I understand it:

- The failure is allocations and deallocations being mismatched between the Python debug raw allocator and the plain malloc/free.
  - Memory allocated by the Python debug allocator is passed to the plain free: Probably a no-op in many cases (a harmless memory leak), but the MSVC debug allocator in particular detects this and throws a debug break from RtlValidateHeap if you have a debugger attached. This version is the one that almost always reproduced in my testing of this repro.
  - Memory from a plain malloc is passed to the Python debug allocator's deallocation function: This should be caught in _PyMem_DebugCheckAddress and reported via _PyObject_DebugDumpAddress on all platforms. I haven't looked into why this didn't reproduce on Linux for anyone, but I only saw it very rarely in my tests (or if I ignored the RtlValidateHeap failure and continued the run), so maybe we just didn't try hard enough? (Or maybe there was a per-platform code or thread-behaviour difference that defanged this reproduction, e.g., the relevant code might be compiled out or optimised down to a no-op on Linux, reducing the repro window to a handful of instructions.)
- The mismatch happens because sub-interpreter creation in 3.12 temporarily installs the default (non-debug) raw allocator via _PyMem_SetDefaultAllocator (and a paired reversion function), and other threads may allocate or deallocate during this time.
- _PyMem_SetDefaultAllocator still happens in release builds, but since that's normally a no-op, there's no symptom.
- The 3.12 culprit (_Py_ClearStandardStreamEncoding) was removed in #105154.
- A number of _PyMem_SetDefaultAllocator calls still exist in the Python 3.13 code base, but it is no longer always called during sub-interpreter creation, so this simple repro case won't reproduce it. I haven't broken out a new bug report for this issue because I haven't had time yet to try and reproduce it in 3.13 with a new repro case.
- It's possible that every remaining call to _PyMem_SetDefaultAllocator happens during either main-interpreter initialisation or main-interpreter teardown, in which case this bug would not reproduce in 3.13 onwards, entirely by accident.

Clarifying note: I have not tried this on Linux, and reading back in the ticket, I don't see that anyone stated they tried the given repro case in a debug build on either Linux or MacOS. So maybe this is not actually Windows-only, just undertested on other platforms. (Does any MacOS or Linux Python distro ship a debug build that this could easily be tested against?)
Late update: A debug build is not necessary, just the debug memory hooks (which function like a custom allocator as I surmised above). As such, I was able to reproduce this on a Python 3.12 Linux package from Ubuntu.
> - each interpreter has its own obmalloc state (a prerequisite for having separate GILs).

I was under the impression that was broken on 3.12, which is what #113412 fixed.

As far as I can tell (browsing on GitHub), before #113412, all interpreters had their own obmalloc state instance; sub-interpreters that were marked to share the main interpreter's state would just never use their own state data, except in cases like #116510 where the sub-interpreter's obmalloc would be unintentionally used (during interpreter cleanup triggering sub-interpreter cleanup, I suspect).
I suspect that the pivotal fix for the segfault is actually in get_state, as has_own_state has a behaviour of "always returning true during main-interpreter finalisation", which would return the wrong obmalloc instance during main-interpreter finalisation for sub-interpreters that were supposed to share the main interpreter's state. The simpler get_state can now unconditionally return the interpreter's obmalloc state pointer, and is unaffected by this odd behaviour in has_own_state.
The reason that does not affect this ticket's repro is that in our repro case, each sub-interpreter is already correctly returning true from has_own_state all the time, and hence get_state always returned the interpreter's private obmalloc instance, so #113412 would not have affected that; we never used the wrong obmalloc state instance in our repro case. That's why we never saw segfaults from this repro case.
I suspect the remaining uses of has_own_state are similarly at risk, but they are generally either in asserts or memory-usage counting, where an incorrect true will either fail to fire an assertion or over-count memory usage. Either way, I doubt anyone will ever notice it; without trying it, I suspect you cannot call the memory-accounting code (sys.getallocatedblocks()) that uses has_own_state during main-interpreter finalisation, and the caller in the main-interpreter finalisation code is called after all sub-interpreters are finalised.
> It might be a separate cause, but from what I can see, isolated subinterpreters are pretty broken on 3.12 -- debug builds possibly just make it easier to occur. I haven't read this entire thread; does this occur on debug builds for 3.13?

It occurs in debug builds for both 3.13 and 3.14 😔
> It occurs in debug builds for both 3.13 and 3.14 😔

Was this with your original repro? I couldn't trigger it in 3.13.0rc1, but that was before I understood the underlying failure, so it's possible that I did something wrong.

Since it's threading-related, how many cores does your test host have?
Late edit: Poking through the 3.13.0rc2 source locally, all calls to _PyMem_SetDefaultAllocator are either in code that asserts itself as pre-init or post-finalize (import related), in code that is config-handling related and hence only happens pre-init, or in deprecated APIs which would have been removed in #105154 except that they are needed for the Stable ABI. (Some calls are in cases that match more than one of those...) So I cannot at this time identify a 3.13 repro for the original repro case by inspection.

As such, I can trivially reproduce the issue in 3.13 if I add a call to Py_SetProgramName(L"") after Py_NewInterpreterFromConfig, i.e. in the threaded context. I don't think that's a high-urgency repro case, calling a nominally-removed API; and I can't at this time see a better repro case in 3.13.
Bug report
Bug description:
Hello. While working on a small joke program, I found a possible memory corruption issue (it could also be a threading issue?) when using the Python C API in a debug-only build to quickly create 463 sub-interpreters, execute Python code in them, and then destroy them. Before I post the code sample and the debug output: I'm using a somewhat unique build environment for a Windows developer.
When running the code sample I've attached at the bottom of this post, I am unable to get the exact same output each time, though the traceback does fire in the same location (Due to the size of the traceback I've not attached it, as it's about 10 MB of text for each thread). Additionally, I sometimes have to run the executable several times to get the error to occur. Lastly, release builds do not exhibit any thread crashes or issues as the debug assertions never fire or execute.
The error output seems to also halt in some cases, either because of stack failure or some other issue I was unable to determine and seemed to possibly be outside the scope of the issue presented here. I have the entire error output from one execution where I was able to save the output.
The output cut off after this, as the entire program crashed, taking my terminal with it 😅
You'll find the MRE code below. I've also added a minimal version of the CMakeLists.txt file I used so anyone can recreate the build with the code below (any warnings or additional settings I have do not affect whether the error occurs or not). The code appears to break inside of _PyObject_DebugDumpAddress, based on what debugging I was able to do with WinDbg.

main.cxx
CMakeLists.txt
Command to build + run
CPython versions tested on: 3.12

Operating systems tested on: Windows