python / cpython

The Python programming language

https://www.python.org

Other

63.88k stars 30.57k forks source link

Creating an ungodly amount of sub interpreters in a short amount of time causes memory debug assertions. #123134

Open bruxisma opened 3 months ago

bruxisma commented 3 months ago

Bug report

Bug description:

Hello. While working on a small joke program, I found a possible memory corruption issue (it could also be a threading issue?) when using the Python C API in a debug only build to quickly create, execute python code, and then destroy 463 sub interpreters. Before I post the code sample and the debug output I'm using a somewhat unique build environment for a Windows developer.

clang++ 18.1.7 for x86_64-pc-windows-msvc
Visual Studio Build Tools 2022 17.11.0
CMake 3.30
Python 3.12.5
ninja 1.12.1

When running the code sample I've attached at the bottom of this post, I am unable to get the exact same output each time, though the traceback does fire in the same location (Due to the size of the traceback I've not attached it, as it's about 10 MB of text for each thread). Additionally, I sometimes have to run the executable several times to get the error to occur. Lastly, release builds do not exhibit any thread crashes or issues as the debug assertions never fire or execute.

The error output seems to also halt in some cases, either because of stack failure or some other issue I was unable to determine and seemed to possibly be outside the scope of the issue presented here. I have the entire error output from one execution where I was able to save the output.

Debug memory block at address p=00000267272D6030: API 'p'
    16718206241729413120 bytes originally requested
    The 7 pad bytes at p-7 are not all FORBIDDENBYTE (0xfd):
        at p-7: 0xa0 *** OUCH
        at p-6: 0x00 *** OUCH
        at p-5: 0x00Hello, world! From Thread 31
Hello, world! From Thread 87
 *** OUCH
        at p-4: 0xfd
Hello, world! From Thread 43
Hello, world! From Thread 279
Generating thread state 314
        at p-3: 0xfd
        at p-2: 0xfd
Hello, world! From Thread 168
Generating thread state 315
        at p-1: 0xfd
    Because memory is corrupted at the start, the count of bytes requested
       may be bogus, and checking the trailing pad bytes may segfault.
Generating thread state 316
Generating thread state 317
    The 8 pad bytes at tail=E8030267272D6030 are

The output cut off after this, as the entire program crashed, taking my terminal with it 😅

You'll find the MRE code below. I've also added a minimal version of CMakeLists.txt file I used so anyone can recreate the build with the code below (Any warnings, or additional settings I have do not affect whether the error occurs or not). The code appears to breaks inside of _PyObject_DebugDumpAddress, based on what debugging I was able to do with WinDbg.

[!IMPORTANT] std::jthread calls .join() on destruction, so all threads auto-join once the std::vector goes out of scope.

Additionally this code exhibits the same behavior regardless of whether it is a thread_local or declared within the lambda passed to std::thread

main.cxx

#include <vector>
#include <thread>
#include <cstdlib>
#include <print>

#include <Python.h>

namespace {

static thread_local inline PyThreadState* state = nullptr;
static inline constexpr auto MAX_STATES = 463; 
static inline constexpr auto config = PyInterpreterConfig {
  .use_main_obmalloc = 0,
  .allow_fork = 0,
  .allow_exec = 0,
  .allow_threads = 0,
  .allow_daemon_threads = 0,
  .check_multi_interp_extensions = 1,
  .gil = PyInterpreterConfig_OWN_GIL,
};

} /* nameless namespace */

void execute () {
  std::vector<std::jthread> tasks { };
  tasks.reserve(MAX_STATES);
  for (auto count = 0zu; count < tasks.capacity(); count++) {
    std::println("Generating thread state {}", count);
    tasks.emplace_back([count] {
      if (auto status = Py_NewInterpreterFromConfig(&state, &config); PyStatus_IsError(status)) {
        std::println("Failed to initialize thread state {}", count);
        return;
      }
      auto text = std::format(R"(print("Hello, world! From Thread {}"))", count);
      auto globals = PyDict_New();
      auto code = Py_CompileString(text.data(), __FILE__, Py_eval_input);
      auto result = PyEval_EvalCode(code, globals, globals);
      Py_DecRef(result);
      Py_DecRef(code);
      Py_DecRef(globals);
      Py_EndInterpreter(state);
      state = nullptr;
    });
  }
}

int main() {
  PyConfig config {};
  PyConfig_InitIsolatedConfig(&config);
  if (auto status = Py_InitializeFromConfig(&config); PyStatus_IsError(status)) {
    std::println("Failed to initialize with isolated config: {}", status.err_msg);
    return EXIT_FAILURE;
  }
  PyConfig_Clear(&config);
  execute();
  Py_Finalize();
}

CMakeLists.txt

cmake_minimum_required(VERSION 3.30)
project(463-interpreters LANGUAGES C CXX)

find_package(Python 3.12 REQUIRED COMPONENTS Development.Embed)

set(CMAKE_MSVC_RUNTIME_LIBRARY "MultiThreaded$<$<CONFIG:Debug>:Debug>")

add_executable(${PROJECT_NAME})
target_sources(${PROJECT_NAME} PRIVATE main.cxx)
target_compile_features(${PROJECT_NAME} PRIVATE cxx_std_23)
target_precompile_headers(${PROJECT_NAME} PRIVATE <Python.h>)
target_link_libraries(${PROJECT_NAME} PRIVATE Python::Python)

Command to build + run

$ cmake -Bbuild -S. -G "Ninja"
$ cmake --build build && .\build\463-interpreters.exe

CPython versions tested on:

3.12

Operating systems tested on:

Windows

picnixz commented 3 months ago

Can you reproduce this behaviour on Linux? or MacOS? Or what happens if you sleep a bit before spawning the sub-interpreters? I'm not an expert in threads but I'd say that interpreters are spawning too fast for the OS to follow up =/ Can you check whether the same occurs with HEAD instead of 3.12.5? (maybe there are some bugfixes that are not yet available)

ZeroIntensity commented 3 months ago

At a glance, this could possibly be related to a memory allocation failure or something similar.

  auto globals = PyDict_New();
  auto code = Py_CompileString(text.data(), __FILE__, Py_eval_input);
  auto result = PyEval_EvalCode(code, globals, globals);

These lines need a NULL check if an error occurred, otherwise there will be a NULL pointer dereference in the call to Py_DecRef.

bruxisma commented 3 months ago

At a glance, this could possibly be related to a memory allocation failure or something similar.
  auto globals = PyDict_New();
  auto code = Py_CompileString(text.data(), __FILE__, Py_eval_input);
  auto result = PyEval_EvalCode(code, globals, globals);
These lines need a NULL check if an error occurred, otherwise there will be a NULL pointer dereference in the call to Py_DecRef.

Py_DecRef is the function version of Py_XDECREF. Though I can see the confusion between Py_DecRef and Py_DECREF. If there was a nullptr, we'd have a different set of errors, and the debug memory assertions wouldn't be complaining that all memory was not written with the FORBIDDENBYTE value.

bruxisma commented 3 months ago

Can you reproduce this behaviour on Linux? or MacOS? I will work on getting the Linux/macOS behaviors checked when I can. My WSL2 environment is woefully out of date 😅. macOS might be more difficult, as I'm unaware of how the _d version of the Python.framework might be linked against.

Or what happens if you sleep a bit before spawning the sub-interpreters?

Sleeping might fix it, but might also hide the issue. The goal here was just to run a bunch of interpreters, not sleep for the sake of working around possible synchronization issues

I'm not an expert in threads but I'd say that interpreters are spawning too fast for the OS to follow up =/

This is not how threads or operating systems work. There is a synchronization issue or an ABA problem with the debug memory assertions in the C API's internals

Can you check whether the same occurs with HEAD instead of 3.12.5? (maybe there are some bugfixes that are not yet available)

I will also work on giving this a go.

ZeroIntensity commented 3 months ago

Py_DecRef is the function version of Py_XDECREF

Oh! TIL. I still would suggest adding a NULL check -- an error indicator could be set, which would break subsequent calls to PyErr_Occurred.

bruxisma commented 3 months ago

The error is outside of the Python C API as one might usually interact with it. i.e., only on subinterpreter initialization (i.e., creating a PyThreadState) or on subinterpreter destruction (Py_EndInterpreter) does the error occur (typically the latter). PyErr_Occurred is not being called anywhere in the MRE, nor in the actual traceback for the failure.

ZeroIntensity commented 3 months ago

Right, but it could cause memory corruption somewhere, and cause the error somewhere else down the line. Regardless, I wasn't able to reproduce this on Linux, so this might be Windows-specific.

bruxisma commented 3 months ago

@picnixz I can confirm the behavior still exists on Windows on HEAD. Same errors can occur

Right, but it could cause memory corruption somewhere, and cause the error somewhere else down the line. Regardless, I wasn't able to reproduce this on Linux, so this might be Windows-specific.

PyErr_Occurred only performs a read on the current thread state object. If an error is present when it shouldn't be that still points to a memory corruption error occurring between the destruction and initialization of a PyThreadState.

bruxisma commented 3 months ago

Pleased to back up @ZeroIntensity and say that this appears to be a Windows specific issue. I was unable to replicate this memory assertion error on Linux.

However, when executing this code under TSan, I got the following data race error that seems to indicate the same kind of memory corruption COULD be taking place.

WARNING: ThreadSanitizer: data race (pid=11259)
  Write of size 8 at 0x7f24e35e16c0 by thread T10:
    #0 qsort_r <null> (463-interpreters+0x9bc2e) (BuildId: 4e94833395fedac62b3c4caa6921ff713e25cff1)
    #1 qsort <null> (463-interpreters+0x9bec7) (BuildId: 4e94833395fedac62b3c4caa6921ff713e25cff1)
    #2 setup_confname_table /usr/src/python3.12-3.12.3-1ubuntu0.1/build-shared/../Modules/posixmodule.c:13543:5 (libpython3.12.so.1.0+0x3b6946) (BuildId: e2e0fbc52bf8cb7daf99595e16454e44217040f4)
    #3 setup_confname_tables /usr/src/python3.12-3.12.3-1ubuntu0.1/build-shared/../Modules/posixmodule.c:13572:9 (libpython3.12.so.1.0+0x3b6946)
    #4 posixmodule_exec /usr/src/python3.12-3.12.3-1ubuntu0.1/build-shared/../Modules/posixmodule.c:16905:9 (libpython3.12.so.1.0+0x3b6946)
    ...
  Previous write of size 8 at 0x7f24e35e16c0 by thread T18:
    #0 qsort_r <null> (463-interpreters+0x9bc2e) (BuildId: 4e94833395fedac62b3c4caa6921ff713e25cff1)
    #1 qsort <null> (463-interpreters+0x9bec7) (BuildId: 4e94833395fedac62b3c4caa6921ff713e25cff1)
    #2 setup_confname_table /usr/src/python3.12-3.12.3-1ubuntu0.1/build-shared/../Modules/posixmodule.c:13543:5 (libpython3.12.so.1.0+0x3b6946) (BuildId: e2e0fbc52bf8cb7daf99595e16454e44217040f4)
    #3 setup_confname_tables /usr/src/python3.12-3.12.3-1ubuntu0.1/build-shared/../Modules/posixmodule.c:13572:9 (libpython3.12.so.1.0+0x3b6946)
    #4 posixmodule_exec /usr/src/python3.12-3.12.3-1ubuntu0.1/build-shared/../Modules/posixmodule.c:16905:9 (libpython3.12.so.1.0+0x3b6946)

  Location is global '??' at 0x7f24e2e85000 (libpython3.12.so.1.0+0x75c880)

This only happened 3 times, and around the 42 thread generation step on average.

TBBle commented 3 months ago

However, when executing this code under TSan, I got the following data race error that seems to indicate the same kind of memory corruption COULD be taking place.

A quick look suggests this TSan failure is a local data-race to posixmodule.c: There appears to be no protection from multiple threads importing posixmodule simultaneously, and hence multiple threads trying to simultaneously qsort in-place one of the three static tables.

This module is used on Windows, but I'm not sure if any of the macros enabling those qsort calls (HAVE_FPATHCONF, HAVE_PATHCONF, HAVE_CONFSTR, or HAVE_SYSCONF) are enabled in the Windows build. My guess would be no...

I don't know if overlapping qsorts on the same array will actually corrupt memory, but I doubt there's any guarantee it won't. (qsort_r doesn't help, that makes the comparison function thread-safe, not the array being sorted, according to the manpage.)

So probably a separate issue, and a more-focused/reliable MRE could be made for it.

ZeroIntensity commented 3 months ago

To think of it, I'm not even sure subinterpreters are thread safe in the first place, you might need to acquire the GIL. Though, I'm not all too familiar with them.

cc @ericsnowcurrently for insight

TBBle commented 3 months ago

I think the point of PEP-684 (in Python 3.12) is that you can have a per-subinterpreter GIL, as is being done in this test-case. posixmodule.c claims compatibility with the per-interpreter GIL, so in that case I think it's a straight-up bug.

The check_multi_interp_extensions flag has also been set, so any modules known to not work with per-subinterpreter GIL should refuse to import before running any init code anyway.

Edit: Oops, perhaps I misunderstood your comment, and it wasn't directed as the posixmodule issue.

Indeed, calling Py_NewInterpreterFromConfig documents that the GIL must be held for that call, and the own-GIL case does interesting things with the GIL-holding state of both new and current interpreter.

Like all other Python/C API functions, the global interpreter lock must be held before calling this function and is still held when it returns. Likewise a current thread state must be set on entry. On success, the returned thread state will be set as current. If the sub-interpreter is created with its own GIL then the GIL of the calling interpreter will be released. When the function returns, the new interpreter’s GIL will be held by the current thread and the previously interpreter’s GIL will remain released here.

So I'm not clear if the given example is doing the right thing here or not, but I suspect the first thing the tasks should do is take the main interpreter's GIL if not already holding it. (And the call to Py_Finalize should do that too, in that case). Poking around; because it's a fresh thread, maybe it doesn't need the main GIL to call Py_NewInterpreterFromConfig, and so this code is GIL-safe as is. I'm not clear if the new per-subinterpreter-GIL support changed things around here, and if so, how it changed.

ZeroIntensity commented 3 months ago

I'm guessing that's the problem Judging by the comment, it looks like new_interpreter doesn't support being called without the GIL: https://github.com/python/cpython/blob/bffed80230f2617de2ee02bd4bdded1024234dab/Python/pylifecycle.c#L2261-L2263

I could see this being a bug, though. I'll wait for Eric to clarify.

bruxisma commented 3 months ago

Well this definitely brings me some concern then, as the GIL can be made optional in 3.13 😅

ZeroIntensity commented 3 months ago

Does this occur on the free threaded build?

bruxisma commented 3 months ago

Sorry for the delay. Had a busy work week. I'll gather this info this weekend, though the code seems to indicate "yes" from a glance.

bruxisma commented 3 months ago

@ZeroIntensity I can confirm the issue still happens when the GIL is disabled in 3.13 and 3.14 HEAD

ZeroIntensity commented 3 months ago

For now, I think I'm going to write this off as user-error, as Py_NewInterpreterFromConfig (and friends) doesn't seem to support being called without the GIL. Though, it's a bit confusing as to why this only fails on Windows. Does it still occur if you call PyGILState_Ensure (and subsequently PyGILState_Release) before initializing the subinterpreter?

bruxisma commented 3 months ago

How can those work if the GIL is disabled (and thus can't be engaged) on 3.13 and later, and also each new interpreter is having its own GIL created?

You cannot call PyGILState_Release on these threads because when you call Py_NewInterpreterFromConfig it sets the new interpreter as the current one for said thread. This would require swapping to the old state to unlock it and then swapping back to the new state.

ZeroIntensity commented 3 months ago

How can those work if the GIL is disabled (and thus can't be engaged) on 3.13 and later, and also each new interpreter is having its own GIL created?

AFAIK, PyGILState_* calls don't actually acquire the GIL on free-threaded builds, they just initialize the thread state structure to allow calling the API.

At a glance, the drop-in code would be:

void execute () {
  std::vector<std::jthread> tasks { };
  tasks.reserve(MAX_STATES);
  for (auto count = 0zu; count < tasks.capacity(); count++) {
    std::println("Generating thread state {}", count);
    tasks.emplace_back([count] {
      PyGILState_STATE gil = PyGILState_Ensure(); // Make sure we can call the API
      if (auto status = Py_NewInterpreterFromConfig(&state, &config); PyStatus_IsError(status)) {
        std::println("Failed to initialize thread state {}", count);
        PyGILState_Release(gil);
        return;
      }
      auto text = std::format(R"(print("Hello, world! From Thread {}"))", count);
      auto globals = PyDict_New();
      auto code = Py_CompileString(text.data(), __FILE__, Py_eval_input);
      auto result = PyEval_EvalCode(code, globals, globals);
      Py_DecRef(result);
      Py_DecRef(code);
      Py_DecRef(globals);
      Py_EndInterpreter(state);
      state = nullptr;
      PyGILState_Release(gil); // Release the GIL (only on non-free-threaded builds)
    });
  }
}

There needs to be a Py_BEGIN_ALLOW_THREADS in there, though. Normally, that would happen over the calls to join the threads, but I'm assuming that happens in a destructor here -- I'm not too familiar with C++ threading, you'll have to figure out where to put that.

bruxisma commented 3 months ago

It shouldn't require any of that. The interpreter variable is local to the thread. The Py_BEGIN_ALLOW_THREADS macro is required when manipulating the GIL. There is no GIL manipulation needed because each of these are isolated threads running on their own, and even when the GIL is disabled there shouldn't be any manipulation of the GIL occurring regardless because it's completely turned off

The need for PyGILState_STATE goes away when using isolated subinterpreters. The state is only needed if you're executing these subinterpreters to talk to each other. But these interpreters are running entirely on their own with no interaction between each other. Were I trying to register additional threads under a specific interpreter instance, then your code would be warranted. However, the documentation for these functions clearly says

Note that the PyGILState_* functions assume there is only one global interpreter (created automatically by Py_Initialize()). Python supports the creation of additional interpreters (using Py_NewInterpreter()), but *mixing multiple interpreters and the PyGILState_ API is unsupported.

(emphasis mine)

So, either the documentation is wrong, or Python's C API is wrong. Either way, there is a bug here.

ZeroIntensity commented 3 months ago

It shouldn't require any of that. The interpreter variable is local to the thread.

Yes, but how do you initialize that interpreter variable? I don't think new_interpreter does it (unless I'm missing something -- please clarify!)

I think you're misinterpreting what I'm saying: you're right, you don't need to mess with the GIL when using the subinterpreter, but it seems (based on the code from new_interpreter) that you need to hold it in order to create a subinterpreter. But that could be wrong, I'm just going off a comment. I'm still not sure why it only happens after a certain number of threads -- it should happen on the first thread, and it certainly shouldn't be Windows-specific.

bruxisma commented 3 months ago

yes it is initialized. that's what auto status = Py_NewInterpreterFromConfig(&state, &config); does. If it didn't work status would be an error.

I'm very certain this is to do with memory corruption and memory blocks within the debug assertions not being locked or cleared out properly which is leading to this invalid state occurring. Why it only happens on Windows most likely has to do with behavior regarding the MSVC runtime in some capacity, and some behavioral difference with windows memory pages and the virtual allocator versus posix platforms surfacing.

That's the best I've been able to intuit, as the crash is indeterminate in nature, though as I stated in an earlier comment ThreadSanitizer found some race conditions brought on by qsort on Linux.

ZeroIntensity commented 3 months ago

If it didn't work status would be an error.

It would probably segfault instead, FWIW -- C API calls don't have an extra check for the interpreter or thread state. I don't see where in new_interpreter it initializes the thread state information, so I would think that the failure would be in _PyThreadState_GET. However, that doesn't seem like that's the case, somehow.

Speculatively, the race could be caused by here https://github.com/python/cpython/blob/bffed80230f2617de2ee02bd4bdded1024234dab/Python/pylifecycle.c#L2251

Trying to set this across multiple threads could be problematic, since there's no lock.

bruxisma commented 3 months ago

If any of those failures were happening there, this same behavior would be observable in a release build. This situation only happens when the _d python library is used.

ZeroIntensity commented 3 months ago

The whole issue overall screams data race to me, it's quite odd that this only occurs on Windows debug builds. I would guess that this is present in release builds as well, just maybe not with this reproducer. Regardless, I'm like 70% sure that a lock should be acquired when writing to the runtime state, through HEAD_LOCK. It might not be the cause of this issue, but that looks like a bug to me nonetheless. Does adding HEAD_LOCK and HEAD_UNLOCK to new_intepreter fix the problem?

TBBle commented 3 months ago

Having a quick poke around (in 3.13 on GitHub): new_interpreter initialises the PyThreadState** passed in, or nulls it on failure. It appears fully-prepared to handle a null current-thread state and "new interpreter has its own GIL" state (although the comments for this logic appear to still assume that this only happens in the first-time "main interpreter" case) and HEAD_LOCK is used around the logic that adds the new interpreter to the global list of interpreters. I don't think that single write to disable GIL checking needs a lock, if this was going wrong somehow,

Basically, the things that PyGILState_Ensure does to make a thread Python-enabled are also done by new_interpreter. There's a call to bind_gilstate_tstate in both paths which stores the tstate into the TLS variable which the PyGILState_* API relies upon.

So I don't think the original code breaks the actual preconditions, at least that I noticed in browsing, and hence the docs that suggest a GIL must exist for calls to Py_NewInterpreterFromConfig probably need to be clarified to note that this only applies to Python-enabled threads, and that on a "Non-Python created" thread, you don't need a current thread state, and actually can't have one, as the non-existence of a thread state is what defines a "non-Python created" thread.

The comment "but mixing multiple interpreters and the PyGILState_* API is unsupported." is also a "here be dragons" coarse version of the clearer comment in the per-interpreter GIL docs:

Also note that combining this functionality with PyGILState_* APIs is delicate, because these APIs assume a bijection between Python thread states and OS-level threads, an assumption broken by the presence of sub-interpreters. It is highly recommended that you don’t switch sub-interpreters between a pair of matching PyGILState_Ensure() and PyGILState_Release() calls. Furthermore, extensions (such as ctypes) using these APIs to allow calling of Python code from non-Python created threads will probably be broken when using sub-interpreters.

The converse of that last paragraph is that PyGILState_* should work fine when you are tying sub-interpreters to OS-level threads, as seen in this example. So we're not even in conflict with the docs, there's just some parts of the documentation that don't explicitly take account of the "one-(sub)interpreter-per-thread" case.

https://github.com/python/cpython/issues/59956#issuecomment-1749152745 seems to agree with the above, as it notes that PyGILState_Ensure() should no longer crash or do the wrong thing with sub-interpreters, but does not yet support controlling which interpreter is used when it creates a new thread state, that always uses the main interpreter. That wouldn't be a problem in the structure used here, as Py_NewInterpreterFromConfig is creating the new thread state, and then any further theoretical PyGILState_* calls are safe and will use that interpreter, from the current threadstate which points to it. So I don't expect problems from extension modules manipulating the GIL in this structure either, unless they are spinning up their own threads and calling PyGILState_* APIs. (That's #59956's decade-old unresolved use-case, so it's not new with any of the recent changes.)

That said, adding PyGILState_Ensure() before Py_NewInterpreterFromConfig would not be incorrect (just wasteful...), and if that fixed things, would help point towards the source of the issue. (Conversely, if it doesn't help, that probably rules out a class of issues). Py_NewInterpreterFromConfig is documented as releasing the GIL held when it was entered so no matching PyGILState_Release() or Py_<VERB>_ALLOW_THREADS calls should be needed. I couldn't immediately see where this was done from new_interpreter though, so I'm trusting the docs on that one.

ZeroIntensity commented 3 months ago

After some debugging, it seems I was wrong! HEAD_LOCK and PyGILState_Ensure don't seem to fix the data race on Linux.

TBBle commented 3 months ago

Looking at the memory dump data from p=00000267272D6030, the pre-buffer bytes appear to be E8 03 00 00 00 00 00 00 70 a0 00 00 fd fd fd fd. That should be a big-endian size_t "bytes-allocated", then one of the chars 'r', 'm', or 'o', followed by seven 0xfd.

Instead it looks like we got overwritten with a 32-bit little-endian 0x00000000000003e8 (1000) followed by 16-bit 0x0000a70 (2672). (That first one could also be 16-bit 1000 followed by 16-bit 0, no way to tell. We don't see the preceding bytes either. They don't look like ASCII to me, anyway.) The 1000 is nice and round, but neither one jumps out with any idea of what might be overwriting that data.

So no obvious clue as to whether this only happens in the debug build, or if it's only detected in the debug build. In release builds there's no bookkeeping in live blocks or release-time validity checking, so if the overwrite is happening in non-debug builds we may never see evidence for it. The first thing in a PyObject is the ssize_t ref count, so in release-time builds, an overwrite like this would prevent the overwritten object from being deallocated, but not otherwise mess with anything off-hand. (GIL-disabled builds look like they would be more-damaged by overwriting the first uintptr_t of the object, but it may not be visible depending on the overwrite and the structure of that uintptr_t.)

Since TSAN isn't flagging this, I suspect these are overlapping objects from the same thread: an object is either being allocated from the wrong pool, or is writing past its end-bounds and hitting the next object in the pool. (Perhaps the damaged object is raw-alloc'd? I don't know if I'd expect it to be 16-bit aligned, but maybe I'm wrong to assume a malloc implementation would generally provide size_t-aligned blocks. The object allocator does return 16-bit aligned blocks though.)

Anyway, I expect if you can catch the failure in a debugger, you could walk backwards a bit in memory and find the actual object causing the problem, if that is a PyObject, and it's not itself corrupted. Since this appears to be being triggered by the sub-interpreters, I wonder if it's something in the "non-main obmalloc" support, e.g. one of the on-heap obmalloc support structures, that in the main obmalloc live inside a static object.

The preceding object probably isn't a PyObject, since that should end with a size_t of 0xfd and we can only see 4 0xfd in the dump (if the two paddings overlapped). Since the corrupted object is being deallocated, its ref-count wasn't overwritten with another four 0xfd bytes. (Or both objects are live and are busy overwriting each other's padding in turn in a nightmare of undebuggability... indebuggability... one of those must be a real word? Anyway, I think this shouldn't happen in the same thread, so barring more evidence, not worrying too much about that case.)

Which leans me harder towards "one of the obmalloc tracking structures on the heap blows out its bounds and corrupts the first-allocated PyObject from the first-created pool, which immediately follows it". That points the finger at pool_header, I think, and it helpfully ends with three uints. But I have only skimmed this code, and those are always on the heap anyway, which fails an earlier assumption I made to get this this point. And the actual values observed don't really fit those last three uints anyway. And 0x00000267272D6030 (minus two size_t's) doesn't seem right for the first block in the pool, which is a 16kiB-aligned block with 4 pointers and then 4 uints in the header structure. So for this to be the result, means there's something badly wrong with obmalloc already, and this has just revealed it.

So maybe a dead-end after all.

Is it possible to reproduce this failure with use_main_obmalloc set to true, in the freethreaded build? Or is the "Own GIL requires non-main obmalloc" rule still enforced in that case?

TBBle commented 3 months ago

After some debugging, it seems I was wrong! HEAD_LOCK and PyGILState_Ensure don't seem to fix the data race on Linux.

Wait, do you have a repro-case on Linux? Or were you looking at the TSan failure? (If the latter, put a critical section of some kind around the qsort calls and it should be fixed. Not the best fix, but at least proves or disproves the point.)

TBBle commented 3 months ago

I reproduced using the MSVC++ 17.11 Community edition compiler (under VS Code) and the clearest failure is

HEAP[463-interpreters.exe]: Invalid address specified to RtlValidateHeap( 000001F0B27D0000, 000001F0BA20AE30 )

from a PyObject_Free call, so far I've noticed it inside dict resize and list resize operations during initial "site" import during new_interpreter. The first parameter is the heap address, the second is the address being checked. I am not sure, but I believe it's failing because that address was not allocated from the heap in the first place. (The pointer passed to free is actually 0x30 later than the above address, so I assume that's MSVC heap-management metadata before the allocation; if this wasn't a heap allocation, then it's nonsense.)

Anyway, what I guess is going on is that the object is being free'd in the wrong thread, so when _PyObjectFree asks pymalloc_free to free the object, the latter comes back with "not my allocation", and so it's passed to PyMem_RawFree, which if not stopped by RtlValidateHeap as seen here, will presumably end up back on the RtlHeap free-list and (in debug builds) have its management data blatted and possibly be given back out later for someone else to stomp on as a new heap allocation. (That blatting is why I suspect this only reproduces in debug builds, as this program may not run long enough to get that block back as a new allocation.)

The free'd object is generally on-the-way-out, in the list and dict resize operations it was the "oldXXX" pointer.

I expect that the reason RtlValidateHeap fails is that this address is inside a heap allocation being used for an arena, not the start of the allocation. This repro seems happen to occur reliably with only 50 threads, and I saw it with 20 threads but not reliably, and was unable to reproduce it with 15 threads on my 16-core NUC, but I didn't try really hard. Sometimes it misses the RtlValidateHeap but fails _PyMem_DebugCheckAddress, suggesting that RtlValidateHeap isn't perfect.

I just now had a RtlValidateHeap failure happen in PyThread_free_lock twice, which is actually supposed to be a direct raw free, not a pool object being accidentally passed to _PyMem_RawFree, so I'm not sure why that failed. Even if that was freed from the wrong thread or wrong interpreter, it shouldn't have been an issue.

Edit: Thinking about this, I realised (and tested) that the repro does not require any calls between Py_NewInterpreterFromConfig and Py_EndInterpreter.

Edit again (last one tonight):

I think the following is the most-isolated example of a failure case:

marshal.c:1830 (Python 3.12)

    if ((rf.refs = PyList_New(0)) == NULL)
        return NULL;
    result = read_object(&rf);
    Py_DECREF(rf.refs);

The file itself is 44kb, in case it turns out to be something about a particular file. Although I hit it again with a smaller buffer (and smaller object being freed) so probably not.

Inside a block of C code, we create a new list, stash it somewhere not visible from anywhere else, pass it into a function, and then free it.

During the free, I'm seeing the list had 465 items in it, and the failure came when decrefing item 128 (going from the back). I don't know what type that item had, as its data was already blatted by the raw-free logic after passing _PyMem_DebugCheckAddress (so it wasn't itself blatantly corrupted), however I could see that it used PyObject_Free in its type's tp_free member, and it was an 1183-byte object, so I assume it's a var-size object that went through a realloc itself somewhere, but is not itself a container of PyObjects. (Maybe a string or buffer? It's importing from codecs or io-related modules in the marshal.c repro case instance AFAICT.)

Given the size, it would have not been in the pool allocator but should have fallen through to heap allocation, so I'm now pretty doubtful about my earlier "freeing in wrong obmalloc" theory, and unsure what exactly RtlValidateHeap is complaining about.

That said, since this is an unmarshall, I believe it can reference things that already exist, so maybe it's making itself visible to another thread by this mechanism?

ZeroIntensity commented 3 months ago

I'm wondering if the Linux and Windows parts are separate issues. The data race occurs on just two threads and seems to happen in Py_NewInterpreterFromConfig, not a deallocator.

TBBle commented 3 months ago

@ZeroIntensity If you're talking about the TSan failure seen in posixmodule on Linux, then yes, that's a separate issue revealed by the same code. It's trivially a static array being passed to qsort from multiple threads at once, with no synchronisation. The fix used in similar other cases (search for qsort in the bug tracker) was to make the lists correctly-sorted in the source code rather than sorting during init.

ZeroIntensity commented 3 months ago

I'll create a separate issue later today then, thanks!

TBBle commented 3 months ago

Okay, I have a more-isolated example which points to maybe there just being a nasty bug in the debug allocator support somewhere.

I just got the RtlValidateHeap failure in _extensions_cache_get on the PyMem_RawFree(key); call:

static PyModuleDef *
_extensions_cache_get(PyObject *filename, PyObject *name)
{
    PyModuleDef *def = NULL;
    void *key = NULL;
    extensions_lock_acquire();

    if (EXTENSIONS.hashtable == NULL) {
        goto finally;
    }

    key = hashtable_key_from_2_strings(filename, name, HTSEP);
    if (key == NULL) {
        goto finally;
    }
    _Py_hashtable_entry_t *entry = _Py_hashtable_get_entry(
            EXTENSIONS.hashtable, key);
    if (entry == NULL) {
        goto finally;
    }
    def = (PyModuleDef *)entry->value;

finally:
    extensions_lock_release();
    if (key != NULL) {
        PyMem_RawFree(key);
    }
    return def;
}

That key value comes from hashtable_key_from_2_strings which allocated the key with char *key = PyMem_RawMalloc(size);. Everything except the PyMem_RawFree call is inside the lock, and the binary data at the address specified by key is as below: the _codecs:_codecs line is the data at key, the other lines show the surrounding hex data:

28 00 00 00 00 00 00 00 9d 6e 00 00 fd fd fd fd
00 00 00 00 00 00 00 10 72 fd fd fd fd fd fd fd
_codecs:_codecs
fd fd fd fd fd fd fd fd fd fd fd fd 2e 00 2a 00

Everything except the PyMem_RawFree happened inside the extension lock, which is

PyThread_acquire_lock(_PyRuntime.imports.extensions.mutex, WAIT_LOCK);

and

PyThread_release_lock(_PyRuntime.imports.extensions.mutex);

i.e., it's global, not per-interpreter.

Oh, erk. Looking at the stack trace, it ends with:

<crt stuff...>
python312_d.dll!_PyMem_RawFree(void * _unused_ctx, void * ptr) Line 73 (c:\Users\paulh\OneDrive\Documents\Python\cpython\Objects\obmalloc.c:73)
python312_d.dll!PyMem_RawFree(void * ptr) Line 685 (c:\Users\paulh\OneDrive\Documents\Python\cpython\Objects\obmalloc.c:685)
python312_d.dll!_extensions_cache_get(_object * filename, _object * name) Line 988 (c:\Users\paulh\OneDrive\Documents\Python\cpython\Python\import.c:988)

but that's wrong. PyMem_RawFree is supposed to call _PyMem_DebugRawFree, not _PyMem_RawFree. _PyMem_RawFree is literally just free, but _PyMem_DebugRawFree first walks backwards two words, calls _PyMem_DebugCheckAddress and passes that new address to PyMem_RawFree.

And we can see in the memory dump that the _PyMem_DebugMalloc tracking data is correctly present (prefix big-ending size_t 0x10, then r, then 3 0xfd; suffix is 8 0xfd's). You can then see the MSVC debug heap guards around that (4 0xfd's either side, and I assume that 0x28 at the start is the MSVC-visible allocation size: 0x10 for the Python data plus 3 size_t's for the _PyMem_DebugMalloc tracking data)

So in short, I think the underlying issue is that something is causing PyMem_RawFree to call _PyMem_RawFree instead of _PyMem_DebugRawFree and that's corrupting the heap.

#define _PyMem_Raw (_PyRuntime.allocators.standard.raw)
//...
void PyMem_RawFree(void *ptr)
{
    _PyMem_Raw.free(_PyMem_Raw.ctx, ptr);
}

I confirmed with a watch in VSCode that _PyRuntime.allocators.standard.raw.free is 0x00007fff01e32ca7 {python312_d.dll!_PyMem_DebugRawFree}, so the right version of _pymem_allocators_standard_INIT appears to be being executed (and clearly at allocation time we got the right allocator, but I'm wondering if a bug in another thread temporarily installed the non-debug allocator.

A quick test shows alloc_for_runtime does indeed install the non-debug allocators temporarily but that's before our threads exist. Same for _clear_preinit_entries, and _Py_SetArgcArgv.

I wasn't able to catch anything changing allocators during runtime, but I did reproduce the other version of the failure, when _PyMem_DebugRawFree sees bad data. And sure enough, right before the pointer being free'd, there's just four 0xfd, which is the MSVC guard-rail, meaning this block was allocated by _PyMem_RawMalloc (i.e. direct call to malloc), and not by PyMem_DebugRawMalloc, which would have put two words of tracking data immediately before the pointer.

So yeah, I reckon that's the issue, and my best guess is that something in new_interpreter is temporarily switching allocators.

(This might have been easier to spot earlier if Python and MSVC weren't using the same byte value (0xfd) for their guard value: What I earlier thought was an overwrite of the Python memory tracking data was actually the MSVC tracking data: LE size_t of the allocation length, 16-bits request number, and four 0xfd's).

tl;dr: It seems that something called under new_interpreter is installing the (global) non-debug allocators temporarily, while other threads are running, and this causes some allocations to be allocated or free'd with the non-debug allocator, mismatching with the other operation in the debug allocator.

So this only affects debug builds (although using a custom allocator would also reproduce this in release builds, I guess), but should also happen on Linux AFAICT, but would need a debug allocator like MSVC's to actually catch it. (There might be a confounding factor, like the faulty code is Windows-only, I haven't tracked that down.)

Edit: Got it. _Py_ClearStandardStreamEncoding called from new_interpreter -> init_interp_main -> init_sys_streams is the culprit in 3.12. However, that was removed in 3.13 by #105154, so I guess something else must be calling _PyMem_SetDefaultAllocator inside new_interpreter to reproduce this in 3.13 onwards.

Last edit for the night: I had a quick go with 3.13.0rc1, and the same code and config hangs because the import machinery now tries to switch to the main thread in import_run_extension, and everyone's waiting for the GIL. The main thread is of course waiting in std::thread::join in the vector destructor.

Making this change in main() fixes the hang:

    Py_BEGIN_ALLOW_THREADS;
    execute();
    Py_END_ALLOW_THREADS;

And now it doesn't seem to reproduce.

Since the problematic code was removed in 3.13, I suspect the 3.13/HEAD repro case seen in https://github.com/python/cpython/issues/123134#issuecomment-2308993008 is only when the GIL is disabled, and is either some GIL-disabled-only codepath triggering the same allocator/mismatch in other threads, or some other GIL-disabled bug with the same/similar symptom.

I tried adding set(Python_FIND_ABI "ANY" "ANY" "ANY" "ON") before the find_package call in CMake and target_compile_definitions(${PROJECT_NAME} PRIVATE Py_GIL_DISABLED=1) at the end of the CMakeLists.txt, and now all the threads have hung in _PyEval_EnableGILTransient, the call before import_run_extension which is only in the non-GIL builds, waiting on a semaphore, so I think I'd need to better-understand how GIL-disabled works before I can track down the 3.13 repro case mentioned above.

This is what my repro-attempt for 3.13 looks like now

```cmake cmake_minimum_required(VERSION 3.30) project(463-interpreters LANGUAGES C CXX) set(Python_FIND_ABI "ANY" "ANY" "ANY" "ON") find_package(Python 3.13 REQUIRED COMPONENTS Development.Embed) set(CMAKE_MSVC_RUNTIME_LIBRARY "MultiThreaded$<$:Debug>") add_executable(${PROJECT_NAME}) target_sources(${PROJECT_NAME} PRIVATE main.cxx) target_compile_features(${PROJECT_NAME} PRIVATE cxx_std_23) target_precompile_headers(${PROJECT_NAME} PRIVATE ) target_link_libraries(${PROJECT_NAME} PRIVATE Python::Python) # find_package ought to do this when free-threaded support is requested. target_compile_definitions(${PROJECT_NAME} PRIVATE Py_GIL_DISABLED=1) ``` ```c++ #include #include #include #include #include // https://developercommunity.visualstudio.com/t/Please-implement-P0330R8-Literal-Suffix/10410860#T-N10539586 static constexpr size_t operator""_uz(unsigned long long n) { return size_t(n); } namespace { static thread_local inline PyThreadState* state = nullptr; static inline constexpr auto MAX_STATES = 463; static inline constexpr auto config = PyInterpreterConfig { .use_main_obmalloc = 0, .allow_fork = 0, .allow_exec = 0, .allow_threads = 0, .allow_daemon_threads = 0, .check_multi_interp_extensions = 1, .gil = PyInterpreterConfig_OWN_GIL, }; } /* nameless namespace */ void execute() { std::vector tasks {}; tasks.reserve(MAX_STATES); for (auto count = 0_uz; count < tasks.capacity(); count++) { std::println("Generating thread state {}", count); tasks.emplace_back( [count] { if (auto status = Py_NewInterpreterFromConfig(&state, &config); PyStatus_IsError(status)) { std::println("Failed to initialize thread state {}", count); return; } auto text = std::format(R"(print("Hello, world! From Thread {}"))", count); auto globals = PyDict_New(); auto code = Py_CompileString(text.data(), __FILE__, Py_eval_input); auto result = PyEval_EvalCode(code, globals, globals); Py_DecRef(result); Py_DecRef(code); Py_DecRef(globals); Py_EndInterpreter(state); state = nullptr; }); } } int main() { PyConfig config {}; PyConfig_InitIsolatedConfig(&config); if (auto status = Py_InitializeFromConfig(&config); PyStatus_IsError(status)) { std::println("Failed to initialize with isolated config: {}", status.err_msg); return EXIT_FAILURE; } PyConfig_Clear(&config); Py_BEGIN_ALLOW_THREADS; execute(); Py_END_ALLOW_THREADS; Py_Finalize(); } ```

ZeroIntensity commented 3 months ago

Unrelated, but I cannot reproduce the data race in qsort under TSan. Instead, TSan reports a segfault, but doesn't say where -- running under a debugger says the segfault is somewhere inside TSan itself, so I suspect there's something fishy happening on that front.

TBBle commented 3 months ago

I may actually just have been too impatient last night with the free-threading test.

If I wait, it seems to eventually complete. There's a bit of a bottleneck in importing: create_builtin->_PyEval_EnableGILTransient->_PyEval_StopTheWorldAll needs to block every thread for every builtin module import in any thread, and this seems to scale badly as more threads complete their initialisation, but does eventually complete AFAICT. (Debug MSVC concurrency primitives are probably not helping performance here either...)

This world-stoppage has the effect of serialising all the new_interpreter calls immediately after their GIL is created, even with a private GIL, since their first action under the new GIL is to import _imp via pycore_interp_init->_PyImport_InitCore which eventually hits create_builtin as above.

There's a note in _PyEval_EnableGILTransient (only using by the import mechanism):

    // This could be an interpreter-local stop-the-world in situations where we
    // know that this interpreter's GIL is not shared, and that it won't become
    // shared before the stop-the-world begins. For now, we always stop all
    // interpreters for simplicity.
    _PyEval_StopTheWorldAll(&_PyRuntime);

so maybe this will be improved in future, since at least during new_interpreter, I think the above conditions hold?

In the real world, this would be a thread-startup cost, and new interpreters already have to call _PyEval_StopTheWorldAll at startup and finalisation to manage the "list of all threads", so it's not awful.

So at this point, I think this issue's original bug is actually "resolved-by-accident in 3.13", pending a new repro case. I'm not sure if it's interesting to fix in 3.12; it only affects debug builds but it is a potential memory-corruption issue. Since the functions were removed in 3.13 and previously deprecated, simply gutting them in 3.12 and complaining loudly if they are called would be viable, perhaps.

The qsort issue should be reproducible in 3.13 but it's quite possible that it has no actual effect despite being detectible by TSan, beyond wasting time with repeated qsort calls on already-sorted data. It depends on how well or badly qsort handles being run on the same array in multiple threads.

bruxisma commented 3 months ago

@TBBle Small note btw, per your comment and the find_package support for Py_GIL_DISABLED=1, but CMake 3.30.3 was released in the last few days and has this as a fix now. 🫡

ZeroIntensity commented 2 months ago

123728 is a documentation fix for the usage of `PyGILState*` in isolated subinterpreters. @TBBle, it might be useful to create a new issue for the allocator problem to get some fresh eyes on it. It's buried in the comments of this issue right now :(

picrin commented 2 months ago

@bruxisma, could you try to run the following code on your windows machine?

I think protecting the creation of a new subinterpreter with an ordinary posix mutex (or whatever C++ code uses, I don't know C++) should work to resolve this. If it does indeed work, it's perhaps worth documenting, and it would be a small ask of the users of the API (unlike for example asking them to hold the global GIL and then after the subinterpreter swaps it in for its own GIL somehow swapping it back to the original GIL to release -- it's a bit of a mess).

I would be curious to know what the results of the test are if you uncomment the mutex locking/unlocking.

A more interesting test case (I think), which I included in the test case below is concurrently hammering each subinterpreter with a lot of small C calls from lots of threads to check if there really is a good isolation of subinterpreters after the creation.

I'm on MacOS and the code works either with or without the mutex:

#include <Python.h>
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

// Define the number of threads to create
#define NUM_THREADS 500
// Define the upper limit for the range in our Python calculation
#define RANGE_LIMIT 10000000

// Configure the Python interpreter
PyInterpreterConfig config = {
    .check_multi_interp_extensions = 1,  // Check for multi-interpreter compatible extensions
    .gil = PyInterpreterConfig_OWN_GIL,  // Each interpreter gets its own GIL
};

// Define a struct to hold the result of each thread's execution
typedef struct {
    int error_code;
    char error_message[256];
} ThreadResult;

// Declare a global mutex to protect interpreter creation
pthread_mutex_t interpreter_mutex = PTHREAD_MUTEX_INITIALIZER;

// Function that each thread will run
void *run_python_code(void *arg) {
    // Cast the argument to ThreadResult pointer
    ThreadResult *result = (ThreadResult *)arg;
    // Initialize the result
    result->error_code = 0;
    result->error_message[0] = '\0';

    PyThreadState *tstate = NULL;

    // Lock the mutex before creating a new interpreter
    pthread_mutex_lock(&interpreter_mutex);

    // Create a new Python interpreter
    PyStatus status = Py_NewInterpreterFromConfig(&tstate, &config);

    // Unlock the mutex after creating the interpreter
    pthread_mutex_unlock(&interpreter_mutex);

    // Check if interpreter creation was successful
    if (PyStatus_Exception(status)) {
        result->error_code = 1;
        snprintf(result->error_message, sizeof(result->error_message), 
                 "Couldn't spawn new interpreter");
        return NULL;
    }

    // Declare Python objects we'll use
    PyObject *pLocal, *pSum, *pRange, *pIter, *pItem;

    // Create a new dictionary for local variables
    pLocal = PyDict_New();
    // Create a Python long object initialized to 0
    pSum = PyLong_FromLong(0);
    // Add 'sum' to the local dictionary
    PyDict_SetItemString(pLocal, "sum", pSum);

    // Create a range object
    pRange = PyObject_CallFunction(PyObject_GetAttrString(PyImport_AddModule("builtins"), "range"), "l", RANGE_LIMIT);
    // Get an iterator from the range
    pIter = PyObject_GetIter(pRange);

    // Iterate over the range
    while ((pItem = PyIter_Next(pIter))) {
        // Add the current item to the sum
        PyObject *pNewSum = PyNumber_Add(pSum, pItem);
        Py_DECREF(pSum);
        pSum = pNewSum;
        Py_DECREF(pItem);
    }

    // Print the final sum
    PyObject_Print(pSum, stdout, 0);
    printf("\n");

    // Clean up Python objects
    Py_DECREF(pLocal);
    Py_DECREF(pSum);
    Py_DECREF(pRange);
    Py_DECREF(pIter);

    // End the interpreter
    Py_EndInterpreter(tstate);
    return NULL;
}

int main() {
    // Initialize the Python interpreter
    Py_Initialize();

    // Declare arrays to hold thread IDs and results
    pthread_t threads[NUM_THREADS];
    ThreadResult results[NUM_THREADS];
    int num_threads_created = 0;

    // Create threads
    for (int i = 0; i < NUM_THREADS; i++) {
        int ret = pthread_create(&threads[i], NULL, run_python_code, &results[i]);
        if (ret) {
            fprintf(stderr, "Error creating thread %d: %d\n", i, ret);
            break;
        } else {
            printf("Successfully created thread %d:\n", i);
        }
        num_threads_created++;
    }

    // Wait for all created threads to finish
    for (int i = 0; i < num_threads_created; i++) {
        pthread_join(threads[i], NULL);

        // Print any errors that occurred in the thread
        if (results[i].error_code) {
            fprintf(stderr, "Thread %d error: %s\n", i, results[i].error_message);
        }
        printf("Thread %d completed with error code: %d\n", i, results[i].error_code);
    }

    // Destroy the mutex
    pthread_mutex_destroy(&interpreter_mutex);

    // Finalize the Python interpreter
    Py_Finalize();
    return 0;
}

TBBle commented 2 months ago

@picrin Which version(s) of Python did you try this code with? Also, you did test with a debug Python build, correct?

The mutex approach only protects the interpreter creation, which before Python 3.13 always ended up calling _PyMem_SetDefaultAllocator, but since Python 3.13 no longer appear to always do so.

However, other APIs call _PyMem_SetDefaultAllocator without a global lock, and so a repro case that calls one of those APIs in the multi-interpreter stage should be able to reproduce the issue in Python 3.13 or later.

I don't have pthreads set up for Windows so can't test your code myself, but I will note that in Python 3.13, the free-threaded build already implicitly mutexes interpreter creation across all threads because it takes a runtime-wide lock in the import machinery.

picrin commented 2 months ago

It's Python 3.12, that's how I compile the code on my installation of MacOS:

gcc -L /usr/local/Frameworks/Python.framework/Versions/3.12/lib/ -I /usr/local/Frameworks/Python.framework/Versions/3.12/Headers/ -l python3.12 subinterpreters.c

I don't have Python debug build, so it's just a regular python build. I just read the thread more carefully and I realised this only affects the debug build -- sorry!

I think as per your explanation likely this doesn't work as a fix -- especially if the underlying C API call is already doing this.

Thanks for explaining!

ZeroIntensity commented 2 months ago

@TBBle After quite a bit of investigating, it looks like the allocation issue is actually a duplicate! See #116510 (the fix was #113412)

I was finally able to reproduce this issue, but it seems to occur only under specific conditions relating to allocation -- calling append on a list did it in my case.

That's why this doesn't occur on 3.13 -- it's actually fixed there. Unfortunately, there seems to be hesitation to backport the fix because it's quite complicated.

TBBle commented 2 months ago

After a quick look on my phone, i don't thInk that's the same issue, since it happens in release builds, and seems to involve confusion about shared obmalloc state between legacy subinterpreters.

In this issue's case, each interpreter has its own obmalloc state, but in debug builds they resolve down into the runtime's) raw allocator and in this repro, allication and deallocation are mispaired due to the raw allocator being switched by another thread.

I have to look more closely, but I suspect we avoided that other issue in 3.12 by not importing any native modules. Or maybe dyIng too quickly.

ZeroIntensity commented 2 months ago

It might be a separate cause, but from what I can see, isolated subinterpreters are pretty broken on 3.12 -- debug builds possibly just make it easier to occur. I haven't read this entire thread; does this occur on debug builds for 3.13?

TBBle commented 2 months ago

I don't have a repro for 3.13, but I expect it's possible. It only affects debug builds (or embedded Python with a custom raw allocator) due to the nature of the bug.

Here's the summary of this bug as I understand it:

The failure is that a pointer to memory allocated from one raw allocator is being passed to a different raw allocator's free function.
- One allocator is the raw debug allocator, which does checking and memory poisoning. The other is the standard raw allocator which directly calls system malloc/free.
There's two possible symptoms:
1. Pointer allocated with debug allocator is passed to system free: Probably a no-op in many cases (a harmless memory leak), but the MSVC debug allocator in particular detects this and throws a debug break from RtlValidateHeap if you have a debugger attached. This version is the one that almost always reproduced in my testing of this repro.
2. Pointer allocated with malloc is passed to Python debug allocator's deallocation function: This should be caught in _PyMem_DebugCheckAddress and reported via _PyObject_DebugDumpAddress on all platforms. I haven't looked into why this didn't reproduce on Linux for anyone, but I only saw it very rarely in my tests (or if I ignored the RtlValidateHeap failure and continued the run), so maybe we just didn't try hard enough? (Or maybe there was a per-platform code or thread-behaviour difference that defanged this reproduction, e.g., the relevant code might be compiled out or optimised down to a no-op on Linux reducing the repro window to a handful of instructions.)
Only affects debug builds, not release builds.
- This is because the bug is that when certain functions are called, they globally replace the raw allocator pointer with the standard (non-debug) allocator temporarily using _PyMem_SetDefaultAllocator (and a paired reversion function), and other threads may allocate or deallocate during this time.
- The pointer replacement in _PyMem_SetDefaultAllocator still happens in release builds, but since that's normally a no-op, there's no symptom.
- It's easy to trigger in debug builds because as it happens, debug builds end up pointing all allocators at the raw allocator underneath, so every allocation or deallocation during the window of the swapped allocator pointer is affected.
- Using a custom raw allocator in a release build, i.e. in embedded Python, would be able to trigger this bug, but since it would only affect raw allocations, it would trigger much less often.
The current repro case only affects Python 3.12, as the function that triggers this code-path during interpreter creation (_Py_ClearStandardStreamEncoding) was removed in #105154.
- Others _PyMem_SetDefaultAllocator calls still exists in the Python 3.13 code base, but is no longer always called during sub-interpreter creation, so this simple repro case won't reproduce it. I haven't broken out a new bug report for this issue because I haven't had time yet to try and reproduce it in 3.13 with a new repro case.
- It's possible that in 3.13, every caller of _PyMem_SetDefaultAllocator happens during either main-interpreter initialisation or main-interpreter teardown, in which case this bug would not reproduce in 3.13 onwards, entirely by accident.
113412 is unrelated, as that only affects interpreters that shared their objmalloc state, and in this repro, each interpreter has its own obmalloc state (a prerequisite for having separate GILs).
- 116510 is a segfault, and we don't have a segfault here, as we don't have any invalid pointers or double-frees or the like. The WeeChat reproduction case is with the legacy subinterpreter support, which is shared-GIL and shared-obmalloc.

Clarifying note: I have not tried this on Linux, and reading back in the ticket, I don't see that anyone stated they tried the given repro case in a debug build on either Linux or MacOS. So maybe this is not actually Windows-only, just undertested on other platforms. (Does any MacOS or Linux Python distro ship a debug build that this could easily be tested against?)

Late update: A debug build is not necessary, just the debug memory hooks (which function like a custom allocator as I surmised above). As such, I was able to reproduce this on a Python 3.12 Linux package from Ubuntu.

ZeroIntensity commented 2 months ago

each interpreter has its own obmalloc state (a prerequisite for having separate GILs).

I was under the impression that was broken on 3.12, which is what #113412 fixed.

TBBle commented 2 months ago

As far as I can tell (browsing on GitHub), before #113412, all interpreters had their own obmalloc state instance; sub-interpreters that were marked to share the main interpreters state would just never use their own state data, except in cases like #116510 where the sub-interpreter's obmalloc would be unintentionally used (during interpreter cleanup triggering sub-interpreter cleanup, I suspect).

113412's change was to completely remove these unused obmalloc states, by moving the structure from inline in the interpreter state into being a heap object, and leaving a pointer in the structure; in subinterpreters sharing the main obmalloc, the pointer was pointing back to the main obmalloc. The motivation for that change was that these unused obmalloc pools (actually, probably all obmalloc pools) were leaking memory on de-init/re-init cycles, i.e. #113055. It also fixed at-time-unknown segfaults from mismatching-obmalloc pool use of some sort, i.e. #116510.

I suspect that the pivotal fix for the segfault is actually in get_state, as has_own_state has a behaviour of "always returning true during main-interpreter finalisation", which would return the wrong obmalloc instance during main-interpreter finalisation for sub-interpreters that were supposed to share the main interpreter's sate. The simpler get_state can now unconditionally return the interpreter's obmalloc state pointer, and is unaffected by this odd behaviour in has_own_state.

The reason that does not affect this ticket's repro is that in our repro case, each sub-interpreter is already correctly returning true from has_own_state all the time and hence get_state always returned the interpreter's private obmalloc instance, so #113412 would not have affected that; we never used the wrong obmalloc state instance in our repro case. That's why we never saw segfaults from this repro case.

I suspect remaining uses of has_own_state are similarly at risk, but are generally either in asserts or memory-usage-counting, where an incorrect true will either fail to fire an assertion, or over-count memory usage. Either way, I doubt anyone will ever notice it; without trying it, I suspect you cannot call the memory-accounting code (sys.getallocatedblocks()) that uses has_own_state during main-interpreter finalisation, and the caller in the main interpreter finalisation code is called after all sub-interpreters are finalised.

bruxisma commented 2 months ago

It might be a separate cause, but from what I can see, isolated subinterpreters are pretty broken on 3.12 -- debug builds possibly just make it easier to occur. I haven't read this entire thread; does this occur on debug builds for 3.13?

It occurs in debug builds for both 3.13 and 3.14 😔

TBBle commented 2 months ago

It occurs in debug builds for both 3.13 and 3.14 😔

Was this with your original repro? I couldn't trigger it in 3.13.0rc1, but that was before I understood underlying failure, so it's possible that I did something wrong.

Since it's threading-related, how many cores does your test host have?

Late edit: Poking through the 3.13.0rc2 source locally, all calls to _PyMem_SetDefaultAllocator are either in code that asserts itself as pre-init or post-finalize (import related), is in code that is config handling related and hence only happens pre-init, or are in deprecated APIs which would have been removed in #105154 except they are needed for the Stable ABI. (Some calls are in cases that match more than one of those...) So I cannot at this time identify a 3.13 repro for the original repo case by inspection.

As such, I can trivially reproduce the issue in 3.13 if I add a call Py_SetProgramName(L""); after Py_NewInterpreterFromConfig, i.e. in the threaded context. I don't think that's a high-urgency repro case, calling a nominally-removed API; and I can't at this time see a better repro case in 3.13.

python / cpython

Creating an ungodly amount of sub interpreters in a short amount of time causes memory debug assertions. #123134

Bug report

Bug description:

CPython versions tested on:

Operating systems tested on:

123728 is a documentation fix for the usage of PyGILState* in isolated subinterpreters. @TBBle, it might be useful to create a new issue for the allocator problem to get some fresh eyes on it. It's buried in the comments of this issue right now :(

113412 is unrelated, as that only affects interpreters that shared their objmalloc state, and in this repro, each interpreter has its own obmalloc state (a prerequisite for having separate GILs).

116510 is a segfault, and we don't have a segfault here, as we don't have any invalid pointers or double-frees or the like. The WeeChat reproduction case is with the legacy subinterpreter support, which is shared-GIL and shared-obmalloc.

123728 is a documentation fix for the usage of `PyGILState*` in isolated subinterpreters. @TBBle, it might be useful to create a new issue for the allocator problem to get some fresh eyes on it. It's buried in the comments of this issue right now :(