mxmlnkn / rapidgzip

Gzip Decompression and Random Access for Modern Multi-Core Machines
Apache License 2.0

Fatal Python error: _PyMem_DebugMalloc: Python memory allocator called without holding the GIL #24

Closed: jamadden closed this issue 7 months ago

jamadden commented 8 months ago

It appears that the background threads do not ensure they hold the GIL when they call Python APIs. When the interpreter runs in dev mode (PYTHONDEVMODE=1 or -X dev), it detects this and kills the process; in non-dev mode, the behavior is undefined.

Tested with Python 3.10 and 3.11 and rapidgzip 0.10.3 on macOS 13.6.1/Apple Silicon. (I was unable to produce this same result on Python 3.12 because it segfaults instead; that may be a symptom of the same problem, however.)

Given this example script:

# test.py
import rapidgzip
import io
import gzip

buf = io.BytesIO()

with gzip.open(buf, mode='wb') as f:
    f.write(b'Now is the winter of our discontent')

buf.seek(0)

fobj = rapidgzip.RapidgzipFile(buf)
fobj.block_offsets()
print('SUCCESS')

You should be able to crash (or not) the process like so:

$ PYTHONDEVMODE= python3.10 test.py
SUCCESS
$ PYTHONDEVMODE=1 python3.10 /tmp/test.py
Fatal Python error: _PyMem_DebugMalloc: Python memory allocator called without holding the GIL
Python runtime state: initialized

Thread 0x00000001e059a080 (most recent call first):
  File "/tmp/test.py", line 14 in <module>

Extension modules: rapidgzip (total: 1)
Abort trap: 6

This produces a core with the following stack traces:

Thread 0::  Dispatch queue: com.apple.main-thread
0   libsystem_kernel.dylib                 0x1854cf730 __psynch_cvwait + 8
1   libsystem_pthread.dylib                0x18550c574 _pthread_cond_wait + 1232
2   libc++.1.dylib                         0x185434f84 std::__1::condition_variable::__do_timed_wait(std::__1::unique_lock<std::__1::mutex>&, std::__1::chrono::time_point<std::__1::chrono::system_clock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l>>>) + 100
3   rapidgzip.cpython-310-darwin.so        0x1013525e8 std::__1::future_status std::__1::__assoc_sub_state::wait_until<std::__1::chrono::steady_clock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l>>>(std::__1::chrono::time_point<std::__1::chrono::steady_clock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l>>> const&) const + 152
4   Python                                 0x1008b9d00 call_function + 124
5   Python                                 0x1008b3248 _PyEval_EvalFrameDefault + 16136
6   Python                                 0x1008ae3f8 _PyEval_Vector + 368
7   Python                                 0x1008ae274 PyEval_EvalCode + 104
8   Python                                 0x1008f9be0 run_eval_code_obj + 84
9   Python                                 0x1008f9b44 run_mod + 112
10  Python                                 0x1008f9968 pyrun_file + 148
11  Python                                 0x1008f93b8 _PyRun_SimpleFileObject + 268
12  Python                                 0x1008f8d6c _PyRun_AnyFileObject + 216
13  Python                                 0x10091348c pymain_run_file_obj + 220
14  Python                                 0x100912de4 pymain_run_file + 72
15  Python                                 0x100912740 Py_RunMain + 856
16  Python                                 0x1009137d8 Py_BytesMain + 40
17  dyld                                   0x1851b3f28 start + 2236

Thread 1 Crashed:
0   libsystem_kernel.dylib                 0x1854d4744 __pthread_kill + 8
1   libsystem_pthread.dylib                0x18550bc28 pthread_kill + 288
2   libsystem_c.dylib                      0x185419ae8 abort + 180
3   Python                                 0x1008f3ff8 fatal_error_exit + 16
4   Python                                 0x1008f3e4c fatal_error + 44
5   Python                                 0x1008f609c _Py_FatalErrorFunc + 56
6   Python                                 0x100832dcc _PyMem_DebugFree + 68
7   Python                                 0x10096a028 _PyFaulthandler_Fini + 172
8   Python                                 0x1008f3f5c fatal_error + 316
9   Python                                 0x1008f609c _Py_FatalErrorFunc + 56
10  Python                                 0x100832cd8 _PyMem_DebugMalloc + 144
11  Python                                 0x100915aec _PyObject_GC_Alloc + 68
12  Python                                 0x100915ce0 _PyObject_GC_NewVar + 56
13  Python                                 0x100841158 PyTuple_Pack + 368
14  rapidgzip.cpython-310-darwin.so        0x1013ac724 PythonFileReader::read(char*, unsigned long) + 68
15  rapidgzip.cpython-310-darwin.so        0x10135f6cc rapidgzip::GzipChunkFetcher<FetchingStrategy::FetchMultiStream, rapidgzip::ChunkData, false>::decodeBlock(unsigned long, unsigned long) const + 384
16  libsystem_pthread.dylib                0x185506da0 thread_start + 8

Using a more complicated underlying stream (from the smart-open package), I've also seen a crash like this:

Thread 4 Crashed:
0   Python                                 0x102d0aa50 _PyErr_FormatVFromCause + 224
1   Python                                 0x102d0a960 _PyErr_FormatFromCauseTstate + 28
2   Python                                 0x102c238c0 _Py_CheckFunctionResult + 136
3   Python                                 0x102cf343c _PyEval_EvalFrameDefault + 33020
4   Python                                 0x102cea3f8 _PyEval_Vector + 368
5   Python                                 0x102cf5d00 call_function + 124
6   Python                                 0x102cee840 _PyEval_EvalFrameDefault + 13568
7   Python                                 0x102cea3f8 _PyEval_Vector + 368
8   Python                                 0x102cf5d00 call_function + 124
9   Python                                 0x102cee840 _PyEval_EvalFrameDefault + 13568
10  Python                                 0x102cea3f8 _PyEval_Vector + 368
11  Python                                 0x102cf5d00 call_function + 124
12  Python                                 0x102cee840 _PyEval_EvalFrameDefault + 13568
13  Python                                 0x102cea3f8 _PyEval_Vector + 368
14  Python                                 0x102c26970 method_vectorcall + 124
15  Python                                 0x102c243e8 PyVectorcall_Call + 176
16  Python                                 0x102cee7a0 _PyEval_EvalFrameDefault + 13408
17  Python                                 0x102cea3f8 _PyEval_Vector + 368
18  Python                                 0x102c26970 method_vectorcall + 124
19  Python                                 0x102cf5d00 call_function + 124
20  Python                                 0x102cebf9c _PyEval_EvalFrameDefault + 3164
21  Python                                 0x102cea3f8 _PyEval_Vector + 368
22  Python                                 0x102c26970 method_vectorcall + 124
23  Python                                 0x102cf5d00 call_function + 124
24  Python                                 0x102cebf9c _PyEval_EvalFrameDefault + 3164
25  Python                                 0x102cea3f8 _PyEval_Vector + 368
26  Python                                 0x102c26970 method_vectorcall + 124
27  Python                                 0x102cf5d00 call_function + 124
28  Python                                 0x102cef248 _PyEval_EvalFrameDefault + 16136
29  Python                                 0x102cea3f8 _PyEval_Vector + 368
30  Python                                 0x102c26a1c method_vectorcall + 296
31  Python                                 0x102d77ba0 partial_vectorcall + 560
32  Python                                 0x102c243e8 PyVectorcall_Call + 176
33  Python                                 0x102cee7a0 _PyEval_EvalFrameDefault + 13408
34  Python                                 0x102cea3f8 _PyEval_Vector + 368
35  Python                                 0x102cf5d00 call_function + 124
36  Python                                 0x102cec098 _PyEval_EvalFrameDefault + 3416
37  Python                                 0x102cea3f8 _PyEval_Vector + 368
38  Python                                 0x102cf5d00 call_function + 124
39  Python                                 0x102cee840 _PyEval_EvalFrameDefault + 13568
40  Python                                 0x102cea3f8 _PyEval_Vector + 368
41  Python                                 0x102cf5d00 call_function + 124
42  Python                                 0x102cee840 _PyEval_EvalFrameDefault + 13568
43  Python                                 0x102cea3f8 _PyEval_Vector + 368
44  Python                                 0x102c26a1c method_vectorcall + 296
45  rapidgzip.cpython-310-darwin.so        0x109858e5c unsigned long callPyObject<unsigned long, long long, int>(_object*, long long, int) + 72
46  rapidgzip.cpython-310-darwin.so        0x10980c624 rapidgzip::GzipChunkFetcher<FetchingStrategy::FetchMultiStream, rapidgzip::ChunkData, false>::decodeBlock(unsigned long, unsigned long) const + 4312
47  libsystem_pthread.dylib                0x185506da0 thread_start + 8

Finally, here's what the Python 3.12 segfault looks like for that same test program:

Termination Reason:    Namespace SIGNAL, Code 11 Segmentation fault: 11
Terminating Process:   Python [98976]

VM Region Info: 0x10 is not in any region.  Bytes before following region: 105553518919664
      REGION TYPE                    START - END         [ VSIZE] PRT/MAX SHRMOD  REGION DETAIL
      UNUSED SPACE AT START
--->  
      MALLOC_NANO (reserved)   600018000000-600020000000 [128.0M] rw-/rwx SM=NUL  ...(unallocated)

Thread 0::  Dispatch queue: com.apple.main-thread
0   libsystem_kernel.dylib                 0x1854cf730 __psynch_cvwait + 8
1   libsystem_pthread.dylib                0x18550c574 _pthread_cond_wait + 1232
2   libc++.1.dylib                         0x185434f84 std::__1::condition_variable::__do_timed_wait(std::__1::unique_lock<std::__1::mutex>&, std::__1::chrono::time_point<std::__1::chrono::system_clock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l>>>) + 100
3   rapidgzip.cpython-312-darwin.so        0x103fe9708 std::__1::future_status std::__1::__assoc_sub_state::wait_until<std::__1::chrono::steady_clock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l>>>(std::__1::chrono::time_point<std::__1::chrono::steady_clock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l>>> const&) const + 152

Thread 1 Crashed:
0   libsystem_kernel.dylib                 0x1854d4744 __pthread_kill + 8
1   libsystem_pthread.dylib                0x18550bc28 pthread_kill + 288
2   libsystem_c.dylib                      0x1853e246c raise + 32
3   Python                                 0x103344358 faulthandler_fatal_error + 360
4   ???                             0xffff8001034c3998 ???
5   rapidgzip.cpython-312-darwin.so        0x103ff67ec rapidgzip::GzipChunkFetcher<FetchingStrategy::FetchMultiStream, rapidgzip::ChunkData, false>::decodeBlock(unsigned long, unsigned long) const + 384
6   libsystem_pthread.dylib                0x185506da0 thread_start + 8
mxmlnkn commented 8 months ago

Thank you for reporting this. I will have to read up on how to use the GIL correctly. It seems that pybind11 does some related things. I'm not using pybind11; maybe I could, and if not, it's still nice to have a working example to refer to. All of this needs to be done inside src/core/filereader/Python.hpp, which should be the only piece of code that calls into Python.

jamadden commented 8 months ago

In a nutshell, you call PyGILState_Ensure() before calling into Python from an arbitrary thread, and you call PyGILState_Release() when you're done calling into Python.

My gevent library (written in C) uses some macros. A function that could be called from anywhere, anytime, looks like this:

#define GGIL_DECLARE  PyGILState_STATE ___save
#define GGIL_ENSURE   ___save = PyGILState_Ensure();
#define GGIL_RELEASE  PyGILState_Release(___save);

static void some_function()
{
    GGIL_DECLARE;
    int i; /* other variable declarations; we were trying to be C89 compliant */
    /* Now the body of the function might use Python APIs, so ensure we have the GIL. */
    GGIL_ENSURE;
    /* now do stuff */
    GGIL_RELEASE;
}

But since this project is in C++, it has the option of using the RAII pattern and declaring an instance of a custom class that "ensures" in the constructor and "releases" in the destructor.
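
For illustration, such a guard could be as small as this (a minimal sketch, assuming only the standard PyGILState API; the name EnsureGIL is just the placeholder used in the snippets below):

#include <Python.h>

class EnsureGIL
{
public:
    EnsureGIL() : m_state( PyGILState_Ensure() ) {}   // acquire (or re-enter) the GIL
    ~EnsureGIL() { PyGILState_Release( m_state ); }   // restore the previous GIL state

    // Non-copyable: releasing the same state twice would corrupt the GIL bookkeeping.
    EnsureGIL( const EnsureGIL& ) = delete;
    EnsureGIL& operator=( const EnsureGIL& ) = delete;

private:
    PyGILState_STATE m_state;
};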

One small note of caution: obtaining the GIL can be expensive, so, whenever possible, it's best to do it at a relatively high level instead of in every function that calls the Python API.

That is, instead of this:

// assuming this is the RAII class (a full sketch is given above)
class EnsureGIL;

static void func1()
{
    EnsureGIL g;
    // do stuff
}

static void func2()
{
    EnsureGIL g;
    // do stuff
}

static void driver()
{
    func1();
    func2();
}

it would be better to have the driver function handle the GIL.

static void func1()
{
    // do stuff
}

static void func2()
{
    // do stuff
}

static void driver()
{
    EnsureGIL g;
    func1();
    func2();
}

That can be easier said than done, though, unfortunately.

mxmlnkn commented 7 months ago

> That can be easier said than done, though, unfortunately.

Like always...

I experimented a bit and can reproduce it, thanks to your detailed bug report. I also have a simple RAII helper that looks like this:

class ScopedGILLock
{
public:
    ScopedGILLock()
    {
        /* Only the outermost lock per thread actually acquires the GIL;
         * nested locks simply increment the reference counter. */
        if ( m_referenceCounter++ == 0 ) {
            m_state = PyGILState_Ensure();
        }
    }

    ~ScopedGILLock()
    {
        if ( m_referenceCounter == 0 ) {
            return;
        }

        if ( --m_referenceCounter == 0 ) {
            PyGILState_Release( m_state );
            m_state = {};
        }
    }

private:
    /* Both members are thread-local, so each thread tracks its own nesting depth. */
    inline static thread_local PyGILState_STATE m_state{};
    inline static thread_local size_t m_referenceCounter{ 0 };
};

The problem now is that it deadlocks when called from a child thread, because the main thread has held the GIL from the beginning: it entered the extension from Python and never released it. At this point, I honestly question how useful that PYTHONDEVMODE error is. Why do I have to lock the GIL on the other thread when I already own it on the main thread and can ensure, thanks to SharedFileReader, which wraps PythonFileReader, that all calls into Python are serialized? Well, I do see the usefulness of PYTHONDEVMODE=1 and would like to enable it in my CI tests.
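
Spelled out (my reading of the situation, as a hypothetical sequence for illustration, using the names from above):

// 1. Main thread: enters the extension from Python (e.g., block_offsets()), so it holds the GIL.
// 2. Main thread: blocks on a std::future, waiting for a worker thread; the GIL is still held.
// 3. Worker thread: constructs a ScopedGILLock -> PyGILState_Ensure() blocks forever,
//    because the main thread never releases the GIL while it waits.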

The question now is how best to release the GIL on the main thread. I don't want to do that inside each method that is exposed to Python, because it feels like a deadlock bug waiting to happen, e.g., when another method is added in the future.

Edit: Well, I guess only methods that wait for worker threads to do something would need to release the GIL. And thanks to the relatively slim interface, this should only affect the read method.
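
For illustration, the complementary guard might look like this (a hypothetical sketch, not code from the repository, assuming only the standard PyEval_SaveThread/PyEval_RestoreThread API):

class ScopedGILUnlock
{
public:
    // PyEval_SaveThread() releases the GIL and returns the current thread state;
    // it must only be called by a thread that currently holds the GIL.
    ScopedGILUnlock() : m_state( PyEval_SaveThread() ) {}
    ~ScopedGILUnlock() { PyEval_RestoreThread( m_state ); }

    ScopedGILUnlock( const ScopedGILUnlock& ) = delete;
    ScopedGILUnlock& operator=( const ScopedGILUnlock& ) = delete;

private:
    PyThreadState* m_state;
};

The read method could then instantiate a ScopedGILUnlock before waiting on the worker threads, so that they can acquire the GIL via ScopedGILLock whenever they need to call into the Python file object.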