thierry-martinez / pyml

OCaml bindings for Python
BSD 2-Clause "Simplified" License

OCaml Gc caused Fatal Python Error: Python memory allocator called without holding the GIL #97

Open tysg opened 1 year ago

tysg commented 1 year ago

First of all, thank you for the amazing package! We had this error, detailed below:

Error output:

Fatal Python error: _PyMem_DebugFree: Python memory allocator called without holding the GIL
Python runtime state: initialized

Thread 0x00007f5261293180 (most recent call first):
<no Python frame>

gdb backtrace:

(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007f52621d8537 in __GI_abort () at abort.c:79
#2  0x00007f525991df5f in fatal_error_exit (status=<optimized out>) at ../Python/pylifecycle.c:2201
#3  0x00007f525991eff9 in fatal_error (stream=<optimized out>, header=header@entry=1, prefix=prefix@entry=0x7f5259a780d0 <__func__.13> "_PyMem_DebugFree", msg=msg@entry=0x7f5259a77a00 "Python memory allocator called without holding the GIL", status=status@entry=-1) at ../Python/pylifecycle.c:2285
#4  0x00007f52599217cb in _Py_FatalErrorFunc (func=func@entry=0x7f5259a780d0 <__func__.13> "_PyMem_DebugFree", msg=msg@entry=0x7f5259a77a00 "Python memory allocator called without holding the GIL") at ../Python/pylifecycle.c:2301
#5  0x00007f525981f7e9 in _PyMem_DebugCheckGIL (func=0x7f5259a780d0 <__func__.13> "_PyMem_DebugFree") at ../Objects/obmalloc.c:2320
#6  _PyMem_DebugFree (ctx=0x7f5259cd3ad0 <_PyMem_Debug+48>, ptr=0x7f4ff8178400) at ../Objects/obmalloc.c:2344
#7  0x00007f52598203cf in PyMem_Free (ptr=<optimized out>) at ../Objects/obmalloc.c:629
#8  0x00007f52597ef403 in list_dealloc (op=0x7f4ff816c640) at ../Objects/listobject.c:338
#9  0x000055c72a0fafbe in caml_empty_minor_heap () at minor_gc.c:413
#10 0x000055c72a0fb438 in caml_gc_dispatch () at minor_gc.c:492
#11 0x000055c72a0fb556 in caml_alloc_small_dispatch (wosize=6, flags=3, nallocs=2, encoded_alloc_lens=0x55c72d8ad9d1 <camlCtypes_ptr__frametable+49> "\002\001") at minor_gc.c:539
#12 0x000055c72a115dca in caml_call_gc ()
#13 0x000055c72976f084 in camlCtypes_ptr__add_bytes_391 () at src/ctypes/ctypes_ptr.ml:76
#14 0x000055c72977478b in camlCtypes_memory__$2b$40_742 () at src/ctypes/ctypes_memory.ml:114
#15 0x000055c72964e0f9 in camlOnnx__Wrappers__fun_3581 () at src/ctypes/ctypes_memory.ml:175
#16 0x000055c729b63f23 in camlBase__Array0__init_1279 () at src/array0.ml:88
#17 0x000055c729f59e39 in camlLwt_preemptive__task_819 () at src/unix/lwt_preemptive.ml:184
#18 0x000055c729f59988 in camlLwt_preemptive__worker_loop_534 () at src/unix/lwt_preemptive.ml:104
#19 0x000055c729f857b5 in camlThread__fun_850 () at thread.ml:49
#20 0x000055c72a115f01 in caml_start_program ()
#21 0x000055c72a10c97d in caml_callback_exn (closure=closure@entry=139991788601392, arg=<optimized out>, arg@entry=1) at callback.c:111
#22 0x000055c72a0edcc0 in caml_thread_start (arg=0x55c735c81280) at st_stubs.c:548
#23 0x00007f5262886ea7 in start_thread (arg=<optimized out>) at pthread_create.c:477
#24 0x00007f52622b1a2f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

From what I can see, the OCaml GC tried to reclaim memory held by Python without holding the GIL: in the backtrace, caml_empty_minor_heap ends up in list_dealloc on an Lwt_preemptive worker thread. Unfortunately I cannot provide a minimal reproduction, but I can continue to monitor this and report more findings if I have them.

adrien-n commented 3 days ago

I think I'm seeing the same issue. I haven't come up with a minimal reproducer yet, however.

The backtrace is different but the core is similar: caml_gc_dispatch -> caml_empty_minor_heap -> dict_dealloc -> _PyInterpreterState_GET, and crash.

I can't think of a work-around at the moment and I'm not sure there's a good way to fix the issue in pyml either. (heh, or wait until we have a python with no gil! ;p )

adrien-n commented 3 days ago

I tried migrating to OCaml 5, hoping that collections would happen per-thread, but that didn't help. It seems that if I have a thread dedicated to Python operations and call Gc.compact () frequently from it, the crashes occur less often. Unfortunately, they still occur far too often in practice.

adrien-n commented 3 days ago

I tried changing pydecref() in pyml_stubs.c to wrap the actual Py_DECREF() in Python_PyGILState_{Ensure,Release}() and, unsurprisingly, ended up with a deadlock instead, again triggered by the GC. I had hoped that the OCaml 5 GC would collect values from the same thread that allocated them, but I guess it does not (or maybe not for custom values?). That would at least have made it possible to conduct all Python operations from a single thread.

adrien-n commented 2 days ago

There's actually a work-around with OCaml 5, it seems: dedicate a domain to Python execution and do everything you can there. That helps for values allocated on the minor heap, since they'll be collected from the same domain (and therefore, probably, the same OS thread). You'll therefore want to avoid values being promoted or allocated directly on the major heap, which means triggering the GC yourself frequently enough (while making sure values can be collected), and maybe also tweaking Gc.{custom_minor_ratio,custom_minor_max_size,minor_heap_size}.
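For illustration, the dedicated-domain idea could be sketched roughly as below, using only the OCaml 5 standard library. Everything here is an assumption made for the example: Py_worker, start, run, stop and tune_gc are hypothetical names, not part of pyml, and the Gc values are untuned placeholders. Whether the custom_* settings have any effect depends on how the binding allocates its custom blocks.

```ocaml
(* Hypothetical sketch, not part of pyml: one domain runs every job, so
   that short-lived values die in that domain's minor heap. *)
module Py_worker : sig
  type t
  val start : unit -> t
  val run : t -> (unit -> 'a) -> 'a   (* run a job on the worker domain *)
  val stop : t -> unit
end = struct
  type state = {
    jobs : (unit -> unit) Queue.t;
    lock : Mutex.t;
    wake : Condition.t;      (* a job was queued, or stop was requested *)
    finished : Condition.t;  (* some job has completed *)
    mutable stopping : bool;
  }

  type t = { state : state; domain : unit Domain.t }

  let rec loop s =
    Mutex.lock s.lock;
    while Queue.is_empty s.jobs && not s.stopping do
      Condition.wait s.wake s.lock
    done;
    let job = Queue.take_opt s.jobs in
    Mutex.unlock s.lock;
    match job with
    | None -> ()  (* stop requested and queue drained *)
    | Some job ->
      job ();
      (* Collect eagerly so that dead values are not promoted. *)
      Gc.minor ();
      loop s

  let start () =
    let state = {
      jobs = Queue.create ();
      lock = Mutex.create ();
      wake = Condition.create ();
      finished = Condition.create ();
      stopping = false;
    } in
    { state; domain = Domain.spawn (fun () -> loop state) }

  let run t f =
    let s = t.state in
    let result = ref None in
    let job () =
      let r = try Ok (f ()) with e -> Error e in
      Mutex.lock s.lock;
      result := Some r;
      Condition.broadcast s.finished;
      Mutex.unlock s.lock
    in
    Mutex.lock s.lock;
    Queue.add job s.jobs;
    Condition.signal s.wake;
    while Option.is_none !result do
      Condition.wait s.finished s.lock
    done;
    Mutex.unlock s.lock;
    match !result with
    | Some (Ok v) -> v
    | Some (Error e) -> raise e
    | None -> assert false

  let stop t =
    Mutex.lock t.state.lock;
    t.state.stopping <- true;
    Condition.signal t.state.wake;
    Mutex.unlock t.state.lock;
    Domain.join t.domain
end

(* Placeholder values for the Gc fields mentioned above; untuned. *)
let tune_gc () =
  Gc.set { (Gc.get ()) with
           Gc.minor_heap_size = 1 lsl 20;  (* in words *)
           custom_minor_ratio = 100;
           custom_minor_max_size = 8192 }
```

With something like this, Py.initialize and every later pyml call would go through Py_worker.run, and jobs would return plain OCaml data (ints, strings, and so on) rather than Python object handles, so that the handles never become reachable from another domain.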

Of course this is only working around the issue but it seems to work well enough in my case and hopefully there will be GIL-free python builds widely available in the coming months.