MilesCranmer opened 9 months ago
Not sure if https://github.com/python/cpython/pull/97920 is related at all to this?
The Garbage Collector now runs only on the eval breaker mechanism of the Python bytecode evaluation loop instead of on object allocations. The GC can also run when PyErr_CheckSignals() is called, so C extensions that need to run for a long time without executing any Python code also have a chance to execute the GC periodically.
Since model.fit() here runs a C extension for a while, I wonder if the "periodic GC" is causing deallocation of some memory referenced by both Julia and Python, so that when PyCall.jl frees the PyObject (here), it happens to have already been freed by Python.
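A quick way to probe this hypothesis (a diagnostic sketch only, with a placeholder standing in for the actual model.fit() call) is to disable CPython's automatic garbage collection around the long-running extension call and see whether the crash disappears:

```python
import gc

# Diagnostic sketch: if the crash goes away with automatic GC disabled,
# that implicates collection timing rather than a specific allocation.
gc.disable()
try:
    pass  # the long-running extension call (e.g. model.fit) would go here
finally:
    gc.collect()   # run one explicit collection afterwards
    gc.enable()

print(gc.isenabled())  # → True
```

This doesn't fix anything, but it separates "the GC frees something it shouldn't" from "the memory was already corrupted and the GC merely trips over it".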
@pablogsal sorry for the tag but I'm wondering if you might have any intuition for how I could investigate this further?
Not sure if #97920 is related at all to this?
I think it is very unlikely that this is related. That commit will make GC runs less common, and it will make them happen at times that are safer for the runtime (meaning there will be fewer chances of the runtime encountering illegal conditions). The fact that this happens at GC time unfortunately doesn't necessarily point at the GC: this is normally the time when error conditions that already happened in the past (corruption, illegal references or cycles) are discovered, because object links are heavily exercised here.
It is very difficult to make any informed suggestion just from the traceback, but here is what I can observe:
There is a __del__ method somewhere; jfptr_pydecref_1039 just calls PyObject_Free directly. The bug looks to be in the extension layer, not in CPython (although that is possible), so unfortunately, without a reproducer that only involves CPython, we won't be able to help more.
Answering some additional things:
I wonder if the "periodic GC" is causing deallocation of some memory reference
As mentioned before, the periodic GC in 3.12 runs less often, not more. It is only executed when Python executes bytecode or when PyErr_CheckSignals() is called. Both points are totally legal and the runtime should be consistent at them. Before, it could be executed on potentially every memory allocation.
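One way to observe exactly when CPython's cycle collector runs (a small observational sketch, useful for correlating collections with the crash window) is to register a callback via gc.callbacks:

```python
import gc

events = []

def on_gc(phase, info):
    # phase is "start" or "stop"; info["generation"] is the generation collected.
    events.append((phase, info["generation"]))

gc.callbacks.append(on_gc)
gc.collect()                 # force one full (generation 2) collection
gc.callbacks.remove(on_gc)

print(events)  # → [('start', 2), ('stop', 2)]
```

If no such events fire near the crash, that is further evidence the CPython GC is not the actor.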
referenced by both Julia and Python, so that when PyCall.jl frees the PyObject (here), it so happens that it was already freed by Python.
If a Julia object is visible to Python after being deallocated, then that's an error condition in the extension. If an object is destroyed, the GC should NOT be able to see it.
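From the Python side, this invariant can be illustrated with a weak reference (a minimal sketch of the rule, not PyCall's actual mechanism): once the last strong reference is gone, nothing should be able to reach the object anymore.

```python
import weakref

class Payload:
    """Stand-in for an object shared between the two runtimes."""

obj = Payload()
ref = weakref.ref(obj)   # does not keep obj alive
assert ref() is obj      # object still alive and visible
del obj                  # last strong reference gone -> deallocated
assert ref() is None     # a destroyed object must no longer be reachable
print("ok")              # → ok
```

A binding layer that can still "see" an object after its weak references have cleared is holding a dangling pointer, which is exactly the error condition described above.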
Looking at the traceback more closely, I don't see CPython's GC anywhere. All the gc functions are referring to Julia's GC. In particular, the run_finalizers function is from here:
https://github.com/JuliaLang/julia/blob/1b183b93f4b78f567241b1e7511138798cea6a0d/src/gc.c#L406
So this looks like an extension/Julia problem, as I don't see any CPython GC calls here.
Thanks very much for the advice. Indeed, it sounds like an issue on the Julia side, so feel free to close. The traceback is from Julia, but I was wondering whether some change in the Python GC might have freed memory that PyCall was expecting to free itself; as you suggest, it appears to be something else.
The thing I am puzzled about is that this issue occurs only when upgrading from Python 3.11 to 3.12, yet the Julia version (or PyCall version) does not seem to affect it, and it manifests as these segfaults from garbage collection code. I guess I will need to figure out whether there are any changes in 3.12 that break assumptions in PyCall. (I don't think it's PySR-specific, as PySR is basically a lightweight wrapper around a few PyJulia calls; it's just the only way I've been able to consistently reproduce this so far.)
Maybe you can run your reproducer under valgrind and that will point to where the memory was allocated or maybe freed twice? You probably need to look closely and filter a lot of false positives but the answer may be there.
Another possibility is to use memory sanitizer, as that normally tells you where the object was allocated.
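Complementary to valgrind and the sanitizers, CPython's built-in faulthandler module can dump the Python-level tracebacks of all threads when a SIGSEGV arrives, which helps correlate the native crash with whatever Python code was running. A minimal setup sketch:

```python
import faulthandler
import sys

# On SIGSEGV/SIGFPE/SIGABRT/SIGBUS, CPython will write each thread's
# Python traceback to the given file before the process dies.
faulthandler.enable(file=sys.stderr, all_threads=True)

print(faulthandler.is_enabled())  # → True
```

The same effect is available without code changes by running the reproducer with `python -X faulthandler` or the PYTHONFAULTHANDLER environment variable set.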
Thanks. I did a run of valgrind both on the pure Julia side and also the Python version that segfaults. It looks like most of the errors are just related to codegen and package loading (might just be false positives). I don't immediately notice anything stemming from the Python<->Julia interface.
It's odd because running directly from Julia has no errors, but I don't really see anything related to Python in the errors other than maybe the ffi calls.
In particular, valgrind says the error comes from:
==644289== Thread 6:
==644289== Invalid read of size 8
==644289== at 0x17F5EB91: jl_gc_state_set (julia_threads.h:351)
==644289== by 0x17F5EB91: jl_gc_state_set (julia_threads.h:344)
==644289== by 0x17F5EB91: ijl_task_get_next (partr.c:514)
==644289== Address 0x0 is not stack'd, malloc'd or (recently) free'd
==644289==
==644289==
==644289== Process terminating with default action of signal 11 (SIGSEGV)
==644289== Access not within mapped region at address 0x0
==644289== at 0x17F5EB91: jl_gc_state_set (julia_threads.h:351)
==644289== by 0x17F5EB91: jl_gc_state_set (julia_threads.h:344)
==644289== by 0x17F5EB91: ijl_task_get_next (partr.c:514)
==644289== If you believe this happened as a result of a stack
==644289== overflow in your program's main thread (unlikely but
==644289== possible), you can try to increase the size of the
==644289== main thread stack using the --main-stacksize= flag.
==644289== The main thread stack size used in this run was 16777216.
==644289==
I'm going to try rebuilding Julia and Python with a memory sanitizer and maybe that will help figure this out. For the record I only see the segfault when running multi-threaded Julia, so I'm going to try a thread sanitizer too.
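Alongside the sanitizers, CPython's debug allocator hooks (enabled with PYTHONMALLOC=debug, no rebuild required) add guard bytes and API-misuse checks that often turn a silent double-free into an immediate, located fatal error. A sketch of launching a child interpreter that way, with a trivial placeholder standing in for the actual reproducer script:

```python
import os
import subprocess
import sys

# PYTHONMALLOC=debug wraps CPython's allocators with debug hooks; replace
# the placeholder -c script below with the real reproducer.
env = dict(os.environ, PYTHONMALLOC="debug")
result = subprocess.run(
    [sys.executable, "-c", "print('ok')"],
    env=env, capture_output=True, text=True,
)
print(result.stdout.strip())  # → ok
```

Among other things, the debug hooks detect writes past the end of a buffer and frees of memory not allocated by the matching allocator, both of which are plausible failure modes when two runtimes share objects.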
Crash report
What happened?
This is a segfault I am seeing on Python 3.12, when trying to use the Python and Julia runtimes simultaneously via the PyJulia package.
It seems like when there is an object that is referenced by both the Julia and Python runtimes, there can be memory access errors. It seems as though Python is trying to free memory which has already been freed in Julia or vice versa.
I am raising the issue here since it only started occurring on Python 3.12 and does not occur on 3.11. The Julia version does not seem to affect this behavior. So I am trying to understand what changes were made to the Python GC that might have triggered this, and whether the GC is perhaps more aggressive in some way.
Here is my current MWE, based on a package I maintain (PySR) that uses Julia as the backend for a Python frontend. This is the smallest MWE I have been able to create thus far.
I also see the issue in my continuous integration tests on Python 3.12, but never before 3.12: https://github.com/MilesCranmer/PySR/pull/450
For example, in one of those segfaults, I see the following backtrace:
I found this quite odd, as it seems as though the Julia and Python garbage collectors are interfering with each other. Here, it seems as though PyObject_Free is trying to free memory that was already freed. Perhaps one of the GCs is trying to free memory accessed by the other runtime. Looking at the backtrace, I suppose this could also be an issue with PyCall.jl (which calls Python functions from Julia), although it hasn't occurred on any previous Python version, so I'm not sure where the issue is coming from. Any help is appreciated; I am happy to provide as much debugging information as I can, as this issue is quite urgent to fix in the ecosystem of Python <-> Julia packages.
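The invariant the binding layer must uphold is that every reference the foreign runtime takes is released exactly once, and via the refcount API (Py_DecRef) rather than PyObject_Free. A small ctypes illustration, simulating the foreign runtime's incref/decref from within Python (this is a sketch of the invariant, not PyCall.jl's actual code):

```python
import ctypes
import sys

# Declare the C-API signatures explicitly; both functions take a PyObject*
# and return nothing.
ctypes.pythonapi.Py_IncRef.argtypes = [ctypes.py_object]
ctypes.pythonapi.Py_IncRef.restype = None
ctypes.pythonapi.Py_DecRef.argtypes = [ctypes.py_object]
ctypes.pythonapi.Py_DecRef.restype = None

obj = object()
base = sys.getrefcount(obj)  # note: reports one extra ref for the call itself

# A foreign runtime holding the object must take a real reference...
ctypes.pythonapi.Py_IncRef(obj)
assert sys.getrefcount(obj) == base + 1

# ...and release it exactly once with Py_DecRef -- never PyObject_Free.
ctypes.pythonapi.Py_DecRef(obj)
assert sys.getrefcount(obj) == base
print("ok")  # → ok
```

A second Py_DecRef (or a raw PyObject_Free) on the same reference is precisely the double-free pattern that would surface later as a crash inside whichever GC touches the memory next.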
CPython versions tested on:
3.12
Operating systems tested on:
Linux, macOS, Windows
Output from running 'python -VV' on the command line:
Linux test performed on:
Python 3.12.1 (main, Dec 30 2023, 22:23:57) [GCC 8.5.0 20210514 (Red Hat 8.5.0-20)]