Closed douglas-raillard-arm closed 2 months ago
I can take a look at this whenever you get a reproducer.
Recent change related to FutureIter_dealloc(): commit 23192aba206fcba093c7a424f822f77721a7cbb8. Python 3.12.4 contains this change.
Do you reproduce this issue with Python 3.12.3?
I'll give it a go on 3.12.3. On 3.12.4, I hit another symptom of the issue this weekend: instead of crashing on the assert, it sometimes throws the code in an infinite loop (or at least it takes more than ~20h to do a collection). It's not clear if the loop is coming from gc.get_referents() returning some sort of infinite stream of value or if the GC itself is just stuck. Here is the gdb backtrace of the process interrupted at a random point:
#0 0x0000aaaaaabced44 in _PyMem_IsPtrFreed (ptr=<optimized out>) at ./Include/internal/pycore_pymem.h:83
#1 0x0000aaaaaabceccc in _PyObject_IsFreed (op=op@entry=<_asyncio.FutureIter at remote 0xffff369d7b10>) at ./Include/object.h:220
#2 0x0000aaaaaac0c428 in visit_decref (op=<_asyncio.FutureIter at remote 0xffff369d7b10>, parent=0xfffff6427710) at Modules/gcmodule.c:463
#3 0x0000fffff5df4d84 in module_traverse (mod=<optimized out>, visit=0xaaaaaac0c410 <visit_decref>, arg=0xfffff6427710) at ./Modules/_asynciomodule.c:3610
#4 0x0000aaaaaabd600c in module_traverse (m=0xfffff6427710, visit=0xaaaaaac0c410 <visit_decref>, arg=0xfffff6427710) at Objects/moduleobject.c:874
#5 0x0000aaaaaac0b9b0 in subtract_refs (containers=containers@entry=0xaaaaab1a31e8 <_PyRuntime+92624>) at Modules/gcmodule.c:491
#6 0x0000aaaaaac0ad68 in deduce_unreachable (base=base@entry=0xaaaaab1a31e8 <_PyRuntime+92624>, unreachable=unreachable@entry=0xffffffffc1e8) at Modules/gcmodule.c:1116
#7 0x0000aaaaaac091a4 in gc_collect_main (tstate=tstate@entry=0xaaaaab200a70 <_PyRuntime+475736>, generation=generation@entry=2, n_collected=n_collected@entry=0xffffffffc270,
n_uncollectable=n_uncollectable@entry=0xffffffffc268, nofail=nofail@entry=0) at Modules/gcmodule.c:1242
#8 0x0000aaaaaac07e94 in gc_collect_with_callback (tstate=0xaaaaab200a70 <_PyRuntime+475736>, generation=2) at Modules/gcmodule.c:1426
#9 0x0000aaaaaac07e18 in gc_collect_generations (tstate=tstate@entry=0xaaaaab200a70 <_PyRuntime+475736>) at Modules/gcmodule.c:1481
#10 0x0000aaaaaac07c94 in _Py_RunGC (tstate=tstate@entry=0xaaaaab200a70 <_PyRuntime+475736>) at Modules/gcmodule.c:2295
#11 0x0000aaaaaac077e0 in _Py_HandlePending (tstate=tstate@entry=0xaaaaab200a70 <_PyRuntime+475736>) at Python/ceval_gil.c:1045
#12 0x0000aaaaaabf1eac in _PyEval_EvalFrameDefault (tstate=tstate@entry=0xaaaaab200a70 <_PyRuntime+475736>, frame=0xfffff74e9930, throwflag=throwflag@entry=0) at Python/ceval.c:834
#13 0x0000aaaaaabf1910 in _PyEval_EvalFrame (tstate=tstate@entry=0xaaaaab200a70 <_PyRuntime+475736>, frame=<optimized out>, throwflag=throwflag@entry=0)
at ./Include/internal/pycore_ceval.h:89
#14 0x0000aaaaaabee828 in _PyEval_Vector (tstate=0xaaaaab200a70 <_PyRuntime+475736>, func=0xffff369baf90, locals=locals@entry=0x0, args=0xffffffffc550, argcount=1, kwnames=0x0)
at Python/ceval.c:1683
Now trying with Python 3.12.3 (tags/v3.12.3:f6650f9ad73, Aug 6 2024, 10:09:17) [GCC 13.3.1 20240614]
and I get the same "stuck in garbage collector" symptom (I stopped and continued 10 times with random delays and I always land on a backtrace starting with Garbage-collecting
):
Traceback (most recent call first):
Garbage-collecting
File "/home/dourai01/cpython/Lib/abc.py", line 123, in __subclasscheck__
return _abc_subclasscheck(cls, subclass)
File "/home/dourai01/cpython/Lib/abc.py", line 123, in __subclasscheck__
return _abc_subclasscheck(cls, subclass)
File "/home/dourai01/cpython/Lib/abc.py", line 123, in __subclasscheck__
return _abc_subclasscheck(cls, subclass)
File "/home/dourai01/cpython/Lib/abc.py", line 123, in __subclasscheck__
return _abc_subclasscheck(cls, subclass)
File "/home/dourai01/cpython/Lib/abc.py", line 119, in __instancecheck__
return _abc_instancecheck(cls, instance)
File "/home/dourai01/lisa/lisa/tests/base.py", line 1168, in predicate
return isinstance(x, TestBundleBase)
File "/home/dourai01/lisa/lisa/tests/base.py", line 1144, in update_refs
if has_class and predicate(obj):
File "/home/dourai01/lisa/lisa/tests/base.py", line 1148, in update_refs
update_refs(sub)
File "/home/dourai01/lisa/lisa/tests/base.py", line 1148, in update_refs
update_refs(sub)
File "/home/dourai01/lisa/lisa/tests/base.py", line 1148, in update_refs
[...]
The C backtrace looks like that:
#0 0x0000aaaaaabce7d8 in _PyMem_IsPtrFreed (ptr=ptr@entry=0xffff368a7490) at ./Include/internal/pycore_pymem.h:83
#1 0x0000aaaaaabce754 in _PyObject_IsFreed (op=op@entry=<_asyncio.FutureIter at remote 0xffff368a7490>) at Objects/object.c:468
#2 0x0000aaaaaac0be50 in visit_decref (op=<_asyncio.FutureIter at remote 0xffff368a7490>, parent=0xfffff64cdcd0) at Modules/gcmodule.c:463
#3 0x0000fffff5d54d84 in module_traverse (mod=<optimized out>, visit=0xaaaaaac0be38 <visit_decref>, arg=0xfffff64cdcd0) at ./Modules/_asynciomodule.c:3596
#4 0x0000aaaaaabd5a8c in module_traverse (m=0xfffff64cdcd0, visit=0xaaaaaac0be38 <visit_decref>, arg=0xfffff64cdcd0) at Objects/moduleobject.c:874
#5 0x0000aaaaaac0b3d8 in subtract_refs (containers=containers@entry=0xaaaaab1a2738 <_PyRuntime+92576>) at Modules/gcmodule.c:491
#6 0x0000aaaaaac0a790 in deduce_unreachable (base=base@entry=0xaaaaab1a2738 <_PyRuntime+92576>, unreachable=unreachable@entry=0xffffffffaeb8) at Modules/gcmodule.c:1116
#7 0x0000aaaaaac08bcc in gc_collect_main (tstate=tstate@entry=0xaaaaab1fffc0 <_PyRuntime+475688>, generation=generation@entry=2, n_collected=n_collected@entry=0xffffffffaf40,
n_uncollectable=n_uncollectable@entry=0xffffffffaf38, nofail=nofail@entry=0) at Modules/gcmodule.c:1242
#8 0x0000aaaaaac078bc in gc_collect_with_callback (tstate=0xaaaaab1fffc0 <_PyRuntime+475688>, generation=2) at Modules/gcmodule.c:1426
#9 0x0000aaaaaac07840 in gc_collect_generations (tstate=tstate@entry=0xaaaaab1fffc0 <_PyRuntime+475688>) at Modules/gcmodule.c:1481
#10 0x0000aaaaaac076bc in _Py_RunGC (tstate=tstate@entry=0xaaaaab1fffc0 <_PyRuntime+475688>) at Modules/gcmodule.c:2295
#11 0x0000aaaaaac07208 in _Py_HandlePending (tstate=tstate@entry=0xaaaaab1fffc0 <_PyRuntime+475688>) at Python/ceval_gil.c:1045
#12 0x0000aaaaaabf1a6c in _PyEval_EvalFrameDefault (tstate=tstate@entry=0xaaaaab1fffc0 <_PyRuntime+475688>, frame=0xfffff7f0de70, throwflag=throwflag@entry=0) at Python/ceval.c:834
#13 0x0000aaaaaabf1378 in _PyEval_EvalFrame (tstate=tstate@entry=0xaaaaab1fffc0 <_PyRuntime+475688>, frame=<optimized out>, throwflag=throwflag@entry=0)
at ./Include/internal/pycore_ceval.h:89
#14 0x0000aaaaaabee290 in _PyEval_Vector (tstate=0xaaaaab1fffc0 <_PyRuntime+475688>, func=func@entry=0xfffff7839c10, locals=locals@entry=0x0, args=args@entry=0xffffffffb2e8,
argcount=argcount@entry=2, kwnames=kwnames@entry=0x0) at Python/ceval.c:1683
#15 0x0000aaaaaac4663c in _PyFunction_Vectorcall (func=func@entry=<function at remote 0xfffff7839c10>, stack=stack@entry=0xffffffffb2e8, nargsf=nargsf@entry=2, kwnames=kwnames@entry=0x0)
at Objects/call.c:419
#16 0x0000aaaaaac8ce2c in _PyObject_VectorcallTstate (tstate=tstate@entry=0xaaaaab1fffc0 <_PyRuntime+475688>, callable=callable@entry=<function at remote 0xfffff7839c10>,
args=args@entry=0xffffffffb2e8, nargsf=nargsf@entry=2, kwnames=kwnames@entry=0x0) at ./Include/internal/pycore_call.h:92
#17 0x0000aaaaaac8c134 in method_vectorcall (method=<optimized out>, args=0xffffffffb2f0, nargsf=9223372036854775809, kwnames=0x0) at Objects/classobject.c:61
#18 0x0000aaaaaabdcea8 in _PyObject_VectorcallTstate (tstate=0xaaaaab1fffc0 <_PyRuntime+475688>, callable=callable@entry=<method at remote 0xffffcdcfdc10>, args=args@entry=0xffffffffb2f0,
nargsf=nargsf@entry=9223372036854775809, kwnames=kwnames@entry=0x0) at ./Include/internal/pycore_call.h:92
#19 0x0000aaaaaac4c2b4 in PyObject_CallOneArg (func=func@entry=<method at remote 0xffffcdcfdc10>, arg=arg@entry=<type at remote 0xaaaaab0ea7d0>) at Objects/call.c:401
#20 0x0000aaaaaac7fd1c in object_issubclass (tstate=0xaaaaab1fffc0 <_PyRuntime+475688>, derived=derived@entry=<type at remote 0xaaaaab0ea7d0>,
cls=cls@entry=<TestBundleMeta(__module__='lisa.tests.base', __doc__='Dummy class used as a base class for all tests.\n\n.. warning:: Arbitrary code can be executed while loading an instance from a YAML or Pickle file. To include untrusted data in YAML, use the !untrusted tag along with a string\n\n.. note:: As a subclass of :class:`lisa.tests.base.TestBundleBase`, this class is considered as "application" and its API is therefore more subject to change than other parts of :mod:`lisa`.', check_from_target=<classmethod at remote 0xffff36a61b80>, FTRACE_CONF=<FtraceConf(_key_desc_path=[], _src_prio=['TestBundle(required)', 'TestBundle'], _src_override={}, _key_map={'buffer-size': {'TestBundle': 10240}, 'events': {'TestBundle': <AndTraceEventChecker(check=True, checkers=[], op_str='and', prefix_str='') at remote 0xffff368ab440>, 'TestBundle(required)': <AndTraceEventChecker(check=True, checkers=[], op_str='and', prefix_str='') at remote 0xffff368ab990>}, 'events-namespaces': {'TestBundle': [], 'TestBundle(required)': []}, 'sav...(truncated)) at Objects/abstract.c:2736
#21 0x0000aaaaaac7fae0 in PyObject_IsSubclass (derived=derived@entry=<type at remote 0xaaaaab0ea7d0>,
cls=cls@entry=<TestBundleMeta(__module__='lisa.tests.base', __doc__='Dummy class used as a base class for all tests.\n\n.. warning:: Arbitrary code can be executed while loading an instance from a YAML or Pickle file. To include untrusted data in YAML, use the !untrusted tag along with a string\n\n.. note:: As a subclass of :class:`lisa.tests.base.TestBundleBase`, this class is considered as "application" and its API is therefore more subject to change than other parts of :mod:`lisa`.', check_from_target=<classmethod at remote 0xffff36a61b80>, FTRACE_CONF=<FtraceConf(_key_desc_path=[], _src_prio=['TestBundle(required)', 'TestBundle'], _src_override={}, _key_map={'buffer-size': {'TestBundle': 10240}, 'events': {'TestBundle': <AndTraceEventChecker(check=True, checkers=[], op_str='and', prefix_str='') at remote 0xffff368ab440>, 'TestBundle(required)': <AndTraceEventChecker(check=True, checkers=[], op_str='and', prefix_str='') at remote 0xffff368ab990>}, 'events-namespaces': {'TestBundle': [], 'TestBundle(required)': []}, 'sav...(truncated)) at Objects/abstract.c:2758
#22 0x0000aaaaaacbd030 in _abc__abc_subclasscheck_impl (module=module@entry=<module at remote 0xfffff7849190>, self=<optimized out>, subclass=<type at remote 0xaaaaab0ea7d0>)
at ./Modules/_abc.c:779
#23 0x0000aaaaaacbcbf0 in _abc__abc_subclasscheck (module=<module at remote 0xfffff7849190>, args=args@entry=0xfffff7f0de60, nargs=nargs@entry=2) at ./Modules/clinic/_abc.c.h:141
This is not a standalone reproducer as I'm still working on that
Obviously, the debug will be way easier with a reproducer :-)
I've spent some time trying to get a minimal reproducer and could not get it to work. It's possible this is related to running under pytest for some reason. As it stands, the only reproducer I can offer is:
git clone https://gitlab.arm.com/tooling/lisa
cd lisa
# A few packages need to be installed, like python3 or kernelshark. Python
# modules will be installed in a venv at the next step, without touching
# any system-wide install location.
./install_base.sh --install-all
export LISA_PYTHON=
source init_env
2. Run the offending test on an **arm64** platform: `gdb $LISA_PYTHON -ex 'run -m pytest -s tests/test_test_bundle.py'`
EDIT: remove `-ex quit`
A bit of progress with gdb on v3.12.3 : the infinite loop is here:
((futureiterobject*) current)->future
is set to current
, so it keeps going around without making any progress.
EDIT: It looks like module_free_freelists would cause a refcount issue if current->future == current
since it calls PyObject_GC_Del(current);
and then will try with current->future
at the next iteration. That would try to garbage collect the same object multiple times.
I added some asserts on v3.12.3 tag:
diff --git a/Modules/_asynciomodule.c b/Modules/_asynciomodule.c
index a465090bfaa..9b47661b1fc 100644
--- a/Modules/_asynciomodule.c
+++ b/Modules/_asynciomodule.c
@@ -1597,6 +1597,7 @@ FutureIter_dealloc(futureiterobject *it)
if (state->fi_freelist_len < FI_FREELIST_MAXLEN) {
state->fi_freelist_len++;
it->future = (FutureObj*) state->fi_freelist;
+ assert((void*)it->future != (void*)it);
state->fi_freelist = it;
}
else {
@@ -1818,6 +1819,7 @@ future_new_iter(PyObject *fut)
}
it->future = (FutureObj*)Py_NewRef(fut);
+ assert((void*)it->future != (void*)it);
PyObject_GC_Track(it);
return (PyObject*)it;
}
@@ -3595,6 +3597,7 @@ module_traverse(PyObject *mod, visitproc visit, void *arg)
PyObject *current = next;
Py_VISIT(current);
next = (PyObject*) ((futureiterobject*) current)->future;
+ assert((void*)next != (void*)current);
}
return 0;
}
And this one failed to pass:
static void
FutureIter_dealloc(futureiterobject *it)
{
PyTypeObject *tp = Py_TYPE(it);
asyncio_state *state = get_asyncio_state_by_def((PyObject *)it);
PyObject_GC_UnTrack(it);
tp->tp_clear((PyObject *)it);
if (state->fi_freelist_len < FI_FREELIST_MAXLEN) {
state->fi_freelist_len++;
it->future = (FutureObj*) state->fi_freelist;
assert((void*)it->future != (void*)it); // <------- ADDED THIS ASSERT
state->fi_freelist = it;
}
else {
PyObject_GC_Del(it);
Py_DECREF(tp);
}
}
At a very speculative glance, I would guess that the infinite loop problem originates from visited
and obj
, or possibly the context returned by warnings.catch_warnings
, being tracked by the GC. It's difficult to tell without a full reproducer, though.
@ZeroIntensity unfortunately I could not create a reproducer beyond the existing test suite, but I'm happy to apply a patch on cpython and run with that to help diagnose further, provide coredumps etc.
What is quite strange that this only triggers an issue on arm64 where there seemingly is nothing arch-specific in that code I could see. That test has been running on x86-64 since ~2018 without problems (obviously not on Python 3.12 all this time but still).
If this only happens on ARM, then I think this might be an actual bug with the garbage collector. Since we don't have a standalone reproducer, you'll have to help me here a bit:
If this only happens on ARM, then I think this might be an actual bug with the garbage collector.
That was my initial guess, but then the issue seemed to be a logic issue in Modules/_asynciomodule.c . It's hard to know for sure though as it could just be an invariant the gc is supposed to provide that is broken.
Is there a repository where you're running this? If we could run the test suite and reproduce the bug locally, that's still better than no reproducer.
Yes, see https://github.com/python/cpython/issues/122695#issuecomment-2270937727 . You may be able to skip the install_base.sh
part and either just source init_env
or create your own venv and pip install -r devmode_requirements.txt
.
The issue manifests as either the test hanging or sometimes (rarely) triggering a negative refcount check assert.
What line specifically causes this? (as in, if you comment things out, commenting out which line(s) prevents the segfault/infinite loop?)
The gc hangs while executing that function:
https://gitlab.arm.com/tooling/lisa/-/blob/main/lisa/tests/base.py?ref_type=heads#L1128
Sometimes it hangs inside predicate(obj)
that is defined here as can be seen on this Python backtrace: https://github.com/python/cpython/issues/122695#issuecomment-2270831216
But the exact location is kind of variable and probably depends on when the gc decides to run. Always inside update_ref()
though.
Does this happen on other CPython versions (main branch, 3.13, 3.11, etc)
So far I've tested 3.12.3 and 3.12.4 and it fails on both. I don't think I can easily test on 3.13 as some of our dependencies have extension modules that may or may not be compatible with 3.13, but I may be able to hack through the code to still have the offending code run with no dep.
Apologies, my connection is slow at the moment so I've been trying to install it for the past hour. I'll see if I can reproduce it locally whenever it installs. Does lisa
contain any C extensions? (If so, this might be user-error.)
Does lisa contain any C extensions? (If so, this might be user-error.)
not in itself, only via dependencies (e.g. pandas)
On another note, I've tried on v3.12.4 tag with the following patch:
diff --git a/Modules/_asynciomodule.c b/Modules/_asynciomodule.c
index 05e79915ba7..3228aed87cb 100644
--- a/Modules/_asynciomodule.c
+++ b/Modules/_asynciomodule.c
@@ -17,7 +17,7 @@ module _asyncio
/*[clinic end generated code: output=da39a3ee5e6b4b0d input=8fd17862aa989c69]*/
-#define FI_FREELIST_MAXLEN 255
+#define FI_FREELIST_MAXLEN 0
typedef struct futureiterobject futureiterobject;
and that "fixes" the issue. So it's quite clear that this free list either has a problem on its own or something its implementation depends on (e.g. the gc) has an unexpected behavior.
Also, I think I just reproduced on Python 3.12.4 (main, Jun 8 2024, 18:29:57) [GCC 9.4.0]
on x86-64, so it may not be arch specific after all.
Another thing I've observed is that Py_REFCNT(current) == 0
in an infinite loop case where current->future == current
:
https://github.com/python/cpython/blob/8e8a4baf652f6e1cee7acde9d78c4b6154539748/Modules/_asynciomodule.c#L3609
Is a m_traverse
implementation supposed to call Py_VISIT()
on an object with a refcount == 0 ?
That's odd, if something's reference count is zero, it's in freed memory.
Yeah, I had a look at the implementation of gc_get_referents()
and it just stores everything visited in a list, no questions asked on refcounts. The tp_traverse()
implementation for modules also just delegates to m_traverse()
with just an extra condition on the module object itself to avoid an issue at module init, irrelevant here I think.
I'll try to re-run with extra asserts regarding refcount.
stores everything visited in a list, no questions asked on refcounts
Storing it in a list increments the reference count.
That makes sense, but then it would free it when the list gets destroyed, leading to destroying the object again and again every time it lands in such list isn't it ? Also the doc states:
When implementing tp_traverse, only the members that the instance owns (by having strong references to them) must be visited
If the refcount is 0, no one owns any strong reference to it, so this might trigger weird things. But I'm not super familiar with the details of Python GC.
Yup, but it's very difficult to find the cause without a standalone reproducer :(
Yes, that's very annoying. I tried hacking the code to remove the dependencies that were unused in the test path, and the issue disappears ...
It also looks like v3.13 has changed the freelist code to use _Py_FREELIST_*
macros instead in c908d1f87d287a4b3ec58c85b692a7eb617fa6ea. The change seems to be motivated by supporting free-threaded builds.
I wonder if the problem is with FutureObj_clear
on 3.12. Does this branch fix it?
Let me try. I've ran on 3.12.4 with that patch:
diff --git a/Modules/_asynciomodule.c b/Modules/_asynciomodule.c
index 05e79915ba7..964f291eab5 100644
--- a/Modules/_asynciomodule.c
+++ b/Modules/_asynciomodule.c
@@ -1612,6 +1612,7 @@ FutureIter_dealloc(futureiterobject *it)
state->fi_freelist_len++;
it->future = (FutureObj*) state->fi_freelist;
state->fi_freelist = it;
+ assert(Py_REFCNT(state->fi_freelist) > 0);
}
else {
PyObject_GC_Del(it);
@@ -1821,6 +1822,7 @@ future_new_iter(PyObject *fut)
state->fi_freelist_len--;
it = state->fi_freelist;
state->fi_freelist = (futureiterobject*) it->future;
+ assert(Py_REFCNT(state->fi_freelist) > 0);
it->future = NULL;
_Py_NewReference((PyObject*) it);
}
and it triggered the assert line 1615
So that means that FutureIter_dealloc()
is called on a dead object, or that code somewhat destroys it:
PyObject_GC_UnTrack(it);
tp->tp_clear((PyObject *)it);
EDIT: it is expected that tp_dealloc is called with refcnt == 0: https://docs.python.org/3/c-api/typeobj.html#c.PyTypeObject.tp_dealloc
it is expected that tp_dealloc is called with refcnt == 0: https://docs.python.org/3/c-api/typeobj.html#c.PyTypeObject.tp_dealloc
Oh right, I was being dumb. Does my patch work?
So if I understand correctly:
tp_traverse()
gc.get_referents()
will add them to the list. Once that list is destroyed, their refcount will drop back to 0 again. Since this can happen an arbitrary number of time, I suppose they will be finalized multiple times ?futureiter->future
pointer to point to the previous item. The module_traverse()
function explicitly traverses the list. At the same time, FutureIter_traverse()
also visits futureiter->future
. Is that an issue ?Oh right, I was being dumb. Does my patch work?
still building it
3. So that means that
gc.get_referents()
will add them to the list.
This might be the problem, being finalized twice causes a double free. Though, it's odd that you can't reproduce it.
I think I found the issue:
module_traverse()
visits the freelistgc.get_referents()
adds the references to a list. Since those objects had refcount == 0, the refcount is now 1.FutureIter_dealloc()
on each of them. FutureIter_dealloc()
will add the destroyed futureiter add the front of the list, and store the previous freelist item in its ->future
attribute.gc.get_referents()
path. The objects for which that happens are the objects stored in that freelist. In some cases, the object that will be re-destroyed happens to also be at the top of the freelist. In that case, FutureIter_dealloc()
will put it again at the front of the list. By doing so, it creates the infinite cycle where it->future == future
.Does that make sense ? I'm now running your patch, let's see ...
That makes sense, I'm just skeptical considering the segfault occurs only under very specific conditions -- I would think it would have been more reproducible if that's the problem.
The patch does not seem to work, it gets stuck in module_traverse() the same way.
module_traverse() visits the freelist
That's wrong. A freelist only contains raw uninitialized bytes, not Python objects.
I would think it would have been more reproducible if that's the problem.
Maybe this only triggers when:
EDIT: FutureIter is simply created by iter(asyncio.Future())
I'll defer to @vstinner, I don't know nearly enough about FutureIter
(or the freelist, for that matter) to answer that.
@ZeroIntensity at last, here is your reproducer:
import asyncio
import _asyncio
import gc
async def main():
it = iter(asyncio.Future())
del it
xs = gc.get_referents(_asyncio)
asyncio.run(main())
asyncio.run() and the coroutine can be removed at the cost of getting a warning when creating asyncio.Future()
EDIT: I confirm that this reproduces the issue on 3.12 on any arch. There is no issue on 3.8, 3.9, 3.10 nor 3.11. From staring at the sources, it looks like 3.13 should not be affected since c908d1f87d287a4b3ec58c85b692a7eb617fa6ea removed the Py_VISIT()
on the freelist in module_traverse()
.
Thanks! It is odd that this only happens on 3.12, I'll try to write a patch.
@Eclips4, this issue needs the topic-asyncio
and 3.12
label.
For me, the root issue is that _asyncio
module_traverse()
function visits the free list. It doesn't make sense to me and it should be removed.
cc @kumaraditya303
Would it make sense to add an asset in debug builds in Py_VISIT() to check for non-zero refcnt ? If the cost is too high for debug builds, maybe that could be done at least in gc.get_referents() to make sure we don't return anything half dead
For me, the root issue is that
_asyncio
module_traverse()
function visits the free list
Agreeing, but wouldn't that be a breaking change for a patch release? My solution as of now is to store an extra strong reference to prevent the double-free (it's not a leak, since it still gets deallocated by PyObject_GC_Del
).
Agreeing, but wouldn't that be a breaking change for a patch release?
It's not a breaking change but a bugfix. As you can see, the current code is wrong and causes undefined behavior when gc.get_referrers()/get_referents() is used.
I've created #122834 as a fix, but Victor's way could be better. @douglas-raillard-arm does lisa
rely on those FutureIter
objects being returned by get_referents
?
@ZeroIntensity no, it's completely accidental. I think the link between my root object and async stuff is another object that uses some asyncio closures, that themselves probably somehow refer to the asyncio module and then we land on that free list.
Also, I'm going to replace that code that tries to fix up an object by just building it the right way directly. There is no way to pass any state to the yaml deserializer that builds it but I can probably just use a thread local or context variable to keep that state in my module.
Thanks @douglas-raillard-arm for the bug report, thanks @ZeroIntensity for the analysis and the fix! I close the issue. The 3.12 backport will land soon.
Crash report
What happened?
This is not a standalone reproducer as I'm still working on that, but the code triggering it is:
The issue was triggered on Gentoo on an arm64 machine. For some reason, this works fine on x86-64. I then managed to trigger it again under a different Python 3.12.4 interpreter compiled from the git repo with debug symbols to get a gdb backtrace:
The Python backtrace at this point was deeply recursed in
update_refs()
.CPython versions tested on:
3.12
Operating systems tested on:
Linux
Output from running 'python -VV' on the command line:
Python 3.12.4 (tags/v3.12.4:8e8a4baf652, Aug 2 2024, 18:22:31) [GCC 13.3.1 20240614]
Linked PRs