python / cpython

The Python programming language
https://www.python.org
Other
63.22k stars 30.28k forks source link

asyncio: FutureIter_dealloc() crashes with negative refcount #122695

Closed douglas-raillard-arm closed 2 months ago

douglas-raillard-arm commented 2 months ago

Crash report

What happened?

This is not a standalone reproducer as I'm still working on that, but the code triggering it is:

    @classmethod
    def _get_referred_objs(cls, obj, predicate=lambda x: True):
        visited = set()
        objs = []

        def update_refs(obj):
            obj_id = id(obj)
            # Avoid cycles. Use the id() of the objects directly since the
            # inclusion check is orders of magnitude faster than checking for
            # inclusing on the object directly. It also handles well non hashable
            # objects and broken __eq__ implementations.
            if obj_id in visited:
                return
            else:
                visited.add(obj_id)
                # Filter-out weird objects that end up in the list and that can
                # trigger a coredump on the interpreter
                with warnings.catch_warnings():
                    warnings.simplefilter("ignore")
                    has_class = hasattr(obj, '__class__')

                if has_class and predicate(obj):
                    objs.append(obj)

                for sub in gc.get_referents(obj):
                    update_refs(sub)

        update_refs(obj)
        return objs

The issue was triggered on Gentoo on an arm64 machine. For some reason, this works fine on x86-64. I then managed to trigger it again under a different Python 3.12.4 interpreter compiled from the git repo with debug symbols to get a gdb backtrace:

#6  0x0000aaaaaadfd4a0 in _PyObject_AssertFailed (obj=obj@entry=<_asyncio.FutureIter at remote 0xfffff210b3d0>, expr=expr@entry=0x0, 
    msg=msg@entry=0xaaaaaaea5c50 "object has negative ref count", file=<optimized out>, line=<optimized out>, 
    function=function@entry=0xaaaaaaed4928 <__func__.44.lto_priv.1> "_Py_NegativeRefcount") at Objects/object.c:2603
#7  0x0000aaaaaadfd550 in _Py_NegativeRefcount (filename=<optimized out>, lineno=<optimized out>, op=op@entry=<_asyncio.FutureIter at remote 0xfffff210b3d0>) at Objects/object.c:209
#8  0x0000fffff5df4bb0 in Py_DECREF (filename=filename@entry=0xfffff5dfba58 "./Modules/_asynciomodule.c", lineno=lineno@entry=1764, op=<_asyncio.FutureIter at remote 0xfffff210b3d0>)
    at ./Include/object.h:682
#9  0x0000fffff5df59a0 in FutureIter_clear (it=<optimized out>) at ./Modules/_asynciomodule.c:1764
#10 0x0000fffff5dfb57c in FutureIter_dealloc (it=0xfffff2109f50) at ./Modules/_asynciomodule.c:1601
#11 0x0000aaaaaabc9658 in _Py_Dealloc (op=op@entry=<_asyncio.FutureIter at remote 0xfffff2109f50>) at Objects/object.c:2625
#12 0x0000aaaaaabc9acc in Py_DECREF (filename=filename@entry=0xaaaaaae65f90 "./Include/object.h", lineno=lineno@entry=798, op=<_asyncio.FutureIter at remote 0xfffff2109f50>)
    at ./Include/object.h:690
#13 0x0000aaaaaabc9a68 in Py_XDECREF (op=<optimized out>) at ./Include/object.h:798
#14 0x0000aaaaaabc9780 in list_dealloc (op=0xffffdd2bc5a0) at Objects/listobject.c:356
#15 0x0000aaaaaabc9658 in _Py_Dealloc (

The Python backtrace at this point was deeply recursed in update_refs().

CPython versions tested on:

3.12

Operating systems tested on:

Linux

Output from running 'python -VV' on the command line:

Python 3.12.4 (tags/v3.12.4:8e8a4baf652, Aug 2 2024, 18:22:31) [GCC 13.3.1 20240614]

Linked PRs

ZeroIntensity commented 2 months ago

I can take a look at this whenever you get a reproducer.

vstinner commented 2 months ago

Recent change related to FutureIter_dealloc(): commit 23192aba206fcba093c7a424f822f77721a7cbb8. Python 3.12.4 contains this change.

Do you reproduce this issue with Python 3.12.3?

douglas-raillard-arm commented 2 months ago

I'll give it a go on 3.12.3. On 3.12.4, I hit another symptom of the issue this weekend: instead of crashing on the assert, it sometimes throws the code in an infinite loop (or at least it takes more than ~20h to do a collection). It's not clear if the loop is coming from gc.get_referents() returning some sort of infinite stream of value or if the GC itself is just stuck. Here is the gdb backtrace of the process interrupted at a random point:

#0  0x0000aaaaaabced44 in _PyMem_IsPtrFreed (ptr=<optimized out>) at ./Include/internal/pycore_pymem.h:83
#1  0x0000aaaaaabceccc in _PyObject_IsFreed (op=op@entry=<_asyncio.FutureIter at remote 0xffff369d7b10>) at ./Include/object.h:220
#2  0x0000aaaaaac0c428 in visit_decref (op=<_asyncio.FutureIter at remote 0xffff369d7b10>, parent=0xfffff6427710) at Modules/gcmodule.c:463
#3  0x0000fffff5df4d84 in module_traverse (mod=<optimized out>, visit=0xaaaaaac0c410 <visit_decref>, arg=0xfffff6427710) at ./Modules/_asynciomodule.c:3610
#4  0x0000aaaaaabd600c in module_traverse (m=0xfffff6427710, visit=0xaaaaaac0c410 <visit_decref>, arg=0xfffff6427710) at Objects/moduleobject.c:874
#5  0x0000aaaaaac0b9b0 in subtract_refs (containers=containers@entry=0xaaaaab1a31e8 <_PyRuntime+92624>) at Modules/gcmodule.c:491
#6  0x0000aaaaaac0ad68 in deduce_unreachable (base=base@entry=0xaaaaab1a31e8 <_PyRuntime+92624>, unreachable=unreachable@entry=0xffffffffc1e8) at Modules/gcmodule.c:1116
#7  0x0000aaaaaac091a4 in gc_collect_main (tstate=tstate@entry=0xaaaaab200a70 <_PyRuntime+475736>, generation=generation@entry=2, n_collected=n_collected@entry=0xffffffffc270, 
    n_uncollectable=n_uncollectable@entry=0xffffffffc268, nofail=nofail@entry=0) at Modules/gcmodule.c:1242
#8  0x0000aaaaaac07e94 in gc_collect_with_callback (tstate=0xaaaaab200a70 <_PyRuntime+475736>, generation=2) at Modules/gcmodule.c:1426
#9  0x0000aaaaaac07e18 in gc_collect_generations (tstate=tstate@entry=0xaaaaab200a70 <_PyRuntime+475736>) at Modules/gcmodule.c:1481
#10 0x0000aaaaaac07c94 in _Py_RunGC (tstate=tstate@entry=0xaaaaab200a70 <_PyRuntime+475736>) at Modules/gcmodule.c:2295
#11 0x0000aaaaaac077e0 in _Py_HandlePending (tstate=tstate@entry=0xaaaaab200a70 <_PyRuntime+475736>) at Python/ceval_gil.c:1045
#12 0x0000aaaaaabf1eac in _PyEval_EvalFrameDefault (tstate=tstate@entry=0xaaaaab200a70 <_PyRuntime+475736>, frame=0xfffff74e9930, throwflag=throwflag@entry=0) at Python/ceval.c:834
#13 0x0000aaaaaabf1910 in _PyEval_EvalFrame (tstate=tstate@entry=0xaaaaab200a70 <_PyRuntime+475736>, frame=<optimized out>, throwflag=throwflag@entry=0)
    at ./Include/internal/pycore_ceval.h:89
#14 0x0000aaaaaabee828 in _PyEval_Vector (tstate=0xaaaaab200a70 <_PyRuntime+475736>, func=0xffff369baf90, locals=locals@entry=0x0, args=0xffffffffc550, argcount=1, kwnames=0x0)
    at Python/ceval.c:1683
douglas-raillard-arm commented 2 months ago

Now trying with Python 3.12.3 (tags/v3.12.3:f6650f9ad73, Aug 6 2024, 10:09:17) [GCC 13.3.1 20240614] and I get the same "stuck in garbage collector" symptom (I stopped and continued 10 times with random delays and I always land on a backtrace starting with Garbage-collecting):

Traceback (most recent call first):
  Garbage-collecting
  File "/home/dourai01/cpython/Lib/abc.py", line 123, in __subclasscheck__
    return _abc_subclasscheck(cls, subclass)
  File "/home/dourai01/cpython/Lib/abc.py", line 123, in __subclasscheck__
    return _abc_subclasscheck(cls, subclass)
  File "/home/dourai01/cpython/Lib/abc.py", line 123, in __subclasscheck__
    return _abc_subclasscheck(cls, subclass)
  File "/home/dourai01/cpython/Lib/abc.py", line 123, in __subclasscheck__
    return _abc_subclasscheck(cls, subclass)
  File "/home/dourai01/cpython/Lib/abc.py", line 119, in __instancecheck__
    return _abc_instancecheck(cls, instance)
  File "/home/dourai01/lisa/lisa/tests/base.py", line 1168, in predicate
    return isinstance(x, TestBundleBase)
  File "/home/dourai01/lisa/lisa/tests/base.py", line 1144, in update_refs
    if has_class and predicate(obj):
  File "/home/dourai01/lisa/lisa/tests/base.py", line 1148, in update_refs
    update_refs(sub)
  File "/home/dourai01/lisa/lisa/tests/base.py", line 1148, in update_refs
    update_refs(sub)
  File "/home/dourai01/lisa/lisa/tests/base.py", line 1148, in update_refs

[...]

The C backtrace looks like that:

#0  0x0000aaaaaabce7d8 in _PyMem_IsPtrFreed (ptr=ptr@entry=0xffff368a7490) at ./Include/internal/pycore_pymem.h:83
#1  0x0000aaaaaabce754 in _PyObject_IsFreed (op=op@entry=<_asyncio.FutureIter at remote 0xffff368a7490>) at Objects/object.c:468
#2  0x0000aaaaaac0be50 in visit_decref (op=<_asyncio.FutureIter at remote 0xffff368a7490>, parent=0xfffff64cdcd0) at Modules/gcmodule.c:463
#3  0x0000fffff5d54d84 in module_traverse (mod=<optimized out>, visit=0xaaaaaac0be38 <visit_decref>, arg=0xfffff64cdcd0) at ./Modules/_asynciomodule.c:3596
#4  0x0000aaaaaabd5a8c in module_traverse (m=0xfffff64cdcd0, visit=0xaaaaaac0be38 <visit_decref>, arg=0xfffff64cdcd0) at Objects/moduleobject.c:874
#5  0x0000aaaaaac0b3d8 in subtract_refs (containers=containers@entry=0xaaaaab1a2738 <_PyRuntime+92576>) at Modules/gcmodule.c:491
#6  0x0000aaaaaac0a790 in deduce_unreachable (base=base@entry=0xaaaaab1a2738 <_PyRuntime+92576>, unreachable=unreachable@entry=0xffffffffaeb8) at Modules/gcmodule.c:1116
#7  0x0000aaaaaac08bcc in gc_collect_main (tstate=tstate@entry=0xaaaaab1fffc0 <_PyRuntime+475688>, generation=generation@entry=2, n_collected=n_collected@entry=0xffffffffaf40, 
    n_uncollectable=n_uncollectable@entry=0xffffffffaf38, nofail=nofail@entry=0) at Modules/gcmodule.c:1242
#8  0x0000aaaaaac078bc in gc_collect_with_callback (tstate=0xaaaaab1fffc0 <_PyRuntime+475688>, generation=2) at Modules/gcmodule.c:1426
#9  0x0000aaaaaac07840 in gc_collect_generations (tstate=tstate@entry=0xaaaaab1fffc0 <_PyRuntime+475688>) at Modules/gcmodule.c:1481
#10 0x0000aaaaaac076bc in _Py_RunGC (tstate=tstate@entry=0xaaaaab1fffc0 <_PyRuntime+475688>) at Modules/gcmodule.c:2295
#11 0x0000aaaaaac07208 in _Py_HandlePending (tstate=tstate@entry=0xaaaaab1fffc0 <_PyRuntime+475688>) at Python/ceval_gil.c:1045
#12 0x0000aaaaaabf1a6c in _PyEval_EvalFrameDefault (tstate=tstate@entry=0xaaaaab1fffc0 <_PyRuntime+475688>, frame=0xfffff7f0de70, throwflag=throwflag@entry=0) at Python/ceval.c:834
#13 0x0000aaaaaabf1378 in _PyEval_EvalFrame (tstate=tstate@entry=0xaaaaab1fffc0 <_PyRuntime+475688>, frame=<optimized out>, throwflag=throwflag@entry=0)
    at ./Include/internal/pycore_ceval.h:89
#14 0x0000aaaaaabee290 in _PyEval_Vector (tstate=0xaaaaab1fffc0 <_PyRuntime+475688>, func=func@entry=0xfffff7839c10, locals=locals@entry=0x0, args=args@entry=0xffffffffb2e8, 
    argcount=argcount@entry=2, kwnames=kwnames@entry=0x0) at Python/ceval.c:1683
#15 0x0000aaaaaac4663c in _PyFunction_Vectorcall (func=func@entry=<function at remote 0xfffff7839c10>, stack=stack@entry=0xffffffffb2e8, nargsf=nargsf@entry=2, kwnames=kwnames@entry=0x0)
    at Objects/call.c:419
#16 0x0000aaaaaac8ce2c in _PyObject_VectorcallTstate (tstate=tstate@entry=0xaaaaab1fffc0 <_PyRuntime+475688>, callable=callable@entry=<function at remote 0xfffff7839c10>, 
    args=args@entry=0xffffffffb2e8, nargsf=nargsf@entry=2, kwnames=kwnames@entry=0x0) at ./Include/internal/pycore_call.h:92
#17 0x0000aaaaaac8c134 in method_vectorcall (method=<optimized out>, args=0xffffffffb2f0, nargsf=9223372036854775809, kwnames=0x0) at Objects/classobject.c:61
#18 0x0000aaaaaabdcea8 in _PyObject_VectorcallTstate (tstate=0xaaaaab1fffc0 <_PyRuntime+475688>, callable=callable@entry=<method at remote 0xffffcdcfdc10>, args=args@entry=0xffffffffb2f0, 
    nargsf=nargsf@entry=9223372036854775809, kwnames=kwnames@entry=0x0) at ./Include/internal/pycore_call.h:92
#19 0x0000aaaaaac4c2b4 in PyObject_CallOneArg (func=func@entry=<method at remote 0xffffcdcfdc10>, arg=arg@entry=<type at remote 0xaaaaab0ea7d0>) at Objects/call.c:401
#20 0x0000aaaaaac7fd1c in object_issubclass (tstate=0xaaaaab1fffc0 <_PyRuntime+475688>, derived=derived@entry=<type at remote 0xaaaaab0ea7d0>, 
    cls=cls@entry=<TestBundleMeta(__module__='lisa.tests.base', __doc__='Dummy class used as a base class for all tests.\n\n.. warning:: Arbitrary code can be executed while loading an instance from a YAML or Pickle file. To include untrusted data in YAML, use the !untrusted tag along with a string\n\n.. note:: As a subclass of :class:`lisa.tests.base.TestBundleBase`, this class is considered as "application" and its API is therefore more subject to change than other parts of :mod:`lisa`.', check_from_target=<classmethod at remote 0xffff36a61b80>, FTRACE_CONF=<FtraceConf(_key_desc_path=[], _src_prio=['TestBundle(required)', 'TestBundle'], _src_override={}, _key_map={'buffer-size': {'TestBundle': 10240}, 'events': {'TestBundle': <AndTraceEventChecker(check=True, checkers=[], op_str='and', prefix_str='') at remote 0xffff368ab440>, 'TestBundle(required)': <AndTraceEventChecker(check=True, checkers=[], op_str='and', prefix_str='') at remote 0xffff368ab990>}, 'events-namespaces': {'TestBundle': [], 'TestBundle(required)': []}, 'sav...(truncated)) at Objects/abstract.c:2736
#21 0x0000aaaaaac7fae0 in PyObject_IsSubclass (derived=derived@entry=<type at remote 0xaaaaab0ea7d0>, 
    cls=cls@entry=<TestBundleMeta(__module__='lisa.tests.base', __doc__='Dummy class used as a base class for all tests.\n\n.. warning:: Arbitrary code can be executed while loading an instance from a YAML or Pickle file. To include untrusted data in YAML, use the !untrusted tag along with a string\n\n.. note:: As a subclass of :class:`lisa.tests.base.TestBundleBase`, this class is considered as "application" and its API is therefore more subject to change than other parts of :mod:`lisa`.', check_from_target=<classmethod at remote 0xffff36a61b80>, FTRACE_CONF=<FtraceConf(_key_desc_path=[], _src_prio=['TestBundle(required)', 'TestBundle'], _src_override={}, _key_map={'buffer-size': {'TestBundle': 10240}, 'events': {'TestBundle': <AndTraceEventChecker(check=True, checkers=[], op_str='and', prefix_str='') at remote 0xffff368ab440>, 'TestBundle(required)': <AndTraceEventChecker(check=True, checkers=[], op_str='and', prefix_str='') at remote 0xffff368ab990>}, 'events-namespaces': {'TestBundle': [], 'TestBundle(required)': []}, 'sav...(truncated)) at Objects/abstract.c:2758
#22 0x0000aaaaaacbd030 in _abc__abc_subclasscheck_impl (module=module@entry=<module at remote 0xfffff7849190>, self=<optimized out>, subclass=<type at remote 0xaaaaab0ea7d0>)
    at ./Modules/_abc.c:779
#23 0x0000aaaaaacbcbf0 in _abc__abc_subclasscheck (module=<module at remote 0xfffff7849190>, args=args@entry=0xfffff7f0de60, nargs=nargs@entry=2) at ./Modules/clinic/_abc.c.h:141
vstinner commented 2 months ago

This is not a standalone reproducer as I'm still working on that

Obviously, the debug will be way easier with a reproducer :-)

douglas-raillard-arm commented 2 months ago

I've spent some time trying to get a minimal reproducer and could not get it to work. It's possible this is related to running under pytest for some reason. As it stands, the only reproducer I can offer is:

  1. install lisa from the git repo: https://gitlab.arm.com/tooling/lisa
    
    git clone https://gitlab.arm.com/tooling/lisa
    cd lisa
    # A few packages need to be installed, like python3 or kernelshark. Python
    # modules will be installed in a venv at the next step, without touching
    # any system-wide install location.
    ./install_base.sh --install-all

LISA_PYTHON value cannot be a path apparently, otherwise venv gets confused. Add the relevant folder to PATH first and set LISA_PYTHON to the binary filename.

export LISA_PYTHON=

On the first run, it will take care of creating a Python venv and populating it. This will take LISA_PYTHON into account when creating the venv

source init_env


2. Run the offending test on an **arm64** platform: `gdb $LISA_PYTHON -ex 'run -m pytest -s tests/test_test_bundle.py'`

EDIT: remove `-ex quit`
douglas-raillard-arm commented 2 months ago

A bit of progress with gdb on v3.12.3 : the infinite loop is here: ((futureiterobject*) current)->future is set to current, so it keeps going around without making any progress.

EDIT: It looks like module_free_freelists would cause a refcount issue if current->future == current since it calls PyObject_GC_Del(current); and then will try with current->future at the next iteration. That would try to garbage collect the same object multiple times.

douglas-raillard-arm commented 2 months ago

I added some asserts on v3.12.3 tag:

diff --git a/Modules/_asynciomodule.c b/Modules/_asynciomodule.c
index a465090bfaa..9b47661b1fc 100644
--- a/Modules/_asynciomodule.c
+++ b/Modules/_asynciomodule.c
@@ -1597,6 +1597,7 @@ FutureIter_dealloc(futureiterobject *it)
     if (state->fi_freelist_len < FI_FREELIST_MAXLEN) {
         state->fi_freelist_len++;
         it->future = (FutureObj*) state->fi_freelist;
+        assert((void*)it->future != (void*)it);
         state->fi_freelist = it;
     }
     else {
@@ -1818,6 +1819,7 @@ future_new_iter(PyObject *fut)
     }

     it->future = (FutureObj*)Py_NewRef(fut);
+    assert((void*)it->future != (void*)it);
     PyObject_GC_Track(it);
     return (PyObject*)it;
 }
@@ -3595,6 +3597,7 @@ module_traverse(PyObject *mod, visitproc visit, void *arg)
         PyObject *current = next;
         Py_VISIT(current);
         next = (PyObject*) ((futureiterobject*) current)->future;
+        assert((void*)next != (void*)current);
     }
     return 0;
 }

And this one failed to pass:

static void 
FutureIter_dealloc(futureiterobject *it) 
{ 
    PyTypeObject *tp = Py_TYPE(it); 
    asyncio_state *state = get_asyncio_state_by_def((PyObject *)it); 
    PyObject_GC_UnTrack(it); 
    tp->tp_clear((PyObject *)it); 

    if (state->fi_freelist_len < FI_FREELIST_MAXLEN) { 
        state->fi_freelist_len++; 
        it->future = (FutureObj*) state->fi_freelist; 
        assert((void*)it->future != (void*)it); // <------- ADDED THIS ASSERT
        state->fi_freelist = it; 
    } 
    else { 
        PyObject_GC_Del(it); 
        Py_DECREF(tp); 
    } 
} 
ZeroIntensity commented 2 months ago

At a very speculative glance, I would guess that the infinite loop problem originates from visited and obj, or possibly the context returned by warnings.catch_warnings, being tracked by the GC. It's difficult to tell without a full reproducer, though.

douglas-raillard-arm commented 2 months ago

@ZeroIntensity unfortunately I could not create a reproducer beyond the existing test suite, but I'm happy to apply a patch on cpython and run with that to help diagnose further, provide coredumps etc.

What is quite strange that this only triggers an issue on arm64 where there seemingly is nothing arch-specific in that code I could see. That test has been running on x86-64 since ~2018 without problems (obviously not on Python 3.12 all this time but still).

ZeroIntensity commented 2 months ago

If this only happens on ARM, then I think this might be an actual bug with the garbage collector. Since we don't have a standalone reproducer, you'll have to help me here a bit:

douglas-raillard-arm commented 2 months ago

If this only happens on ARM, then I think this might be an actual bug with the garbage collector.

That was my initial guess, but then the issue seemed to be a logic issue in Modules/_asynciomodule.c . It's hard to know for sure though as it could just be an invariant the gc is supposed to provide that is broken.

Is there a repository where you're running this? If we could run the test suite and reproduce the bug locally, that's still better than no reproducer.

Yes, see https://github.com/python/cpython/issues/122695#issuecomment-2270937727 . You may be able to skip the install_base.sh part and either just source init_env or create your own venv and pip install -r devmode_requirements.txt .

The issue manifests as either the test hanging or sometimes (rarely) triggering a negative refcount check assert.

What line specifically causes this? (as in, if you comment things out, commenting out which line(s) prevents the segfault/infinite loop?)

The gc hangs while executing that function: https://gitlab.arm.com/tooling/lisa/-/blob/main/lisa/tests/base.py?ref_type=heads#L1128 Sometimes it hangs inside predicate(obj) that is defined here as can be seen on this Python backtrace: https://github.com/python/cpython/issues/122695#issuecomment-2270831216

But the exact location is kind of variable and probably depends on when the gc decides to run. Always inside update_ref() though.

Does this happen on other CPython versions (main branch, 3.13, 3.11, etc)

So far I've tested 3.12.3 and 3.12.4 and it fails on both. I don't think I can easily test on 3.13 as some of our dependencies have extension modules that may or may not be compatible with 3.13, but I may be able to hack through the code to still have the offending code run with no dep.

ZeroIntensity commented 2 months ago

Apologies, my connection is slow at the moment so I've been trying to install it for the past hour. I'll see if I can reproduce it locally whenever it installs. Does lisa contain any C extensions? (If so, this might be user-error.)

douglas-raillard-arm commented 2 months ago

Does lisa contain any C extensions? (If so, this might be user-error.)

not in itself, only via dependencies (e.g. pandas)

On another note, I've tried on v3.12.4 tag with the following patch:

diff --git a/Modules/_asynciomodule.c b/Modules/_asynciomodule.c
index 05e79915ba7..3228aed87cb 100644
--- a/Modules/_asynciomodule.c
+++ b/Modules/_asynciomodule.c
@@ -17,7 +17,7 @@ module _asyncio
 /*[clinic end generated code: output=da39a3ee5e6b4b0d input=8fd17862aa989c69]*/

-#define FI_FREELIST_MAXLEN 255
+#define FI_FREELIST_MAXLEN 0

 typedef struct futureiterobject futureiterobject;

and that "fixes" the issue. So it's quite clear that this free list either has a problem on its own or something its implementation depends on (e.g. the gc) has an unexpected behavior.

douglas-raillard-arm commented 2 months ago

Also, I think I just reproduced on Python 3.12.4 (main, Jun 8 2024, 18:29:57) [GCC 9.4.0] on x86-64, so it may not be arch specific after all.

douglas-raillard-arm commented 2 months ago

Another thing I've observed is that Py_REFCNT(current) == 0 in an infinite loop case where current->future == current: https://github.com/python/cpython/blob/8e8a4baf652f6e1cee7acde9d78c4b6154539748/Modules/_asynciomodule.c#L3609

Is a m_traverse implementation supposed to call Py_VISIT() on an object with a refcount == 0 ?

ZeroIntensity commented 2 months ago

That's odd, if something's reference count is zero, it's in freed memory.

douglas-raillard-arm commented 2 months ago

Yeah, I had a look at the implementation of gc_get_referents() and it just stores everything visited in a list, no questions asked on refcounts. The tp_traverse() implementation for modules also just delegates to m_traverse() with just an extra condition on the module object itself to avoid an issue at module init, irrelevant here I think.

I'll try to re-run with extra asserts regarding refcount.

ZeroIntensity commented 2 months ago

stores everything visited in a list, no questions asked on refcounts

Storing it in a list increments the reference count.

douglas-raillard-arm commented 2 months ago

That makes sense, but then it would free it when the list gets destroyed, leading to destroying the object again and again every time it lands in such list isn't it ? Also the doc states:

When implementing tp_traverse, only the members that the instance owns (by having strong references to them) must be visited

If the refcount is 0, no one owns any strong reference to it, so this might trigger weird things. But I'm not super familiar with the details of Python GC.

ZeroIntensity commented 2 months ago

Yup, but it's very difficult to find the cause without a standalone reproducer :(

douglas-raillard-arm commented 2 months ago

Yes, that's very annoying. I tried hacking the code to remove the dependencies that were unused in the test path, and the issue disappears ...

It also looks like v3.13 has changed the freelist code to use _Py_FREELIST_* macros instead in c908d1f87d287a4b3ec58c85b692a7eb617fa6ea. The change seems to be motivated by supporting free-threaded builds.

ZeroIntensity commented 2 months ago

I wonder if the problem is with FutureObj_clear on 3.12. Does this branch fix it?

douglas-raillard-arm commented 2 months ago

Let me try. I've ran on 3.12.4 with that patch:

diff --git a/Modules/_asynciomodule.c b/Modules/_asynciomodule.c
index 05e79915ba7..964f291eab5 100644
--- a/Modules/_asynciomodule.c
+++ b/Modules/_asynciomodule.c
@@ -1612,6 +1612,7 @@ FutureIter_dealloc(futureiterobject *it)
         state->fi_freelist_len++;
         it->future = (FutureObj*) state->fi_freelist;
         state->fi_freelist = it;
+       assert(Py_REFCNT(state->fi_freelist) > 0);
     }
     else {
         PyObject_GC_Del(it);
@@ -1821,6 +1822,7 @@ future_new_iter(PyObject *fut)
         state->fi_freelist_len--;
         it = state->fi_freelist;
         state->fi_freelist = (futureiterobject*) it->future;
+       assert(Py_REFCNT(state->fi_freelist) > 0);
         it->future = NULL;
         _Py_NewReference((PyObject*) it);
     }

and it triggered the assert line 1615

So that means that FutureIter_dealloc() is called on a dead object, or that code somewhat destroys it:

    PyObject_GC_UnTrack(it);
    tp->tp_clear((PyObject *)it);

EDIT: it is expected that tp_dealloc is called with refcnt == 0: https://docs.python.org/3/c-api/typeobj.html#c.PyTypeObject.tp_dealloc

ZeroIntensity commented 2 months ago

it is expected that tp_dealloc is called with refcnt == 0: https://docs.python.org/3/c-api/typeobj.html#c.PyTypeObject.tp_dealloc

Oh right, I was being dumb. Does my patch work?

douglas-raillard-arm commented 2 months ago

So if I understand correctly:

  1. the freelist keeps objects with refcount == 0 by design
  2. However, it Py_VISIT() them in the module's tp_traverse()
  3. So that means that gc.get_referents() will add them to the list. Once that list is destroyed, their refcount will drop back to 0 again. Since this can happen an arbitrary number of time, I suppose they will be finalized multiple times ?
  4. The freelist seems to be a linked list using the futureiter->future pointer to point to the previous item. The module_traverse() function explicitly traverses the list. At the same time, FutureIter_traverse() also visits futureiter->future. Is that an issue ?
douglas-raillard-arm commented 2 months ago

Oh right, I was being dumb. Does my patch work?

still building it

ZeroIntensity commented 2 months ago

3. So that means that gc.get_referents() will add them to the list.

This might be the problem, being finalized twice causes a double free. Though, it's odd that you can't reproduce it.

douglas-raillard-arm commented 2 months ago

I think I found the issue:

  1. module_traverse() visits the freelist
  2. gc.get_referents() adds the references to a list. Since those objects had refcount == 0, the refcount is now 1.
  3. the user code drops the list. The refcount of the items goes back to 0, thereby calling FutureIter_dealloc() on each of them.
  4. As long as the freelist is small enough, FutureIter_dealloc() will add the destroyed futureiter add the front of the list, and store the previous freelist item in its ->future attribute.
  5. Since we end up running that dealloc function every time the refcount drops to 0, this can happen multiple times via the gc.get_referents() path. The objects for which that happens are the objects stored in that freelist. In some cases, the object that will be re-destroyed happens to also be at the top of the freelist. In that case, FutureIter_dealloc() will put it again at the front of the list. By doing so, it creates the infinite cycle where it->future == future.

Does that make sense ? I'm now running your patch, let's see ...

ZeroIntensity commented 2 months ago

That makes sense, I'm just skeptical considering the segfault occurs only under very specific conditions -- I would think it would have been more reproducible if that's the problem.

douglas-raillard-arm commented 2 months ago

The patch does not seem to work, it gets stuck in module_traverse() the same way.

vstinner commented 2 months ago

module_traverse() visits the freelist

That's wrong. A freelist only contains raw uninitialized bytes, not Python objects.

douglas-raillard-arm commented 2 months ago

I would think it would have been more reproducible if that's the problem.

Maybe this only triggers when:

  1. there is something in that freelist
  2. on 3.12 Now that I think about it, the only place where we have used 3.12 were user machines, nowhere in the CI yet. I'll try to make a reproducer with that in mind. I suppose a futureiter is created every time you await on a coroutine somehow ?

EDIT: FutureIter is simply created by iter(asyncio.Future())

ZeroIntensity commented 2 months ago

I'll defer to @vstinner, I don't know nearly enough about FutureIter (or the freelist, for that matter) to answer that.

douglas-raillard-arm commented 2 months ago

@ZeroIntensity at last, here is your reproducer:

import asyncio
import _asyncio
import gc

async def main():

    it = iter(asyncio.Future())
    del it
    xs = gc.get_referents(_asyncio)

asyncio.run(main())

asyncio.run() and the coroutine can be removed at the cost of getting a warning when creating asyncio.Future()

EDIT: I confirm that this reproduces the issue on 3.12 on any arch. There is no issue on 3.8, 3.9, 3.10 nor 3.11. From staring at the sources, it looks like 3.13 should not be affected since c908d1f87d287a4b3ec58c85b692a7eb617fa6ea removed the Py_VISIT() on the freelist in module_traverse().

ZeroIntensity commented 2 months ago

Thanks! It is odd that this only happens on 3.12, I'll try to write a patch.

@Eclips4, this issue needs the topic-asyncio and 3.12 label.

vstinner commented 2 months ago

For me, the root issue is that _asyncio module_traverse() function visits the free list. It doesn't make sense to me and it should be removed.

cc @kumaraditya303

douglas-raillard-arm commented 2 months ago

Would it make sense to add an asset in debug builds in Py_VISIT() to check for non-zero refcnt ? If the cost is too high for debug builds, maybe that could be done at least in gc.get_referents() to make sure we don't return anything half dead

ZeroIntensity commented 2 months ago

For me, the root issue is that _asyncio module_traverse() function visits the free list

Agreeing, but wouldn't that be a breaking change for a patch release? My solution as of now is to store an extra strong reference to prevent the double-free (it's not a leak, since it still gets deallocated by PyObject_GC_Del).

vstinner commented 2 months ago

Agreeing, but wouldn't that be a breaking change for a patch release?

It's not a breaking change but a bugfix. As you can see, the current code is wrong and causes undefined behavior when gc.get_referrers()/get_referents() is used.

ZeroIntensity commented 2 months ago

I've created #122834 as a fix, but Victor's way could be better. @douglas-raillard-arm does lisa rely on those FutureIter objects being returned by get_referents?

douglas-raillard-arm commented 2 months ago

@ZeroIntensity no, it's completely accidental. I think the link between my root object and async stuff is another object that uses some asyncio closures, that themselves probably somehow refer to the asyncio module and then we land on that free list.

Also, I'm going to replace that code that tries to fix up an object by just building it the right way directly. There is no way to pass any state to the yaml deserializer that builds it but I can probably just use a thread local or context variable to keep that state in my module.

vstinner commented 2 months ago

Thanks @douglas-raillard-arm for the bug report, thanks @ZeroIntensity for the analysis and the fix! I close the issue. The 3.12 backport will land soon.