python / cpython

The Python programming language
https://www.python.org
Other
63.34k stars 30.32k forks source link

Crash when generator frame proxies outlive their generator #125723

Open ncoghlan opened 2 weeks ago

ncoghlan commented 2 weeks ago

Crash report

What happened?

Checking some frame proxy behaviour at the interactive prompt, I encountered the following crash:

$ PYTHON_BASIC_REPL=1 python3.13
Python 3.13.0 (main, Oct  8 2024, 00:00:00) [GCC 14.2.1 20240912 (Red Hat 14.2.1-3)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> def g():
...     a = 1
...     yield locals(), sys._getframe().f_locals
...
>>> snapshot, live_locals = next(g())
>>> print(snapshot); print(live_locals)
{'a': 1}
{'a': b'print'}
>>> snapshot, live_locals = next(g())
>>> print(snapshot); print(live_locals)
{'a': 1}
{'a': 'input_trans_stack'}
>>> snapshot, live_locals = next(g())
>>> print(snapshot); print(live_locals)
{'a': 1}
Segmentation fault (core dumped)

(Crash was initially seen in the new REPL, the above reproduction in the basic REPL showed it wasn't specific to the new REPL).

Subsequent investigation suggests that the problem relates to the frame proxy outliving the eval loop that created it, as the following script was able to reliably reproduce the crash (as shown below):

import sys
def g():
    a = 1
    yield locals(), sys._getframe().f_locals
ns = {}
for i in range(10):
    exec("snapshot, live_locals = next(g())", locals=ns)
    print(ns)
$ python3.13 ../_misc/gen_locals_exec.py
{'snapshot': {'a': 1}, 'live_locals': {'a': 1}}
{'snapshot': {'a': 1}, 'live_locals': {'a': '_pickle.cpython-313-x86_64-linux-gnu.so'}}
Segmentation fault (core dumped)

Changing the code to explicitly keep the generator object alive:

import sys
def g():
    a = 1
    yield locals(), sys._getframe().f_locals
ns = {}
for i in range(10):
    gen = g()
    exec("snapshot, live_locals = next(gen)", locals=ns)
    print(ns)

is sufficient to eliminate the crash:

$ python3.13 ../_misc/gen_locals_exec.py
{'snapshot': {'a': 1}, 'live_locals': {'a': 1}}
{'snapshot': {'a': 1}, 'live_locals': {'a': 1}}
{'snapshot': {'a': 1}, 'live_locals': {'a': 1}}
{'snapshot': {'a': 1}, 'live_locals': {'a': 1}}
{'snapshot': {'a': 1}, 'live_locals': {'a': 1}}
{'snapshot': {'a': 1}, 'live_locals': {'a': 1}}
{'snapshot': {'a': 1}, 'live_locals': {'a': 1}}
{'snapshot': {'a': 1}, 'live_locals': {'a': 1}}
{'snapshot': {'a': 1}, 'live_locals': {'a': 1}}
{'snapshot': {'a': 1}, 'live_locals': {'a': 1}}

The crash is still eliminated, even when gen is created via exec rather than creating it in the main eval loop:

import sys
def g():
    a = 1
    yield locals(), sys._getframe().f_locals
ns = {}
for i in range(10):
    exec("gen = g()", locals=ns)
    exec("snapshot, live_locals = next(gen)", locals=ns)
    print(ns)
$ python3.13 ../_misc/gen_locals_exec.py
{'gen': <generator object g at 0x7f8da7d81b40>, 'snapshot': {'a': 1}, 'live_locals': {'a': 1}}
{'gen': <generator object g at 0x7f8da7d81e40>, 'snapshot': {'a': 1}, 'live_locals': {'a': 1}}
{'gen': <generator object g at 0x7f8da7d81c00>, 'snapshot': {'a': 1}, 'live_locals': {'a': 1}}
{'gen': <generator object g at 0x7f8da7d81b40>, 'snapshot': {'a': 1}, 'live_locals': {'a': 1}}
{'gen': <generator object g at 0x7f8da7d81e40>, 'snapshot': {'a': 1}, 'live_locals': {'a': 1}}
{'gen': <generator object g at 0x7f8da7d81c00>, 'snapshot': {'a': 1}, 'live_locals': {'a': 1}}
{'gen': <generator object g at 0x7f8da7d81b40>, 'snapshot': {'a': 1}, 'live_locals': {'a': 1}}
{'gen': <generator object g at 0x7f8da7d81e40>, 'snapshot': {'a': 1}, 'live_locals': {'a': 1}}
{'gen': <generator object g at 0x7f8da7d81c00>, 'snapshot': {'a': 1}, 'live_locals': {'a': 1}}
{'gen': <generator object g at 0x7f8da7d81b40>, 'snapshot': {'a': 1}, 'live_locals': {'a': 1}}

CPython versions tested on:

3.13, CPython main branch

Operating systems tested on:

Linux

Output from running 'python -VV' on the command line:

No response

ncoghlan commented 2 weeks ago

@gaogaotiantian @markshannon Genuinely strange interaction here when frame proxies for generator objects outlive both their generator and their eval loop. Hard to hit in normal code, but pretty easy to hit at an interactive prompt (assuming you're messing about with locals proxies interactively for some reason).

I assume it relates to something going wrong with the handoff of frame state from the eval loop to the generator object (given the garbage data being emitted, maybe the frame proxy pointer is still referencing the no longer valid loop frame?) (Note: it's literally been years since I looked at that code, so I don't actually know how it currently works. This assumption is only based on a vague memory of how it used to work around 3.10 or so)

ncoghlan commented 1 week ago

Definitely looks to be some kind of refcounting issue when the proxy is the only thing still referencing the underlying frame:

>>> import sys
... def g():
...     a = 1
...     yield sys._getframe()
...
>>> gen = g(); gen_frame = next(gen); gen_locals = gen_frame.f_locals
>>> gen_locals
{'a': 1}
>>> del gen, gen_frame; gen_locals
{}
>>> gen = g(); gen_frame = next(gen); gen_locals = gen_frame.f_locals
>>> gen_locals
{'a': 1}
>>> del gen, gen_frame; gen_locals
{'a': ' '}

The initial ref creation in https://github.com/python/cpython/blob/e924bb667a19ee1812d6c7592a37dd37346dda04/Objects/frameobject.c#L338 uses Py_NewRef, which seems fine.

But this ownership check in genobject.c looks highly suspicious now: https://github.com/python/cpython/blob/e924bb667a19ee1812d6c7592a37dd37346dda04/Objects/genobject.c#L64

I initially thought that when sys._getframe() is called inside the generator, the frame ownership would have been flipped over to FRAME_OWNED_BY_FRAME_OBJECT, but reading the code suggests that isn't the case (see edit below). Further experimentation also suggests this as a plausible source of the problem:

>>> gen = g(); gen_frame = next(gen); gen_locals = gen_frame.f_locals
>>> del gen
>>> gen_frame.f_locals
Segmentation fault (core dumped)

Assuming that analysis is correct, I think at the very least, both sys._getframe and framelocalsproxy_new need to ensure that the frame data is owned by either the C stack or the frame object, since it isn't safe for the generator or the thread object to clear the frame anymore. The take_ownership function can be used to transfer the data management to normal refcounting rules if necessary.

Leaving the data ownership alone for data owned by the C stack is based on take_ownership asserting that the frame isn't still owned by the C stack. However, I'm pretty sure that case isn't being handled correctly either, I just don't understand the data management intricacies well enough to make a concrete suggestion for improving matters.

Edit: I found the line I had missed that makes this work in the general case. Creating a Python frame object inherently grants ownership of the frame data to that linked frame object: https://github.com/python/cpython/blob/e924bb667a19ee1812d6c7592a37dd37346dda04/Objects/frameobject.c#L1863

So it looks like mere code introspection isn't going to be enough to track this one down, it's going to take experimentation on a local build.

The fact f_code behaves itself in the following example also makes it more likely this is a problem specific to the frame locals proxy rather than relating to a general frame data ownership regression:

>>> gen = g()
>>> gen_frame = next(gen)
>>> gen_frame is gen.gi_frame
True
>>> gen_frame.f_code
<code object g at 0x7f00713cf110, file "<python-input-28>", line 1>
>>> del gen
>>> gen_frame.f_code
<code object g at 0x7f00713cf110, file "<python-input-28>", line 1>
>>> gen_frame.f_locals
{'a': 'n'}
gaogaotiantian commented 1 week ago

@ncoghlan are you still investigating or you need help on this? I don't have any clue at this point so I'll need to dig into the code base as well. If this is too much for you I can take over. However, if you can figure it out and fix it, I'd appreciate it :)

Zheaoli commented 1 week ago

I can not reproduce it on main branch but I can reproduce it on 3.13 branch. I think I can find more clue about the root cause.

Zheaoli commented 1 week ago

Fun fact, after bisect, I found #119209 may fixed this issue in some circumstance. When I run the test code

import sys
def g():
    a = 1
    yield locals(), sys._getframe().f_locals
ns = {}
for i in range(10):
    exec("snapshot, live_locals = next(g())", locals=ns)
    print(ns)

https://github.com/python/cpython/commit/e9875ecb5dd3a9c44a184c71cc562ce1fea6e03b Would be correct and crash in https://github.com/python/cpython/commit/73ab83b27f105a4509046ce26e35f20d66625195

BTW, I compile the code in ./configure --enable-optimizations

When I try to compile the CPython ./configure --with-pydebug --with-trace-refs --enable-profiling, the test code will crash in both main branch and 3.13 branch.

I think I may need some time to dive more deeper

ncoghlan commented 6 hours ago

@gaogaotiantian I'm not currently investigating this one (my initial investigation was mostly to work out how the test suite had missed it, which turned out to be the "frame proxy must outlive its creating evaluation loop" criterion for hitting the crash). It would have been nice if the problem had been obvious merely on reading the code, but alas, the cause (whatever it turns out to be) isn't that simple.

@Zheaoli That's definitely an interesting find. It suggests we might be inadvertently creating a reference cycle somewhere, and that commit happened to change exactly when the cycle gets collected. (The locals proxy isn't supposed to be getting involved in any cycles that aren't explicitly created in the Python code, but there's clearly something unexpected going on, so inadvertently created cycles can't be ruled out in advance)

ncoghlan commented 5 hours ago

Something that occurred to me is whether it might be something the specialising interpreter is doing with the refcount handling (or the generator frame state management), rather than being inherent in the C code itself.

It's the fact that this works:

    exec("gen = g()", locals=ns)
    exec("snapshot, live_locals = next(gen)", locals=ns)
    print(ns)

while this potentially crashes:

    exec("snapshot, live_locals = next(g())", locals=ns)
    print(ns)

that got me thinking in that direction, as the only substantive difference I can see between the two versions is that in the second form, g() never leaves the eval loop's stack (it is fed directly into the subsequent next() call), while in the first form it does (since it is bound to gen in the first eval loop).

If the eval loop ends up clearing a generator frame state it doesn't actually own when the eval loop terminates in the second scenario, that would explain that state being gone when the frame proxy tries to access it.