Open ncoghlan opened 2 weeks ago
@gaogaotiantian @markshannon Genuinely strange interaction here when frame proxies for generator objects outlive both their generator and their eval loop. Hard to hit in normal code, but pretty easy to hit at an interactive prompt (assuming you're messing about with locals proxies interactively for some reason).
I assume it relates to something going wrong with the handoff of frame state from the eval loop to the generator object (given the garbage data being emitted, maybe the frame proxy pointer is still referencing the no longer valid loop frame?) (Note: it's literally been years since I looked at that code, so I don't actually know how it currently works. This assumption is only based on a vague memory of how it used to work around 3.10 or so)
Definitely looks to be some kind of refcounting issue when the proxy is the only thing still referencing the underlying frame:
>>> import sys
... def g():
... a = 1
... yield sys._getframe()
...
>>> gen = g(); gen_frame = next(gen); gen_locals = gen_frame.f_locals
>>> gen_locals
{'a': 1}
>>> del gen, gen_frame; gen_locals
{}
>>> gen = g(); gen_frame = next(gen); gen_locals = gen_frame.f_locals
>>> gen_locals
{'a': 1}
>>> del gen, gen_frame; gen_locals
{'a': ' '}
The initial ref creation in https://github.com/python/cpython/blob/e924bb667a19ee1812d6c7592a37dd37346dda04/Objects/frameobject.c#L338 uses Py_NewRef
, which seems fine.
But this ownership check in genobject.c
looks highly suspicious now: https://github.com/python/cpython/blob/e924bb667a19ee1812d6c7592a37dd37346dda04/Objects/genobject.c#L64
I initially thought that when sys._getframe()
is called inside the generator, the frame ownership would have been flipped over to FRAME_OWNED_BY_FRAME_OBJECT
, but reading the code suggests that isn't the case (see edit below). Further experimentation also suggests this as a plausible source of the problem:
>>> gen = g(); gen_frame = next(gen); gen_locals = gen_frame.f_locals
>>> del gen
>>> gen_frame.f_locals
Segmentation fault (core dumped)
Assuming that analysis is correct, I think at the very least, both sys._getframe
and framelocalsproxy_new
need to ensure that the frame data is owned by either the C stack or the frame object, since it isn't safe for the generator or the thread object to clear the frame anymore. The take_ownership
function can be used to transfer the data management to normal refcounting rules if necessary.
Leaving the data ownership alone for data owned by the C stack is based on take_ownership
asserting that the frame isn't still owned by the C stack. However, I'm pretty sure that case isn't being handled correctly either, I just don't understand the data management intricacies well enough to make a concrete suggestion for improving matters.
Edit: I found the line I had missed that makes this work in the general case. Creating a Python frame object inherently grants ownership of the frame data to that linked frame object: https://github.com/python/cpython/blob/e924bb667a19ee1812d6c7592a37dd37346dda04/Objects/frameobject.c#L1863
So it looks like mere code introspection isn't going to be enough to track this one down, it's going to take experimentation on a local build.
The fact f_code
behaves itself in the following example also makes it more likely this is a problem specific to the frame locals proxy rather than relating to a general frame data ownership regression:
>>> gen = g()
>>> gen_frame = next(gen)
>>> gen_frame is gen.gi_frame
True
>>> gen_frame.f_code
<code object g at 0x7f00713cf110, file "<python-input-28>", line 1>
>>> del gen
>>> gen_frame.f_code
<code object g at 0x7f00713cf110, file "<python-input-28>", line 1>
>>> gen_frame.f_locals
{'a': 'n'}
@ncoghlan are you still investigating or you need help on this? I don't have any clue at this point so I'll need to dig into the code base as well. If this is too much for you I can take over. However, if you can figure it out and fix it, I'd appreciate it :)
I can not reproduce it on main branch but I can reproduce it on 3.13 branch. I think I can find more clue about the root cause.
Fun fact, after bisect, I found #119209 may fixed this issue in some circumstance. When I run the test code
import sys
def g():
a = 1
yield locals(), sys._getframe().f_locals
ns = {}
for i in range(10):
exec("snapshot, live_locals = next(g())", locals=ns)
print(ns)
https://github.com/python/cpython/commit/e9875ecb5dd3a9c44a184c71cc562ce1fea6e03b Would be correct and crash in https://github.com/python/cpython/commit/73ab83b27f105a4509046ce26e35f20d66625195
BTW, I compile the code in ./configure --enable-optimizations
When I try to compile the CPython ./configure --with-pydebug --with-trace-refs --enable-profiling
, the test code will crash in both main branch and 3.13 branch.
I think I may need some time to dive more deeper
@gaogaotiantian I'm not currently investigating this one (my initial investigation was mostly to work out how the test suite had missed it, which turned out to be the "frame proxy must outlive its creating evaluation loop" criterion for hitting the crash). It would have been nice if the problem had been obvious merely on reading the code, but alas, the cause (whatever it turns out to be) isn't that simple.
@Zheaoli That's definitely an interesting find. It suggests we might be inadvertently creating a reference cycle somewhere, and that commit happened to change exactly when the cycle gets collected. (The locals proxy isn't supposed to be getting involved in any cycles that aren't explicitly created in the Python code, but there's clearly something unexpected going on, so inadvertently created cycles can't be ruled out in advance)
Something that occurred to me is whether it might be something the specialising interpreter is doing with the refcount handling (or the generator frame state management), rather than being inherent in the C code itself.
It's the fact that this works:
exec("gen = g()", locals=ns)
exec("snapshot, live_locals = next(gen)", locals=ns)
print(ns)
while this potentially crashes:
exec("snapshot, live_locals = next(g())", locals=ns)
print(ns)
that got me thinking in that direction, as the only substantive difference I can see between the two versions is that in the second form, g()
never leaves the eval loop's stack (it is fed directly into the subsequent next()
call), while in the first form it does (since it is bound to gen
in the first eval loop).
If the eval loop ends up clearing a generator frame state it doesn't actually own when the eval loop terminates in the second scenario, that would explain that state being gone when the frame proxy tries to access it.
Crash report
What happened?
Checking some frame proxy behaviour at the interactive prompt, I encountered the following crash:
(Crash was initially seen in the new REPL, the above reproduction in the basic REPL showed it wasn't specific to the new REPL).
Subsequent investigation suggests that the problem relates to the frame proxy outliving the eval loop that created it, as the following script was able to reliably reproduce the crash (as shown below):
Changing the code to explicitly keep the generator object alive:
is sufficient to eliminate the crash:
The crash is still eliminated, even when
gen
is created viaexec
rather than creating it in the main eval loop:CPython versions tested on:
3.13, CPython main branch
Operating systems tested on:
Linux
Output from running 'python -VV' on the command line:
No response