Open colesbury opened 2 weeks ago
I'm able to reproduce it locally ~3-8% of the time with a command like on Linux (x86-64) and macOS (arm64):
./python -m test test_threading -m test_interrupt_main_subthread -R 3:3
The bug is triggered because the KeyboardInterrupt
is raised inside the finally
handler before it's able to actually restore the state:
I'm able to reproduce the issue reliably, including in the default build, with the following patch: https://gist.github.com/colesbury/74e2523c3622f479f7e39fc62c5dc948
@mpage I'd appreciate your thoughts on this since you've recently looked at related thread-safety issues. I'm not sure if there's a reliable fix without moving a lot of the Condition
code into C.
I'm not sure what the right thing to do here is. I think the fundamental challenge is that we need to make wait()
(and probably other parts of Condition
) reliable in the face of asynchronous exceptions, which is really hard (impossible?) to do in managed Python code. Another alternative to moving the Condition
code into C might be to add functionality to block/unblock delivery of asynchronous exceptions and use it at the appropriate points in wait()
. Something like the following:
def wait(self, timeout=None):
if not self._is_owned():
raise RuntimeError("cannot wait on un-acquired lock")
waiter = _allocate_lock()
waiter.acquire()
self._waiters.append(waiter)
# Queue delivery of asynchronous exceptions until the body of the context
# manager completes, or they are unblocked explicitly
with _thread.block_async_exceptions():
saved_state = self._release_save()
gotit = False
try: # restore state no matter what (e.g., KeyboardInterrupt)
if timeout is None:
# Allow asynchronous exceptions, queued or otherwise, to be
# raised in the body of the context manager
with _thread.allow_async_exceptions():
waiter.acquire()
gotit = True
else:
if timeout > 0:
with _thread.allow_async_exceptions():
gotit = waiter.acquire(True, timeout)
else:
with _thread.allow_async_exceptions():
gotit = waiter.acquire(False)
return gotit
finally:
self._acquire_restore(saved_state)
if not gotit:
try:
self._waiters.remove(waiter)
except ValueError:
pass
Reminds me of trio's KeyboardInterrupt protection, with an 2017 blog post. Marking a frame to block the delivery seems a bit neater than two context managers?
@encukou - thanks for the link, that sounds like the same issue. Marking the frame is tricky because the self._acquire_restore()
(in the finally
block) call is a call to a Python function and the KeyboardInterrupt
happens on entry. So marking the Condition.wait()
frame doesn't seem like it'd be sufficient.
So I think the options we've come up with are:
1) Move things to C. For making the test pass reliably, it's enough if the Thread._started
event is implemented in C. Here's an example https://github.com/colesbury/cpython/commit/d0118fae59577646590662dba2ba596e86ccf7df I tested yesterday. There may also be a way to do this as part of the ThreadHandle
that @mpage wrote.
2) Suppress asynchronous exceptions in Condition
(and maybe other synchronization primitives written in Python)
Either approach seems reasonable to me. What do both of you think?
EDIT: trio changes the signal handler, but that seems less appropriate in CPython than option 2.
Recording the gist of my conversation with @colesbury.
If we can make it work, I think option (2) is preferable, for a few reasons:
Condition
likely applies to the other synchronization primitives in the threading
module, so we'd need to move those to C as well. Condition
can accept an arbitrary lock object, which could execute arbitrary Python code, leaving us open to the original problem.I don't think I'll have time to have a proper look at this before beta, so a few quick comments:
self._acquire_restore()
(in the finally block) call is a call to a Python function
I guess, that would need to be inlined (that is, always set in __init__
).
Or also guarded.
trio changes the signal handler
AFAIK, this just trio doing whatever it can from Python code -- if it could do something like disable some of the eval breaker, it'd do that instead.
Other Python code outside of CPython may need to deal with the same issue
Yup, that's why I'd look at what works for trio :)
@Zac-HD, you might be interested in this issue
I'm vaguely familiar with Trio's attempts to solve this, but if you've already read the blog post above I don't think I'd have much more to add. "Arbitrary Python code' in a Condition
sounds like trouble, though.
Bug report
Seen on the ARM64 macOS nogil refleaks buildbot:
https://buildbot.python.org/all/#/builders/1368/builds/865/steps/5/logs/stdio
Reported by encoku
Linked PRs