python / cpython

The Python programming language
https://www.python.org/
Other
60.06k stars 29.08k forks source link

`test_threading.test_interrupt_main_subthread`: release unlocked lock #118433

Open colesbury opened 2 weeks ago

colesbury commented 2 weeks ago

Bug report

======================================================================
ERROR: test_interrupt_main_subthread (test.test_threading.InterruptMainTests.test_interrupt_main_subthread)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/ec2-user/buildbot/buildarea/3.x.itamaro-macos-arm64-aws.macos-with-brew.refleak.nogil/build/Lib/threading.py", line 659, in wait
    signaled = self._cond.wait(timeout)
               ~~~~~~~~~~~~~~~^^^^^^^^^
  File "/Users/ec2-user/buildbot/buildarea/3.x.itamaro-macos-arm64-aws.macos-with-brew.refleak.nogil/build/Lib/threading.py", line 368, in wait
    self._acquire_restore(saved_state)
    ~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^
  File "/Users/ec2-user/buildbot/buildarea/3.x.itamaro-macos-arm64-aws.macos-with-brew.refleak.nogil/build/Lib/threading.py", line 315, in _acquire_restore
    def _acquire_restore(self, x):

KeyboardInterrupt
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/Users/ec2-user/buildbot/buildarea/3.x.itamaro-macos-arm64-aws.macos-with-brew.refleak.nogil/build/Lib/test/test_threading.py", line 2035, in test_interrupt_main_subthread
    t.start()
    ~~~~~~~^^
  File "/Users/ec2-user/buildbot/buildarea/3.x.itamaro-macos-arm64-aws.macos-with-brew.refleak.nogil/build/Lib/threading.py", line 977, in start
    self._started.wait()  # Will set ident and native_id
    ~~~~~~~~~~~~~~~~~~^^
  File "/Users/ec2-user/buildbot/buildarea/3.x.itamaro-macos-arm64-aws.macos-with-brew.refleak.nogil/build/Lib/threading.py", line 656, in wait
    with self._cond:
    ...<3 lines>...
        return signaled
  File "/Users/ec2-user/buildbot/buildarea/3.x.itamaro-macos-arm64-aws.macos-with-brew.refleak.nogil/build/Lib/threading.py", line 307, in __exit__
    return self._lock.__exit__(*args)
           ~~~~~~~~~~~~~~~~~~~^^^^^^^
RuntimeError: release unlocked lock
----------------------------------------------------------------------

Seen on the ARM64 macOS nogil refleaks buildbot:

https://buildbot.python.org/all/#/builders/1368/builds/865/steps/5/logs/stdio

Reported by encoku

Linked PRs

colesbury commented 2 weeks ago

I'm able to reproduce it locally ~3-8% of the time with a command like on Linux (x86-64) and macOS (arm64):

./python -m test test_threading -m test_interrupt_main_subthread -R 3:3
colesbury commented 2 weeks ago

The bug is triggered because the KeyboardInterrupt is raised inside the finally handler before it's able to actually restore the state:

https://github.com/python/cpython/blob/72dae53e09a5344bf4922d934a34a2fa48a11c86/Lib/threading.py#L368

I'm able to reproduce the issue reliably, including in the default build, with the following patch: https://gist.github.com/colesbury/74e2523c3622f479f7e39fc62c5dc948

colesbury commented 2 weeks ago

@mpage I'd appreciate your thoughts on this since you've recently looked at related thread-safety issues. I'm not sure if there's a reliable fix without moving a lot of the Condition code into C.

mpage commented 2 weeks ago

I'm not sure what the right thing to do here is. I think the fundamental challenge is that we need to make wait() (and probably other parts of Condition) reliable in the face of asynchronous exceptions, which is really hard (impossible?) to do in managed Python code. Another alternative to moving the Condition code into C might be to add functionality to block/unblock delivery of asynchronous exceptions and use it at the appropriate points in wait(). Something like the following:

def wait(self, timeout=None):
    if not self._is_owned():
        raise RuntimeError("cannot wait on un-acquired lock")
    waiter = _allocate_lock()
    waiter.acquire()
    self._waiters.append(waiter)
    # Queue delivery of asynchronous exceptions until the body of the context
    # manager completes, or they are unblocked explicitly
    with _thread.block_async_exceptions():
        saved_state = self._release_save()
        gotit = False
        try:    # restore state no matter what (e.g., KeyboardInterrupt)
            if timeout is None:
                # Allow asynchronous exceptions, queued or otherwise, to be
                # raised in the body of the context manager
                with _thread.allow_async_exceptions():
                    waiter.acquire()
                gotit = True
            else:
                if timeout > 0:
                    with _thread.allow_async_exceptions():
                        gotit = waiter.acquire(True, timeout)
                else:
                    with _thread.allow_async_exceptions():
                        gotit = waiter.acquire(False)
            return gotit
        finally:
            self._acquire_restore(saved_state)
            if not gotit:
                try:
                    self._waiters.remove(waiter)
                except ValueError:
                    pass
encukou commented 2 weeks ago

Reminds me of trio's KeyboardInterrupt protection, with an 2017 blog post. Marking a frame to block the delivery seems a bit neater than two context managers?

colesbury commented 2 weeks ago

@encukou - thanks for the link, that sounds like the same issue. Marking the frame is tricky because the self._acquire_restore() (in the finally block) call is a call to a Python function and the KeyboardInterrupt happens on entry. So marking the Condition.wait() frame doesn't seem like it'd be sufficient.

colesbury commented 2 weeks ago

So I think the options we've come up with are:

1) Move things to C. For making the test pass reliably, it's enough if the Thread._started event is implemented in C. Here's an example https://github.com/colesbury/cpython/commit/d0118fae59577646590662dba2ba596e86ccf7df I tested yesterday. There may also be a way to do this as part of the ThreadHandle that @mpage wrote.

2) Suppress asynchronous exceptions in Condition (and maybe other synchronization primitives written in Python)

Either approach seems reasonable to me. What do both of you think?

EDIT: trio changes the signal handler, but that seems less appropriate in CPython than option 2.

mpage commented 2 weeks ago

Recording the gist of my conversation with @colesbury.

If we can make it work, I think option (2) is preferable, for a few reasons:

  1. The issue that's affecting Condition likely applies to the other synchronization primitives in the threading module, so we'd need to move those to C as well.
  2. As @colesbury pointed out, Condition can accept an arbitrary lock object, which could execute arbitrary Python code, leaving us open to the original problem.
  3. Other Python code outside of CPython may need to deal with the same issue and moving to C might not be a viable solution in those cases. Providing a solution in Python would be valuable for those use cases.
encukou commented 2 weeks ago

I don't think I'll have time to have a proper look at this before beta, so a few quick comments:

self._acquire_restore() (in the finally block) call is a call to a Python function

I guess, that would need to be inlined (that is, always set in __init__). Or also guarded.

trio changes the signal handler

AFAIK, this just trio doing whatever it can from Python code -- if it could do something like disable some of the eval breaker, it'd do that instead.

Other Python code outside of CPython may need to deal with the same issue

Yup, that's why I'd look at what works for trio :)

encukou commented 3 days ago

@Zac-HD, you might be interested in this issue

Zac-HD commented 3 days ago

I'm vaguely familiar with Trio's attempts to solve this, but if you've already read the blog post above I don't think I'd have much more to add. "Arbitrary Python code' in a Condition sounds like trouble, though.