python / cpython

The Python programming language
https://www.python.org

`BaseSubprocessTransport.__del__` fails if the event loop is already closed, which can leak an orphan process #114177

Open gschaffner opened 10 months ago

gschaffner commented 10 months ago

Bug report

Bug description:

There is a race in which BaseSubprocessTransport.__del__ can end up trying to close the transport after the event loop has already been closed. This results in an unraisable exception in __del__, and it can also leak an orphan process.

The following reproducer triggers the race between run() exiting and [the process dying and the event loop learning about the process's death]. On my machine the bug occurs (i.e. run() wins the race) roughly 90% of the time:

from __future__ import annotations

import asyncio
from subprocess import PIPE

async def main() -> None:
    try:
        async with asyncio.timeout(1):
            process = await asyncio.create_subprocess_exec(
                "/usr/bin/env",
                "sh",
                "-c",
                "while true; do sleep 1; done",
                stdin=PIPE,
                stdout=PIPE,
                stderr=PIPE,
            )
            try:
                await process.wait()
            except BaseException:
                process.kill()
                # N.B.: even though the process is sent SIGKILL, the user is (very
                # briefly) leaking an orphan process to asyncio, because they do not
                # wait for the event loop to learn that the process died. If we added
                # while True:
                #     try:
                #         await process.wait()
                #     except CancelledError:
                #         pass
                #     else:
                #         break
                # (i.e. if we used structured concurrency), the race would not occur.
                # (A full runnable version of that variant appears after the
                # traceback below.)
                raise
    except TimeoutError:
        pass

if __name__ == "__main__":
    asyncio.run(main())

Most of the time, running this emits:

Exception ignored in: <function BaseSubprocessTransport.__del__ at 0x7fd25db428e0>
Traceback (most recent call last):
    File "/usr/lib/python3.11/asyncio/base_subprocess.py", line 126, in __del__
    self.close()
    File "/usr/lib/python3.11/asyncio/base_subprocess.py", line 104, in close
    proto.pipe.close()
    File "/usr/lib/python3.11/asyncio/unix_events.py", line 566, in close
    self._close(None)
    File "/usr/lib/python3.11/asyncio/unix_events.py", line 590, in _close
    self._loop.call_soon(self._call_connection_lost, exc)
    File "/usr/lib/python3.11/asyncio/base_events.py", line 761, in call_soon
    self._check_closed()
    File "/usr/lib/python3.11/asyncio/base_events.py", line 519, in _check_closed
    raise RuntimeError('Event loop is closed')
RuntimeError: Event loop is closed
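
For reference, here is a runnable sketch of the structured-concurrency variant described in the comment inside the reproducer (main_structured is just an illustrative name, not part of the report); with this change the transport is closed while the loop is still running, so the race should not occur:

from __future__ import annotations

import asyncio
from subprocess import PIPE

async def main_structured() -> None:
    try:
        async with asyncio.timeout(1):
            process = await asyncio.create_subprocess_exec(
                "/usr/bin/env",
                "sh",
                "-c",
                "while true; do sleep 1; done",
                stdin=PIPE,
                stdout=PIPE,
                stderr=PIPE,
            )
            try:
                await process.wait()
            except BaseException:
                process.kill()
                # Keep waiting until the event loop has actually reaped the
                # child, so that the transport is closed while the loop is
                # still open.
                while True:
                    try:
                        await process.wait()
                    except asyncio.CancelledError:
                        pass
                    else:
                        break
                raise
    except TimeoutError:
        pass

if __name__ == "__main__":
    asyncio.run(main_structured())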

This case (the unraisable RuntimeError) looks similar to GH-109538. I think the following patch (analogous to GH-111983) fixes it:

diff --git a/Lib/asyncio/base_subprocess.py b/Lib/asyncio/base_subprocess.py
index 6dbde2b696..ba219ba39d 100644
--- a/Lib/asyncio/base_subprocess.py
+++ b/Lib/asyncio/base_subprocess.py
@@ -123,8 +123,11 @@ def close(self):

     def __del__(self, _warn=warnings.warn):
         if not self._closed:
-            _warn(f"unclosed transport {self!r}", ResourceWarning, source=self)
-            self.close()
+            if self._loop.is_closed():
+                _warn("loop is closed", ResourceWarning, source=self)
+            else:
+                _warn(f"unclosed transport {self!r}", ResourceWarning, source=self)
+                self.close()

     def get_pid(self):
         return self._pid

However, there is another case for which the above patch is not sufficient. In the example above the user orphaned the process only after sending it SIGKILL/TerminateProcess (which is not immediate; it only schedules the kill), but what if they orphan it entirely?

from __future__ import annotations

import asyncio
from subprocess import PIPE

async def main_leak_subprocess() -> None:
    await asyncio.create_subprocess_exec(
        "/usr/bin/env",
        "sh",
        "-c",
        "while true; do sleep 1; done",
        stdin=PIPE,
        stdout=PIPE,
        stderr=PIPE,
    )

if __name__ == "__main__":
    asyncio.run(main_leak_subprocess())

Currently (on main), when the race condition occurs (for this example, the condition is run() winning the race against garbage collection of the BaseSubprocessTransport), asyncio emits a loud complaint (Exception ignored in: <function BaseSubprocessTransport.__del__ at 0x7f5b3b291e40>) and leaks the orphan process (check htop after the interpreter exits!). asyncio probably also leaks the pipes.

With the patch above, asyncio will still quietly leak the orphan process (and probably the pipes), but it will not complain about the leak unless the user enables ResourceWarnings, which is not good.
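
(Note: ResourceWarning is ignored by default, so with the patch above the quieter warning would only be visible when running with something like python -X dev or python -W always::ResourceWarning, or with a warnings filter installed at startup; a minimal sketch:)

import warnings

# Make ResourceWarning visible; outside of dev mode it is ignored by default,
# so the warning emitted by the patched __del__ would otherwise go unseen.
warnings.simplefilter("always", ResourceWarning)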

So a more correct patch (one that fixes both cases) may be something along the lines of:

diff --git a/Lib/asyncio/base_subprocess.py b/Lib/asyncio/base_subprocess.py
index 6dbde2b696..9c86c8444c 100644
--- a/Lib/asyncio/base_subprocess.py
+++ b/Lib/asyncio/base_subprocess.py
@@ -123,8 +123,31 @@ def close(self):

     def __del__(self, _warn=warnings.warn):
         if not self._closed:
-            _warn(f"unclosed transport {self!r}", ResourceWarning, source=self)
-            self.close()
+            if not self._loop.is_closed():
+                _warn(f"unclosed transport {self!r}", ResourceWarning, source=self)
+                self.close()
+            else:
+                _warn("loop is closed", ResourceWarning, source=self)
+
+                # self.close() requires the event loop to be open, so we need to reach
+                # into close() and its dependencies and manually do the bare-minimum
+                # cleanup that they'd do if the loop was open. I.e. we do the syscalls
+                # only; we can't interact with an event loop.
+
+                # TODO: Probably need some more stuff here too so that we don't leak fd's...
+
+                if (self._proc is not None and
+                        # the transport has not been told that the
+                        # child process finished...
+                        self._returncode is None and
+                        # ...and the child really is still running?
+                        self._proc.poll() is None):
+
+                    try:
+                        self._proc.kill()
+                    except (ProcessLookupError, PermissionError):
+                        # the process may have already exited or may be running setuid
+                        pass

     def get_pid(self):
         return self._pid

With this patch applied, neither example leaks an orphan process out of run(), and both examples emit a ResourceWarning. However, this patch is rather messy. It may also still leak the pipe fds out of run(). (The fds presumably get closed by the OS when the interpreter shuts down, but I suspect one end of each pipe remains orphaned from the time run() exits until the interpreter shuts down, which can be arbitrarily long.)
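
(For completeness, and independent of any asyncio-side fix: the second example can be made leak-free on the user side by reaping the child before run() returns. A minimal sketch, where main_no_leak is just an illustrative name:)

from __future__ import annotations

import asyncio
from subprocess import PIPE

async def main_no_leak() -> None:
    process = await asyncio.create_subprocess_exec(
        "/usr/bin/env",
        "sh",
        "-c",
        "while true; do sleep 1; done",
        stdin=PIPE,
        stdout=PIPE,
        stderr=PIPE,
    )
    try:
        pass  # do whatever work with the process here
    finally:
        if process.returncode is None:
            process.kill()
        # Let the event loop reap the child and close the pipe transports
        # while the loop is still running.
        await process.wait()

if __name__ == "__main__":
    asyncio.run(main_no_leak())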

CPython versions tested on:

3.11, CPython main branch

Operating systems tested on:

Linux

agronholm commented 10 months ago

Attempting any sort of event loop operation in a destructor is hazardous because you can't be sure which thread the destructor runs on. On CPython you can usually ensure that it happens in the event loop's thread, but if a different GC implementation is used (PyPy et al), this could happen in any thread.

defkev commented 6 months ago

I am now seeing this as well (though not always) after upgrading to 3.12.3 (from 3.11.?) on Arch, when interrupting asyncio.run() with a KeyboardInterrupt:

Exception ignored in: <function BaseSubprocessTransport.__del__ at 0x7dc21d030d60>
Traceback (most recent call last):
  File "/usr/lib/python3.12/asyncio/base_subprocess.py", line 126, in __del__
    self.close()
  File "/usr/lib/python3.12/asyncio/base_subprocess.py", line 104, in close
    proto.pipe.close()
  File "/usr/lib/python3.12/asyncio/unix_events.py", line 767, in close
    self.write_eof()
  File "/usr/lib/python3.12/asyncio/unix_events.py", line 753, in write_eof
    self._loop.call_soon(self._call_connection_lost, None)
  File "/usr/lib/python3.12/asyncio/base_events.py", line 795, in call_soon
    self._check_closed()
  File "/usr/lib/python3.12/asyncio/base_events.py", line 541, in _check_closed
    raise RuntimeError('Event loop is closed')
RuntimeError: Event loop is closed