
Unable to cancel Server.serve_forever() if a reader is blocked while reading (3.12 regression) #123720

Open paravoid opened 1 month ago

paravoid commented 1 month ago

Bug report

Bug description:

Consider this code, slightly simplifying the documentation's TCPServer example code:

import asyncio

async def handle_echo(reader, writer):
    print("Reading")
    data = await reader.read(100)
    message = data.decode()
    print(f"Received '{message!r}'")

    print("Closing the connection")
    writer.close()
    await writer.wait_closed()

async def main():
    server = await asyncio.start_server(handle_echo, "127.0.0.1", 8888)
    print("Serving forever...")
    async with server:
        try:
            await server.serve_forever()  
        except asyncio.CancelledError:
            print("Cancelled by Ctrl+C")
            server.close()

asyncio.run(main())

(My code is closer to a while True: await reader.readline(); ..., but the above probably suffices as a demonstration)
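For reference, a rough sketch of what such a readline-based handler looks like (simplified; not my actual code):

async def handle_client(reader, writer):
    # Read newline-terminated requests until the client disconnects.
    while True:
        line = await reader.readline()
        if not line:  # b"" means EOF: the client closed the connection
            break
        # ... process `line` and possibly write a response back ...
    writer.close()
    await writer.wait_closed()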

Running this prints Serving forever..., and hitting Ctrl+C prints Cancelled by Ctrl+C, followed by a normal exit.

However, if in another window we run nc 127.0.0.1 8888 and leave the connection open, Ctrl+C (SIGINT) does nothing, and a second Ctrl+C is required to terminate. (That second Ctrl+C, however, breaks out of the asyncio loop by raising KeyboardInterrupt, as documented.)

So basically clients can prevent the server from cleanly exiting by just keeping their connection open.

This is a regression: it fails with 3.12.5 and 3.13.0-rc1 but works with 3.11.9.

This is caused (TTBOMK) by this code in base_events.py:

        try:
            await self._serving_forever_fut
        except exceptions.CancelledError:
            try:
                self.close()
                await self.wait_closed()
            finally:
                raise
        finally:
            self._serving_forever_fut = None

I believe this to be related to the wait_closed() changes, 5d09d11aa0b89aeba187f4f520728ccaf4fc5ac1, 26553695592ad399f735d4dbaf32fd871d0bb1e1 etc. (Cc @gvanrossum). Related issues #104344 and #113538.

Per @gvanrossum in https://github.com/python/cpython/issues/113538#issuecomment-1874724821: "In 3.10 and before, server.wait_closed() was a no-op, unless you called it before server.close(), in a task (asyncio.create_task(server.wait_closed())). The unclosed connection was just getting abandoned."

CancelledError() is caught here, which awaits wait_closed() before re-raising the exception. In 3.12+, wait_closed()... actually waits for the connection to close, as intended. However, while this prevents the reader task from being abandoned, it allows neither the callers of reader.read() nor of serve_forever() to catch CancelledError() and clean up (such as actually closing the connection, potentially after e.g. signaling the server close to the client through whatever protocol is implemented here).

Basically no user code is executed until the client across the network drops the connection.

As far as I know, it's currently impossible to handle SIGINT cleanly with clients blocked in a read() without messing with deep asyncio/selector internals, which seems like a pretty serious limitation. Have I missed something?

CPython versions tested on:

3.11, 3.12, 3.13

Operating systems tested on:

Linux

Linked PRs

gvanrossum commented 1 month ago

I will try to look at this when I'm at the core sprint, Sept 23-27.

gvanrossum commented 2 weeks ago

@paravoid The above code doesn't work; it crashes on line 6. I guess you didn't actually run it before filing the bug? Also I don't know what nc does with these parameters, could you please add a client that triggers the issue so I can understand, or at least explain what this nc invocation does?

Finally, your code differs (apart from the crash) from the original code in several ways:

  • The handler never writes back to the writer
  • It has a try/except around serve_forever()

Which of these is significant?

paravoid commented 2 weeks ago

@gvanrossum Thanks for taking a look! I did run the code but tried simplifying it after the fact (it's all in the GitHub history). Fixed now, it was a s/line/data/. Sorry about that!

nc just connects and waits for input. There is no need to do anything on the socket. Instead of nc you can also do: python3 -c "import socket, time; s = socket.socket(socket.AF_INET, socket.SOCK_STREAM); s.connect(('localhost', 8888)); time.sleep(100)" in another window.
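Spelled out as a small script, that one-liner is just:

import socket
import time

# Connect to the server under test and then just sit there,
# keeping the connection open without sending any data.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("localhost", 8888))
time.sleep(100)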

Finally, your code differs (apart from the crash) from the original code in several ways:

  • The handler never writes back to the writer
  • It has a try/except around serve_forever()

Which of these is significant?

Neither of the two is significant. The write was removed for simplicity (it doesn't matter). The try/except is there to make it easier to see that error handling does work when a client is not connected.

gvanrossum commented 2 weeks ago

My first reaction, after putting a bunch of print calls in the code you quoted in Server.close() in base_events.py, is that this was done intentionally (but without thinking through all the consequences).

The cure would then be to put a timeout in each handler, e.g. async with asyncio.timeout(5) around the await reader.read(100). Then the behavior will be:

You could also catch the TimeoutError and continue to close the connection (so you don't get a traceback logged). However, it looks like in this case you will hang in the handler's await writer.wait_closed() until the client cooperates (which might be never), so you still have the same problem. It seems that a secondary bug is that that await doesn't honor timeouts -- this is something we might be able to fix separately.
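Concretely, a sketch of that timeout workaround (the 5 seconds are arbitrary, and the handler shape is only illustrative):

import asyncio

async def handle_echo(reader, writer):
    try:
        # Give up on the read after 5 seconds instead of blocking forever.
        async with asyncio.timeout(5):
            data = await reader.read(100)
    except TimeoutError:
        # No data arrived in time; fall through and close the connection.
        data = b""
    writer.close()
    # NB: as noted above, this await can still hang until the client cooperates.
    await writer.wait_closed()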

In any case, this gives a workaround for 3.12 and 3.13 (the latter isn't out yet but has been in feature freeze for months) but it's far from perfect. I'm not sure what to do to make the experience better without adding timeouts to every handler. Maybe the solution is just to send CancelledError into every handler that's still active when the server is closed? That's close to what I assumed was happening before 3.12 -- the server process responds immediately (?) to ^C, so the handlers are effectively abandoned.

What do you think, @paravoid ?

gvanrossum commented 2 weeks ago

Here's a diff that appears to fix your issue and break no tests (though it clearly needs a new test that will demonstrate it changed something).

index ffcc0174e1e..a42c4ec0037 100644
--- a/Lib/asyncio/base_events.py
+++ b/Lib/asyncio/base_events.py
@@ -381,6 +381,7 @@ async def serve_forever(self):
         except exceptions.CancelledError:
             try:
                 self.close()
+                self.close_clients()
                 await self.wait_closed()
             finally:
                 raise
paravoid commented 2 weeks ago

Here's a diff that appears to fix your issue and break no tests (though it clearly needs a new test that will demonstrate it changed something).

Ah! That's neat. I originally discovered this in 3.12 and only later reproduced it in 3.13, so I hadn't noticed the addition of close_clients(). This fix seems so simple in retrospect! I tried it out and it seems to work as intended with my limited testing.

In light of this fix it may be entirely moot now, but regarding your previous message: adding arbitrarily-valued timeouts doesn't strike me as a great solution. The value may be too short for legitimate use cases, forcing users to reimplement retries in a loop around read(), etc. It'll be even more of a nightmare with readline(). That's going to be a lot of tricky boilerplate code, to address pitfalls similar to the ones that PEP 475 (Retry system calls failing with EINTR) tried to address, IMHO. At the same time, the timeout value may also be too long, causing a noticeable lag on shutdown.

gvanrossum commented 2 weeks ago

I will have to ponder whether it's okay to backport this to 3.12 and 3.13. I can't decide whether it's a new feature that might break code that worked before in 3.12 or 3.13, or just a bug fix (which would qualify for backporting). To deal with the cancellation, handlers that want to send a final goodbye to their clients will now have to catch CancelledError. Then again in 3.12/3.13 they currently have no way to tell that the server is being closed, so, yeah, I'm thinking of it more and more as a bugfix that more or less restores the old (3.11 and earlier) behavior.
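For example, a handler that wants to say goodbye would look roughly like this (sketch only; the exact protocol details are up to the application, and whether the goodbye actually reaches the client depends on how far the shutdown has progressed):

import asyncio

async def handle_client(reader, writer):
    try:
        data = await reader.read(100)
        # ... normal request handling ...
    except asyncio.CancelledError:
        # Server is shutting down: queue a final goodbye, then let the
        # cancellation propagate.
        writer.write(b"server going away\n")
        raise
    finally:
        writer.close()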

I have to come up with a test case too. :-(

paravoid commented 3 days ago

Thanks so much for pushing a fix! My (non-strong) opinion is that this deserves to be fixed in a minor release, both because it's a regression compared to 3.11, and because it's hard to work around in downstream code, at least without reimplementing some of stdlib's internals. (For 3.12 it's even worse, because close_clients() doesn't even exist there.)
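For completeness, the closest thing to a user-level workaround I could come up with on 3.13 is to sidestep serve_forever() and sequence the shutdown by hand; an untested sketch (this doesn't help on 3.12, since close_clients() is missing there):

import asyncio

async def handle_echo(reader, writer):
    data = await reader.read(100)
    writer.close()
    await writer.wait_closed()

async def main():
    server = await asyncio.start_server(handle_echo, "127.0.0.1", 8888)
    async with server:
        try:
            # Stand-in for serve_forever(): wait until cancelled by Ctrl+C.
            await asyncio.Event().wait()
        except asyncio.CancelledError:
            server.close()          # stop accepting new connections
            server.close_clients()  # 3.13+ only; closes active client connections
            await server.wait_closed()

asyncio.run(main())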

gvanrossum commented 3 days ago

I will try to find time to complete the PR. I will have to ask the 3.12 and 3.13 release managers for their opinion on the backports.