python / cpython

The Python programming language
https://www.python.org
Other
62.59k stars 30.03k forks source link

When accepting a socket connection and ERROR_NETNAME_DELETED occurs, leads this into a closing of the serving socket (BaseProactorEventLoop) #93821

Open Cacsjep opened 2 years ago

Cacsjep commented 2 years ago

Hey Guys

Application Description

We discover on our uvicorn server sometimes that the listening socket is closing. The Uvicorn get image uploads from cameras and sometimes or often releated to mobile network conditions the connection is lost. And when a connection is lost on remote site, this error occurs sometimes.

I am not very deep into such low level python code so I tried it hard to find the root couse and this are my results. I hope this results show's that there is sometings and we can talk about that or you can give me hints or help do resolve this issue.

thank you in advance

Bug/Error that occurs

The Error occurs on ov.getresult() (asyncio.windows_events.py:560) OSError: [WinError 64] The specified network name is no longer available.

After that the server is unresponsive and the listening socket is closed. In that case we need to restart the service.

To me it looks like this happens when ov.getresult() is called and the remote host is disconnected. When I just wrap this function in a try\except block the listening socket is not closing and uvicorn detects a connection lost.

Please let me know why just wrapping this into a try block resolve this issue, because I don't know ;-)

Original

def finish_accept(trans, key, ov):
     ov.getresult() # Here happen the OSError
     # Use SO_UPDATE_ACCEPT_CONTEXT so getsockname() etc work.
     buf = struct.pack('@P', listener.fileno())
     conn.setsockopt(socket.SOL_SOCKET,
         _overlapped.SO_UPDATE_ACCEPT_CONTEXT, buf)
     conn.settimeout(listener.gettimeout())
     return conn, conn.getpeername()

Mod (No socket closings)

def finish_accept(trans, key, ov):
     try:
         ov.getresult()
     except OSError as Error:
         print("This helps", Error)
     # Use SO_UPDATE_ACCEPT_CONTEXT so getsockname() etc work.
     buf = struct.pack('@P', listener.fileno())
     conn.setsockopt(socket.SOL_SOCKET,
         _overlapped.SO_UPDATE_ACCEPT_CONTEXT, buf)
     conn.settimeout(listener.gettimeout())
     return conn, conn.getpeername()

Error without the try\catch block

Task exception was never retrieved
future: <Task finished name='Task-248' coro=<IocpProactor.accept.<locals>.accept_coro() done, defined at C:\Program Files\Python310\lib\asyncio\windows_events.py:570> exception=OSError(22, 'The specified network name is no longer available', None, 64, None) created at C:\Program Files\Python310\lib\asyncio\tasks.py:636>
source_traceback: Object created at (most recent call last):
  File "Y:\code_projects\pol\pol\server\test\brotle.py", line 31, in <module>
    asyncio.run(serve(Starlette(debug=True, routes=routes), config), debug=True)
  File "C:\Program Files\Python310\lib\asyncio\runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "C:\Program Files\Python310\lib\asyncio\base_events.py", line 633, in run_until_complete
    self.run_forever()
  File "C:\Program Files\Python310\lib\asyncio\windows_events.py", line 321, in run_forever
    super().run_forever()
  File "C:\Program Files\Python310\lib\asyncio\base_events.py", line 600, in run_forever
    self._run_once()
  File "C:\Program Files\Python310\lib\asyncio\base_events.py", line 1888, in _run_once
    handle._run()
  File "C:\Program Files\Python310\lib\asyncio\events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "C:\Program Files\Python310\lib\asyncio\proactor_events.py", line 838, in loop
    f = self._proactor.accept(sock)
  File "C:\Program Files\Python310\lib\asyncio\windows_events.py", line 580, in accept
    tasks.ensure_future(coro, loop=self._loop)
  File "C:\Program Files\Python310\lib\asyncio\tasks.py", line 615, in ensure_future
    return _ensure_future(coro_or_future, loop=loop)
  File "C:\Program Files\Python310\lib\asyncio\tasks.py", line 636, in _ensure_future
    return loop.create_task(coro_or_future)
Traceback (most recent call last):
  File "C:\Program Files\Python310\lib\asyncio\windows_events.py", line 573, in accept_coro
    await future
  File "C:\Program Files\Python310\lib\asyncio\windows_events.py", line 819, in _poll
    value = callback(transferred, key, ov)
  File "C:\Program Files\Python310\lib\asyncio\windows_events.py", line 561, in finish_accept
    ov.getresult()
OSError: [WinError 64] The specified network name is no longer available

Minimal Example to reproduce

  1. Start Server
  2. Start Client

Server (Minimal Reproducible Example)

from starlette.applications import Starlette
from starlette.responses import JSONResponse
from starlette.routing import Route

async def test_case(request):
    return JSONResponse({'hello': 'world'})

routes = [
    Route("/", endpoint=test_case, methods=['POST'])
]

##### Uvicorn Test
import uvicorn
uvicorn.run(
    Starlette(debug=True, routes=routes),
    reload=False,
    log_level="trace",
    host="0.0.0.0",
    port=8088,
)

##### Hypercorn Test
#import asyncio
#from hypercorn.config import Config
#from hypercorn.asyncio import serve

#config = Config()
#config.bind = ["localhost:8088"]
#asyncio.run(serve(Starlette(debug=True, routes=routes), config), debug=True)

Client (Minimal Reproducible Example)

import threading
import requests
import time
import sys

def test():
    for i in range(100):
        requests.post("http://127.0.0.1:8088/")

if __name__ == "__main__":
    for i in range(100):
        thread = threading.Thread(target=test, args=(), daemon=True)
        thread.start()
    time.sleep(1)
    sys.exit()

Tested with: Uvicorn [x] Hypercorn [x]

Environment Python 3.10.5 (tags/v3.10.5:f377153, Jun 6 2022, 16:14:13) [MSC v.1929 64 bit (AMD64)] on win32

Linked PRs

kumaraditya303 commented 2 years ago

Have you reported this to uvicorn/starlette bug tracker? Can you provide a reproducer with only asyncio and no third-party packages? Have you tested main branch?

Cacsjep commented 2 years ago

I will try to create a reproducer with only asncio, I just testet latest released version and not main branch.

When a OS error happen on the finish_accept function then _BaseProactorEventLoop._start_serving.loop_ falls into that OSError except block and close the listener sock at line 864.

But its not the listening sock that produce this OS Error it is the connection socket, so an OS Error on connection sock leads into closing the listening sock.

Connection socket after this issue happen: <socket.socket fd=848, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=0, laddr=('127.0.0.1', 8088), raddr=(0, b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00')> (0, b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00')

While i am trying to create a reproducer with asyncio could you tell me, is that a expected behaviour?

Thank u

Cacsjep commented 2 years ago

Tested with latest main branch and also can reproduce this error. Here is a reproducer with asyncio, hope that helps.

Server

import asyncio

class Serve(asyncio.Protocol):
    def connection_made(self, transport):
        self.transport = transport

    def data_received(self, data):
        self.transport.write(b"OK")

async def main():
    loop = asyncio.get_running_loop()
    server = await loop.create_server(lambda: Serve(), '127.0.0.1', 8088)

    async with server:
        await server.serve_forever()

asyncio.run(main(), debug=True)

Client

import threading
import requests
import sys

def create():
    for i in range(100):
        requests.post("http://127.0.0.1:8088/", data="acs")

if __name__ == "__main__":
    for i in range(100):
        thread = threading.Thread(target=create, args=(), daemon=True)
        thread.start()
    sys.exit()
Cacsjep commented 2 years ago

Latest Findings

When AcceptEx encounter an ERROR_NETNAME_DELETED with WSAGetLastError() (overlapped.c), this leads into an OSError when ov.getresult() is called at windows_events.py:576, this OSError is then catched up by BaseProactorEventLoop proactor_events.py:856 and leads into closing the listening sock.

The "real" underlying error that on my case occurs is a "WSAECONNRESET" while accepting the connection socket.

How can we deal with WSAECONNRESET or ERROR_NETNAME_DELETED errors when accpeting a socket without closing the listening socket, a easy fix is just closing accept socket and raise a ConnectionResetError that is cached from BaseProactorEventLoop serving loop.

--- a/Lib/asyncio/windows_events.py
+++ b/Lib/asyncio/windows_events.py
@@ -577,7 +577,13 @@ def accept(self, listener):
         ov.AcceptEx(listener.fileno(), conn.fileno())

         def finish_accept(trans, key, ov):
-            ov.getresult()
+            try:
+                ov.getresult()
+            except OSError as exc:
+                if exc.winerror in (_overlapped.ERROR_NETNAME_DELETED,
+                                    _overlapped.ERROR_OPERATION_ABORTED):
+                    conn.close()
+                    raise ConnectionResetError(*exc.args)
+                raise
             # Use SO_UPDATE_ACCEPT_CONTEXT so getsockname() etc work.
             buf = struct.pack('@P', listener.fileno())
             conn.setsockopt(socket.SOL_SOCKET,
index ddb9daca02..4b3d24c4e8 100644
--- a/Lib/asyncio/proactor_events.py
+++ b/Lib/asyncio/proactor_events.py
@@ -854,6 +854,8 @@ def loop(f=None):
                 if self.is_closed():
                     return
                 f = self._proactor.accept(sock)
+            except ConnectionResetError:
+                self.call_soon(loop)
             except OSError as exc:
                 if sock.fileno() != -1:
                     self.call_exception_handler({
kumaraditya303 commented 2 years ago

Can you post a diff rather than copying functions as it will easier to apply your changes? Also seems like https://github.com/python/cpython/pull/18199 is similar to this issue.

Cacsjep commented 2 years ago

Sure, I updated my latest comment. Yes looks similar.

gvanrossum commented 1 year ago

The PR linked above (gh-18199) is more appropriate for the issue in recv() and friends, gh-83037. The current issue is about accept(). A separate fix needs to be developed.

Assuming the best reaction to ERROR_NETNAME_DELETED is indeed retrying, there would be a number of options. We could special-case that error in proactor_events.py::BaseProactorEventLoop._start_serving.loop, or we could silently retry in windows_events.py::accept.finish_accept.

I have another niggling suspicion. There is a background Task involved here, created by

        tasks.ensure_future(coro, loop=self._loop)

in windows_events.py::accept. It is known that sometimes such tasks get garbage-collected before they've completed (e.g. https://github.com/python/cpython/pull/96323). But really, I don't think it is the cause of @iUnknwn's problem, based on the logs he sent me privately (those show that task as completed with exception=OSError(22, 'The specified network name is no longer available', None, 64, None)).

Anyway, if anyone knows how to reproduce getting ERROR_NETNAME_DELETED from an accept() call, please let us know!

gvanrossum commented 1 year ago

@kumaraditya303 Have you ever heard of ERROR_NETNAME_DELETED?

kumaraditya303 commented 1 year ago

I remember this, it is a very annoying error ERROR_NETNAME_DELETED and hard to reproduce, I don't remember much now and haven't used asyncio on Windows in a while.

gvanrossum commented 1 year ago

Could it be caused by IIS in reverse proxy mode?

kumaraditya303 commented 1 year ago

One thing I remember is that when I was getting this I looked at how other programming languages handle it and IIRC golang just retries in this case. This might have changed as it's been a while since then.

kumaraditya303 commented 1 year ago

Could it be caused by IIS in reverse proxy mode?

I don't know. Maybe @eryksun can shed some light on this issue?

eryksun commented 1 year ago

Winsock special cases several status codes returned by the AFD device that implements socket files. For example:

NTSTATUS Winsock Error Generic Windows Error
STATUS_LOCAL_DISCONNECT WSAECONNABORTED ERROR_NETNAME_DELETED
STATUS_CONNECTION_RESET WSAECONNRESET ERROR_NETNAME_DELETED
STATUS_REMOTE_DISCONNECT WSAECONNRESET ERROR_NETNAME_DELETED

If you're using asynchronous AcceptEx() and rely on the last error that's set by GetOverlappedResult(), then you'll get the generic Windows error ERROR_NETNAME_DELETED instead of the specialized Winsock error. The Overlapped type could support getting the last NT status that's set in overlapped.Internal. This could be used to disambiguate ConnectionAbortedError and ConnectionResetError in this case.

gvanrossum commented 1 year ago

Either way it means the connection was reset or disconnected, it's the same to asyncio. So it seems accept() should probably just retry without doing anything (there's no connection it can return). Perhaps in debug mode it could log something at the debug level. For recv() and friends we should discuss this in gh-83037.

gvanrossum commented 1 year ago

Looking for a volunteer to implement the proposed fix from my previous comment -- accept() should just retry instead of closing the socket, when ERROR_NETNAME_DELETED occurs. The code from gh-18199 can serve as an example.

fercod commented 1 year ago

Latest Findings

When AcceptEx encounter an ERROR_NETNAME_DELETED with WSAGetLastError() (overlapped.c), this leads into an OSError when ov.getresult() is called at windows_events.py:576, this OSError is then catched up by BaseProactorEventLoop proactor_events.py:856 and leads into closing the listening sock.

The "real" underlying error that on my case occurs is a "WSAECONNRESET" while accepting the connection socket.

How can we deal with WSAECONNRESET or ERROR_NETNAME_DELETED errors when accpeting a socket without closing the listening socket, a easy fix is just closing accept socket and raise a ConnectionResetError that is cached from BaseProactorEventLoop serving loop.

--- a/Lib/asyncio/windows_events.py
+++ b/Lib/asyncio/windows_events.py
@@ -577,7 +577,13 @@ def accept(self, listener):
         ov.AcceptEx(listener.fileno(), conn.fileno())

         def finish_accept(trans, key, ov):
-            ov.getresult()
+            try:
+                ov.getresult()
+            except OSError as exc:
+                if exc.winerror in (_overlapped.ERROR_NETNAME_DELETED,
+                                    _overlapped.ERROR_OPERATION_ABORTED):
+                    conn.close()
+                    raise ConnectionResetError(*exc.args)
+                raise
             # Use SO_UPDATE_ACCEPT_CONTEXT so getsockname() etc work.
             buf = struct.pack('@P', listener.fileno())
             conn.setsockopt(socket.SOL_SOCKET,
index ddb9daca02..4b3d24c4e8 100644
--- a/Lib/asyncio/proactor_events.py
+++ b/Lib/asyncio/proactor_events.py
@@ -854,6 +854,8 @@ def loop(f=None):
                 if self.is_closed():
                     return
                 f = self._proactor.accept(sock)
+            except ConnectionResetError:
+                self.call_soon(loop)
             except OSError as exc:
                 if sock.fileno() != -1:
                     self.call_exception_handler({

Thanks. This avoids my Uvicorn API (FastAPI) from crashing. Although, the following error continues to appear on the terminal: OSError: [WinError 64] The specified network name is no longer available ConnectionResetError: [WinError 64] The specified network name is no longer available

It happens to me when making multiple API requets in a short time to an endpoint which fetches data from another external API.

CoreyBird commented 2 days ago

I stumbled upon this solution after discovering it myself for the same issue mentioned. For me, this bug manifest itself by causing my server to hang up, requiring a hard restart to restore it's ability to serve again.

There are other ov.getresult() calls (one in accept, one in connect, and one in accept_pipe) - given we can get a client side closing the socket at any point I think it would make sense to catch the errors consistently for each state of the IocpProactor class.

+1 for solving the immediate issue. If we can get this in it would be a step in the right direction.