Open Cacsjep opened 2 years ago
Have you reported this to uvicorn/starlette bug tracker? Can you provide a reproducer with only asyncio and no third-party packages? Have you tested main branch?
I will try to create a reproducer with only asncio, I just testet latest released version and not main branch.
When a OS error happen on the finish_accept function then _BaseProactorEventLoop._start_serving.loop_ falls into that OSError except block and close the listener sock at line 864.
But its not the listening sock that produce this OS Error it is the connection socket, so an OS Error on connection sock leads into closing the listening sock.
Connection socket after this issue happen:
<socket.socket fd=848, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=0, laddr=('127.0.0.1', 8088), raddr=(0, b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00')> (0, b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00')
While i am trying to create a reproducer with asyncio could you tell me, is that a expected behaviour?
Thank u
Tested with latest main branch and also can reproduce this error. Here is a reproducer with asyncio, hope that helps.
Server
import asyncio
class Serve(asyncio.Protocol):
def connection_made(self, transport):
self.transport = transport
def data_received(self, data):
self.transport.write(b"OK")
async def main():
loop = asyncio.get_running_loop()
server = await loop.create_server(lambda: Serve(), '127.0.0.1', 8088)
async with server:
await server.serve_forever()
asyncio.run(main(), debug=True)
Client
import threading
import requests
import sys
def create():
for i in range(100):
requests.post("http://127.0.0.1:8088/", data="acs")
if __name__ == "__main__":
for i in range(100):
thread = threading.Thread(target=create, args=(), daemon=True)
thread.start()
sys.exit()
Latest Findings
When AcceptEx encounter an ERROR_NETNAME_DELETED with WSAGetLastError() (overlapped.c), this leads into an OSError when ov.getresult() is called at windows_events.py:576, this OSError is then catched up by BaseProactorEventLoop proactor_events.py:856 and leads into closing the listening sock.
The "real" underlying error that on my case occurs is a "WSAECONNRESET" while accepting the connection socket.
How can we deal with WSAECONNRESET or ERROR_NETNAME_DELETED errors when accpeting a socket without closing the listening socket, a easy fix is just closing accept socket and raise a ConnectionResetError that is cached from BaseProactorEventLoop serving loop.
--- a/Lib/asyncio/windows_events.py
+++ b/Lib/asyncio/windows_events.py
@@ -577,7 +577,13 @@ def accept(self, listener):
ov.AcceptEx(listener.fileno(), conn.fileno())
def finish_accept(trans, key, ov):
- ov.getresult()
+ try:
+ ov.getresult()
+ except OSError as exc:
+ if exc.winerror in (_overlapped.ERROR_NETNAME_DELETED,
+ _overlapped.ERROR_OPERATION_ABORTED):
+ conn.close()
+ raise ConnectionResetError(*exc.args)
+ raise
# Use SO_UPDATE_ACCEPT_CONTEXT so getsockname() etc work.
buf = struct.pack('@P', listener.fileno())
conn.setsockopt(socket.SOL_SOCKET,
index ddb9daca02..4b3d24c4e8 100644
--- a/Lib/asyncio/proactor_events.py
+++ b/Lib/asyncio/proactor_events.py
@@ -854,6 +854,8 @@ def loop(f=None):
if self.is_closed():
return
f = self._proactor.accept(sock)
+ except ConnectionResetError:
+ self.call_soon(loop)
except OSError as exc:
if sock.fileno() != -1:
self.call_exception_handler({
Can you post a diff rather than copying functions as it will easier to apply your changes? Also seems like https://github.com/python/cpython/pull/18199 is similar to this issue.
Sure, I updated my latest comment. Yes looks similar.
The PR linked above (gh-18199) is more appropriate for the issue in recv()
and friends, gh-83037. The current issue is about accept()
. A separate fix needs to be developed.
Assuming the best reaction to ERROR_NETNAME_DELETED
is indeed retrying, there would be a number of options. We could special-case that error in proactor_events.py::BaseProactorEventLoop._start_serving.loop, or we could silently retry in windows_events.py::accept.finish_accept.
I have another niggling suspicion. There is a background Task
involved here, created by
tasks.ensure_future(coro, loop=self._loop)
in windows_events.py::accept. It is known that sometimes such tasks get garbage-collected before they've completed (e.g. https://github.com/python/cpython/pull/96323). But really, I don't think it is the cause of @iUnknwn's problem, based on the logs he sent me privately (those show that task as completed with exception=OSError(22, 'The specified network name is no longer available', None, 64, None)
).
Anyway, if anyone knows how to reproduce getting ERROR_NETNAME_DELETED from an accept() call, please let us know!
@kumaraditya303 Have you ever heard of ERROR_NETNAME_DELETED?
I remember this, it is a very annoying error ERROR_NETNAME_DELETED and hard to reproduce, I don't remember much now and haven't used asyncio
on Windows in a while.
Could it be caused by IIS in reverse proxy mode?
One thing I remember is that when I was getting this I looked at how other programming languages handle it and IIRC golang just retries in this case. This might have changed as it's been a while since then.
Could it be caused by IIS in reverse proxy mode?
I don't know. Maybe @eryksun can shed some light on this issue?
Winsock special cases several status codes returned by the AFD device that implements socket files. For example:
NTSTATUS | Winsock Error | Generic Windows Error |
---|---|---|
STATUS_LOCAL_DISCONNECT | WSAECONNABORTED | ERROR_NETNAME_DELETED |
STATUS_CONNECTION_RESET | WSAECONNRESET | ERROR_NETNAME_DELETED |
STATUS_REMOTE_DISCONNECT | WSAECONNRESET | ERROR_NETNAME_DELETED |
If you're using asynchronous AcceptEx()
and rely on the last error that's set by GetOverlappedResult()
, then you'll get the generic Windows error ERROR_NETNAME_DELETED
instead of the specialized Winsock error. The Overlapped
type could support getting the last NT status that's set in overlapped.Internal
. This could be used to disambiguate ConnectionAbortedError
and ConnectionResetError
in this case.
Either way it means the connection was reset or disconnected, it's the same to asyncio. So it seems accept()
should probably just retry without doing anything (there's no connection it can return). Perhaps in debug mode it could log something at the debug level. For recv()
and friends we should discuss this in gh-83037.
Looking for a volunteer to implement the proposed fix from my previous comment -- accept()
should just retry instead of closing the socket, when ERROR_NETNAME_DELETED
occurs. The code from gh-18199 can serve as an example.
Latest Findings
When AcceptEx encounter an ERROR_NETNAME_DELETED with WSAGetLastError() (overlapped.c), this leads into an OSError when ov.getresult() is called at windows_events.py:576, this OSError is then catched up by BaseProactorEventLoop proactor_events.py:856 and leads into closing the listening sock.
The "real" underlying error that on my case occurs is a "WSAECONNRESET" while accepting the connection socket.
How can we deal with WSAECONNRESET or ERROR_NETNAME_DELETED errors when accpeting a socket without closing the listening socket, a easy fix is just closing accept socket and raise a ConnectionResetError that is cached from BaseProactorEventLoop serving loop.
--- a/Lib/asyncio/windows_events.py +++ b/Lib/asyncio/windows_events.py @@ -577,7 +577,13 @@ def accept(self, listener): ov.AcceptEx(listener.fileno(), conn.fileno()) def finish_accept(trans, key, ov): - ov.getresult() + try: + ov.getresult() + except OSError as exc: + if exc.winerror in (_overlapped.ERROR_NETNAME_DELETED, + _overlapped.ERROR_OPERATION_ABORTED): + conn.close() + raise ConnectionResetError(*exc.args) + raise # Use SO_UPDATE_ACCEPT_CONTEXT so getsockname() etc work. buf = struct.pack('@P', listener.fileno()) conn.setsockopt(socket.SOL_SOCKET,
index ddb9daca02..4b3d24c4e8 100644 --- a/Lib/asyncio/proactor_events.py +++ b/Lib/asyncio/proactor_events.py @@ -854,6 +854,8 @@ def loop(f=None): if self.is_closed(): return f = self._proactor.accept(sock) + except ConnectionResetError: + self.call_soon(loop) except OSError as exc: if sock.fileno() != -1: self.call_exception_handler({
Thanks. This avoids my Uvicorn API (FastAPI) from crashing. Although, the following error continues to appear on the terminal:
OSError: [WinError 64] The specified network name is no longer available
ConnectionResetError: [WinError 64] The specified network name is no longer available
It happens to me when making multiple API requets in a short time to an endpoint which fetches data from another external API.
I stumbled upon this solution after discovering it myself for the same issue mentioned. For me, this bug manifest itself by causing my server to hang up, requiring a hard restart to restore it's ability to serve again.
There are other ov.getresult()
calls (one in accept, one in connect, and one in accept_pipe) - given we can get a client side closing the socket at any point I think it would make sense to catch the errors consistently for each state of the IocpProactor class.
+1 for solving the immediate issue. If we can get this in it would be a step in the right direction.
Hey Guys
Application Description
We discover on our uvicorn server sometimes that the listening socket is closing. The Uvicorn get image uploads from cameras and sometimes or often releated to mobile network conditions the connection is lost. And when a connection is lost on remote site, this error occurs sometimes.
I am not very deep into such low level python code so I tried it hard to find the root couse and this are my results. I hope this results show's that there is sometings and we can talk about that or you can give me hints or help do resolve this issue.
thank you in advance
Bug/Error that occurs
The Error occurs on ov.getresult() (asyncio.windows_events.py:560) OSError: [WinError 64] The specified network name is no longer available.
After that the server is unresponsive and the listening socket is closed. In that case we need to restart the service.
To me it looks like this happens when ov.getresult() is called and the remote host is disconnected. When I just wrap this function in a try\except block the listening socket is not closing and uvicorn detects a connection lost.
Please let me know why just wrapping this into a try block resolve this issue, because I don't know ;-)
Original
Mod (No socket closings)
Error without the try\catch block
Minimal Example to reproduce
Server (Minimal Reproducible Example)
Client (Minimal Reproducible Example)
Tested with: Uvicorn [x] Hypercorn [x]
Environment Python 3.10.5 (tags/v3.10.5:f377153, Jun 6 2022, 16:14:13) [MSC v.1929 64 bit (AMD64)] on win32
Linked PRs