pgjones / hypercorn

Hypercorn is an ASGI and WSGI server based on the Hyper libraries and inspired by Gunicorn.
MIT License

TCP server keep-alive times out even though data is being received. #235

Open dbrnz opened 1 month ago

dbrnz commented 1 month ago

We have a Quart-based application server running under Hypercorn that, upon receipt of a POST, starts a time-consuming process (using multiprocessing from a separate thread). The initiating client then polls every second with GET requests to obtain information about its long-running process. All well and good, and everything works as expected in a local network environment.
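
For context, here is a simplified sketch of that pattern, assuming Quart and multiprocessing (without the intermediate thread); the endpoint paths, the in-memory `jobs` dict and `long_running_task` are illustrative placeholders rather than the actual application code:

```python
import multiprocessing
import uuid

from quart import Quart, jsonify

app = Quart(__name__)
jobs = {}  # job_id -> multiprocessing.Process


def long_running_task():
    """Placeholder for the time-consuming work run in a separate process."""
    import time
    time.sleep(60)


@app.route("/jobs", methods=["POST"])
async def start_job():
    # POST kicks off the long-running process and returns an identifier.
    job_id = str(uuid.uuid4())
    process = multiprocessing.Process(target=long_running_task)
    process.start()
    jobs[job_id] = process
    return jsonify({"job_id": job_id}), 202


@app.route("/jobs/<job_id>", methods=["GET"])
async def job_status(job_id):
    # The client polls this endpoint roughly once a second.
    process = jobs.get(job_id)
    if process is None:
        return jsonify({"error": "unknown job"}), 404
    return jsonify({"done": not process.is_alive()})
```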

Things start breaking, though, when the server is deployed behind an Apache proxy, with the client occasionally getting 502 proxy errors. After debugging and tracing, and comparing with the working local client, it appears that:

  1. In the localhost case, a new TCPServer is created for each GET request. This server enters idle wait after reading the request and responding; the next read finds EOF (because the `requests`-based client has closed the connection after receiving its response), which causes the task group and the server's writer stream to close, cancelling the idle wait and terminating the server instance.
  2. The proxy case differs in that only two TCPServer instances are created, as Apache keeps socket connections open and reuses them for subsequent requests. Depending on how requests are distributed between the servers, one can time out on keep-alive. For example, with a three-second timeout and one-second polling, if server B handles three consecutive GET requests after server A goes into idle wait, then server A will time out. This timeout results in the underlying socket connection being closed and a subsequent 502 from Apache.

@pgjones does this reasoning make sense? It is certainly consistent with my observations and traces. My workaround is to increase the keep-alive timeout to something like 10 minutes.
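
For reference, a sketch of that workaround using Hypercorn's programmatic API (the `myapp` import and bind address are placeholders; the same setting should also be available on the command line as `--keep-alive`):

```python
import asyncio

from hypercorn.asyncio import serve
from hypercorn.config import Config

from myapp import app  # placeholder import for the Quart application

config = Config()
config.bind = ["0.0.0.0:8000"]
config.keep_alive_timeout = 600  # seconds; the default is 5

asyncio.run(serve(app, config))
```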

pgjones commented 1 month ago

Could the issue be the one explained here? If so, the default keep-alive timeout for both Apache and Hypercorn is 5 seconds, so I'd try Hypercorn at 6 seconds and see if that solves the issue.

dbrnz commented 1 month ago

Thanks for that link -- it certainly explains things! I now retry proxy failures (502, 503 and 504 status codes) and all seems to be well with the default 5-second timeouts, although I will try your suggestion of a six-second Hypercorn timeout to see whether that works instead.
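
For what it's worth, the retry-on-proxy-failure approach can be sketched like this, assuming the polling client uses `requests`; the URL and retry parameters are placeholders:

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

retries = Retry(
    total=3,
    backoff_factor=0.5,                 # 0.5s, 1s, 2s between attempts
    status_forcelist=[502, 503, 504],   # retry on proxy failures
)

session = requests.Session()
session.mount("http://", HTTPAdapter(max_retries=retries))
session.mount("https://", HTTPAdapter(max_retries=retries))

# Poll the placeholder status endpoint; transient 502/503/504 responses
# are retried transparently by the adapter.
response = session.get("https://example.org/jobs/some-job-id")
response.raise_for_status()
```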

BTW, is there a way to have Hypercorn log when it closes a connection because of a timeout?