pgjones / hypercorn

Hypercorn is an ASGI and WSGI Server based on Hyper libraries and inspired by Gunicorn.

It seems hypercorn is not closing connections properly #241

Closed: nabheet closed this issue 5 months ago

nabheet commented 5 months ago

We have been getting HTTP 503 errors from our API hosted on GCP Cloud Run. I have been trying to chase this problem down for a long time but haven't made any progress, so I will share my suspicions here and hopefully we can resolve the issue. I used the following sample code and have been testing with apache-bench and curl; this is currently limited to HTTP/1.1. After starting hypercorn with FastAPI, I run apache-bench against it and then hit CTRL+C in the terminal where hypercorn is running. I expect hypercorn to successfully complete all pending connections, return their responses, close them, then exit and refuse any new connections.

Sample code:

from fastapi import FastAPI
from hypercorn.config import Config
from hypercorn.run import run

app = FastAPI()

@app.get("/")
def root():
    return {"message": "Hello World"}

Starting the Hypercorn server with -w 0:

$ pipenv run hypercorn main:app -b localhost:9002 -w 0
[2024-06-09 16:59:20 -0600] [21585] [INFO] Running on http://127.0.0.1:9002 (CTRL + C to quit)
^C

When I run curl like this:

$ while [ 0 -eq 0 ]; do curl --ipv4 -sv -H "Connection: close" http://localhost:9002/; if [ $? -ne 0 ]; then echo HELLO; break; fi; done

I correctly get the following:

... # WHOLE BUNCH OF OUTPUT UNTIL I HIT CTRL+C ON TERMINAL WITH HYPERCORN
< HTTP/1.1 200
< content-length: 25
< content-type: application/json
< date: Sun, 09 Jun 2024 23:40:17 GMT
< server: hypercorn-h11
< Connection: close
<
* Closing connection
{"message":"Hello World"}* Host localhost:9002 was resolved.
* IPv6: ::1
* IPv4: 127.0.0.1
*   Trying 127.0.0.1:9002...
* connect to 127.0.0.1 port 9002 from 127.0.0.1 port 50990 failed: Connection refused
* Failed to connect to localhost port 9002 after 0 ms: Couldn't connect to server
* Closing connection
HELLO

Whereas if I start it like this (without -w 0):

$ pipenv run hypercorn main:app -b localhost:9002
[2024-06-09 17:42:10 -0600] [10806] [INFO] Running on http://127.0.0.1:9002 (CTRL + C to quit)
^C

I get the following output:

... # WHOLE BUNCH OF OUTPUT UNTIL I HIT CTRL+C ON TERMINAL WITH HYPERCORN
< HTTP/1.1 200
< content-length: 25
< content-type: application/json
< date: Sun, 09 Jun 2024 23:42:21 GMT
< server: hypercorn-h11
< Connection: close
<
* Closing connection
{"message":"Hello World"}* Host localhost:9002 was resolved.
* IPv6: ::1
* IPv4: 127.0.0.1
*   Trying 127.0.0.1:9002...
* Connected to localhost (127.0.0.1) port 9002
> GET / HTTP/1.1
> Host: localhost:9002
> User-Agent: curl/8.6.0
> Accept: */*
> Connection: close
>
* Recv failure: Connection reset by peer
* Closing connection
HELLO

I have a suspicion that the "Connection reset by peer" error is what causes problems for us in GCP Cloud Run, which ends up returning an HTTP 503 error to our customers.

I am hoping that it should not return 503. I think it should successfully complete all existing connections and close them, and not accept any new connections.

Please let me know what you think.

nabheet commented 5 months ago

And if I start hypercorn like this:

$ pipenv run hypercorn main:app -b localhost:9002 --access-log "-"

And I run curl like this:

$ while [ 0 -eq 0 ]; do curl --ipv4 -sv http://localhost:9002/; if [ $? -ne 0 ]; then echo HELLO; break; fi; done

After a few seconds, I get this error:

[2024-06-09 18:40:33 -0600] [40021] [INFO] 127.0.0.1:52769 - - [09/Jun/2024:18:40:33 -0600] "GET / 1.1" 200 25 "-" "curl/8.6.0"
socket.accept() out of system resource
socket: <asyncio.TransportSocket fd=3, family=2, type=1, proto=0, laddr=('127.0.0.1', 9002)>
Traceback (most recent call last):
  File "/opt/homebrew/Cellar/python@3.12/3.12.3/Frameworks/Python.framework/Versions/3.12/lib/python3.12/asyncio/selector_events.py", line 178, in _accept_connection
  File "/opt/homebrew/Cellar/python@3.12/3.12.3/Frameworks/Python.framework/Versions/3.12/lib/python3.12/socket.py", line 295, in accept
OSError: [Errno 24] Too many open files

EDIT: I was able to prevent this error by:

ulimit -n 4000

It seems like, in my case, lsof -p $PID | wc -l hovers around 1000 or so.
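
For reference, the same limit can be inspected and raised from inside Python with the stdlib resource module (Unix only). A minimal sketch; the 4096 value is just an illustrative target, in the same spirit as ulimit -n 4000 above:

import resource

# Current soft/hard limits on open file descriptors for this process.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"soft={soft} hard={hard}")

# Raise the soft limit (never above the hard limit); 4096 is an arbitrary
# example value, equivalent in spirit to `ulimit -n 4000`.
resource.setrlimit(resource.RLIMIT_NOFILE, (min(4096, hard), hard))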

nabheet commented 5 months ago

But if I run curl like this:

$ while [ 0 -eq 0 ]; do curl --ipv4 -sv -H "Connection: close" http://localhost:9002/; if [ $? -ne 0 ]; then echo HELLO; break; fi; done

it runs for quite a while.

nabheet commented 5 months ago

Looks like this problem was introduced in v0.17.0.

But we are running v0.16.0, which is where I was seeing the issues, so I will keep debugging with v0.16.0.

nabheet commented 5 months ago

So now I am testing v0.16.0. I started hypercorn:

$ pipenv run hypercorn main:app -b localhost:9002 --access-log "-"

I run apache-bench like this:

$ ab -n 10000000 -c 2 -t300 http://localhost:9002/
This is ApacheBench, Version 2.3 <$Revision: 1903618 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking localhost (be patient)
Completed 5000 requests
Completed 10000 requests
Completed 15000 requests
... ->>> GETS STUCK HERE

nabheet commented 5 months ago

Also, it seems like setting the keep-alive timeout to 0 avoids the "too many open files" error.

But this error doesn't happen if I slow down the curl loop:

count=0;while [ 0 -eq 0 ]; do curl --ipv4 -sv http://localhost:9002/;if [ $? -ne 0 ]; then echo HELLO; break; fi;count=$((count+1)); echo $count;sleep 0.01; done

I think I might be getting distracted with something that might not be related to the original issue.

nabheet commented 5 months ago

And when I run h2load like this:

$ h2load -n 20000 -c 4 --rps 5 http://localhost:9002/ --verbose

It always says 4000 succeeded.

...
finished in 205.01s, 19.51 req/s, 1.18KB/s
requests: 20000 total, 4004 started, 4000 done, 4000 succeeded, 16000 failed, 16000 errored, 0 timeout
status codes: 4002 2xx, 0 3xx, 0 4xx, 0 5xx
traffic: 241.43KB (247228) total, 37.99KB (38902) headers (space savings 90.74%), 97.66KB (100000) data
                     min         max         mean         sd        +/- sd
time for request:     1.61ms     93.27ms      5.30ms     12.82ms    93.78%
time for connect:      243us       330us       290us        37us    50.00%
time to 1st byte:     5.68ms      5.76ms      5.72ms        41us   100.00%
req/s           :       4.88        4.88        4.88        0.00    75.00%

nabheet commented 5 months ago

When I run curl like this:

$ count=0;while [ 0 -eq 0 ]; do curl --ipv4 -sv -H "Connection: close"  http://localhost:9002/ --http2 --http2-prior-knowledge;if [ $? -ne 0 ]; then echo HELLO; break; fi;count=$((count+1)); echo $count; done

Curl reports an error after 600 or so requests:

* Host localhost:9002 was resolved.
* IPv6: ::1
* IPv4: 127.0.0.1
*   Trying 127.0.0.1:9002...
* Connected to localhost (127.0.0.1) port 9002
* [HTTP/2] [1] OPENED stream for http://localhost:9002/
* [HTTP/2] [1] [:method: GET]
* [HTTP/2] [1] [:scheme: http]
* [HTTP/2] [1] [:authority: localhost:9002]
* [HTTP/2] [1] [:path: /]
* [HTTP/2] [1] [user-agent: curl/8.6.0]
* [HTTP/2] [1] [accept: */*]
> GET / HTTP/2
> Host: localhost:9002
> User-Agent: curl/8.6.0
> Accept: */*
> Connection: close
>
* Recv failure: Connection reset by peer
* Failed receiving HTTP2 data: 56(Failure when receiving data from the peer)
* Connection #0 to host localhost left intact
HELLO

I am not really sure my test cases are valid, but I would think these curl, ab, and h2load runs should go on forever, or at least finish the total number of requests.

nabheet commented 5 months ago

Alright, it seems h2load -n is related to keep_alive_max_requests. Not sure why, but h2load completes successfully if these two numbers match.
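
In case it helps anyone reproducing this, both knobs look like they can be pinned down programmatically as Config attributes. This is only a sketch: keep_alive_timeout is a documented setting, while keep_alive_max_requests is assumed here to be the attribute behind the per-connection request limit mentioned above (and only exists in newer releases):

from hypercorn.config import Config

config = Config()
config.bind = ["localhost:9002"]
# Idle keep-alive window in seconds; 0 mirrors the earlier experiment.
config.keep_alive_timeout = 0
# Assumed attribute name: cap on requests served per connection before the
# server closes it; matching h2load's -n made the run complete above.
config.keep_alive_max_requests = 20000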

nabheet commented 5 months ago

After a lot of googling, I am reasonably confident that the approach below causes the event loop to exit cleanly and allows all connections to finish before terminating. But I am not sure how to incorporate it into the hypercorn code.

The meat of the functionality is in the _signal_handler coroutine:

import asyncio
import logging
import signal
from typing import Any

from fastapi import FastAPI
from hypercorn.asyncio import serve
from hypercorn.config import Config

app = FastAPI()

logger = logging.getLogger(__name__)

@app.get("/")
def root():
    return {"message": "Hello World"}

def main():
    config = Config()
    config.application_path = "main:app"
    config.bind = ["localhost:9002"]
    config.loglevel = "INFO"
    config.accesslog = "-"
    config.errorlog = "-"

    shutdown_event = asyncio.Event()

    async def _signal_handler(*_: Any) -> None:
        print("HERE!!!!!!!!")
        # Ask serve() to begin its graceful shutdown ...
        shutdown_event.set()
        # ... then wait for every other task to wind down before returning.
        tasks = [t for t in asyncio.all_tasks() if t is not asyncio.current_task()]
        logger.info(f"Cancelling {len(tasks)} outstanding tasks")
        await asyncio.gather(*tasks, return_exceptions=True)
        logger.info("Flushing metrics")

    # get_event_loop() with no running loop is deprecated on Python 3.12+,
    # so create the loop explicitly.
    loop = asyncio.new_event_loop()
    asyncio.set_event_loop(loop)
    for s in (signal.SIGHUP, signal.SIGTERM, signal.SIGINT):
        print(f"Handling {s} ...")
        loop.add_signal_handler(s, lambda s=s: asyncio.create_task(_signal_handler(s, loop)))

    loop.run_until_complete(
        serve(app, config, shutdown_trigger=shutdown_event.wait)
    )

if __name__ == "__main__":
    main()

When I start this server like this:

$ pipenv run python main.py

And target it with this ab test:

$ ab -n 100000 -c 4 http://localhost:9002/
This is ApacheBench, Version 2.3 <$Revision: 1903618 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking localhost (be patient)
Completed 10000 requests
apr_socket_recv: Connection refused (61)
Total of 18376 requests completed

You can see that when I hit CTRL+C in the pipenv terminal, the ab run correctly received a "connection refused" instead of a "connection reset by peer". I am not sure how to add this functionality to hypercorn so I can test with multiple processes/workers.

nabheet commented 5 months ago

LOL! And right after I posted the potential solution above, I am consistently getting the connection reset by peer error. 🤣

Benchmarking localhost (be patient)
Completed 10000 requests
apr_socket_recv: Connection reset by peer (54)
Total of 14011 requests completed

I give up! Please help!!!!

nabheet commented 5 months ago

Here is what I "think" might be causing the issue: it looks like something in Python land is sending the reset on shutdown, after the client has already initiated a connection and the server has acknowledged it. I have no idea how to fix it.

[attached image]
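
For what it's worth, here is a standalone sketch (plain sockets, not hypercorn code) of one way a client can end up with this kind of reset: the TCP handshake completes into the listener's backlog, and if the listening socket is closed before the connection is ever accepted, the kernel typically resets the queued connection. Port 9003 is arbitrary:

import socket
import threading
import time

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
server.bind(("127.0.0.1", 9003))
server.listen(16)  # handshakes complete into the backlog without accept()

def client() -> None:
    conn = socket.create_connection(("127.0.0.1", 9003))  # handshake succeeds
    time.sleep(0.5)  # give the main thread time to close the listener
    try:
        conn.sendall(b"GET / HTTP/1.1\r\nHost: localhost\r\n\r\n")
        print("client received:", conn.recv(1024))
    except OSError as exc:
        print("client error:", exc)  # typically 'Connection reset by peer'
    finally:
        conn.close()

t = threading.Thread(target=client)
t.start()
time.sleep(0.1)
server.close()  # listener closed with the connection still queued, un-accepted
t.join()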

nabheet commented 5 months ago

Firstly, I am sorry for going down this incorrect path regarding this problem. It turns out the root cause was that our Python process was dying because of a segmentation fault.

Secondly, I found the issue by enabling faulthandler. This logged the stack trace in the Cloud Run logs along with the HTTP 503 errors we have been investigating. So, if your Python process is mysteriously disappearing, turn on faulthandler and you will get all the information you need.
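
For anyone who lands here later: enabling it is a one-liner, since faulthandler is in the standard library. A minimal sketch:

import faulthandler

# Dump the Python traceback of every thread if the process dies from a
# segfault, SIGABRT, etc. It writes to stderr by default, which Cloud Run
# forwards to its logs.
faulthandler.enable()

# Equivalent without touching code: set PYTHONFAULTHANDLER=1 in the
# environment, or run the app with `python -X faulthandler`.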

Now to the hard part of fixing the problem. 😄