If I start hypercorn like this:
$ pipenv run hypercorn main:app -b localhost:9002 --access-log "-"
And I run curl like this:
$ while [ 0 -eq 0 ]; do curl --ipv4 -sv http://localhost:9002/; if [ $? -ne 0 ]; then echo HELLO; break; fi; done
After a few seconds, I get this error:
[2024-06-09 18:40:33 -0600] [40021] [INFO] 127.0.0.1:52769 - - [09/Jun/2024:18:40:33 -0600] "GET / 1.1" 200 25 "-" "curl/8.6.0"
socket.accept() out of system resource
socket: <asyncio.TransportSocket fd=3, family=2, type=1, proto=0, laddr=('127.0.0.1', 9002)>
Traceback (most recent call last):
File "/opt/homebrew/Cellar/python@3.12/3.12.3/Frameworks/Python.framework/Versions/3.12/lib/python3.12/asyncio/selector_events.py", line 178, in _accept_connection
File "/opt/homebrew/Cellar/python@3.12/3.12.3/Frameworks/Python.framework/Versions/3.12/lib/python3.12/socket.py", line 295, in accept
OSError: [Errno 24] Too many open files
EDIT: I was able to prevent this error by raising the open-file limit:
$ ulimit -n 4000
In my case, lsof -p $PID | wc -l hovers around 1000.
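For completeness, the same limit can also be raised from inside the process; this is just a minimal sketch using the standard library resource module (the programmatic equivalent of ulimit -n 4000, nothing hypercorn-specific):

import resource

# Raise the soft open-file limit to 4000 (assumes the hard limit allows it),
# i.e. the in-process equivalent of `ulimit -n 4000`.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (max(soft, 4000), hard))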
But if I run curl like this:
$ while [ 0 -eq 0 ]; do curl --ipv4 -sv -H "Connection: close" http://localhost:9002/; if [ $? -ne 0 ]; then echo HELLO; break; fi; done
it runs for quite a while.
Looks like this problem was introduced in v0.17.0. But we are running v0.16.0, which is where I was seeing the issues, so I will keep debugging with v0.16.0.
So now I am testing v0.16.0. I started hypercorn:
$ pipenv run hypercorn main:app -b localhost:9002 --access-log "-"
I run apache bench like this:
$ ab -n 10000000 -c 2 -t300 http://localhost:9002/
This is ApacheBench, Version 2.3 <$Revision: 1903618 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking localhost (be patient)
Completed 5000 requests
Completed 10000 requests
Completed 15000 requests
... ->>> GETS STUCK HERE
Also, it seems like setting the keep-alive-timeout to 0 avoids the "too many open files" error.
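For reference, this is roughly what that looks like in config form (a sketch; I am assuming the attribute is named keep_alive_timeout, which I believe the --keep-alive CLI flag maps to):

from hypercorn.config import Config

config = Config()
config.bind = ["localhost:9002"]
# 0 seconds of keep-alive: idle connections are closed right after the
# response, which is what avoided the fd exhaustion in my test.
config.keep_alive_timeout = 0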
But this error doesn't happen if I slow down the curl loop:
$ count=0; while [ 0 -eq 0 ]; do curl --ipv4 -sv http://localhost:9002/; if [ $? -ne 0 ]; then echo HELLO; break; fi; count=$((count+1)); echo $count; sleep 0.01; done
I think I might be getting distracted with something that might not be related to the original issue.
And when I run h2load like this:
$ h2load -n 20000 -c 4 --rps 5 http://localhost:9002/ --verbose
It always says 4000 succeeded.
...
finished in 205.01s, 19.51 req/s, 1.18KB/s
requests: 20000 total, 4004 started, 4000 done, 4000 succeeded, 16000 failed, 16000 errored, 0 timeout
status codes: 4002 2xx, 0 3xx, 0 4xx, 0 5xx
traffic: 241.43KB (247228) total, 37.99KB (38902) headers (space savings 90.74%), 97.66KB (100000) data
min max mean sd +/- sd
time for request: 1.61ms 93.27ms 5.30ms 12.82ms 93.78%
time for connect: 243us 330us 290us 37us 50.00%
time to 1st byte: 5.68ms 5.76ms 5.72ms 41us 100.00%
req/s : 4.88 4.88 4.88 0.00 75.00%
When I run curl like this:
$ count=0;while [ 0 -eq 0 ]; do curl --ipv4 -sv -H "Connection: close" http://localhost:9002/ --http2 --http2-prior-knowledge;if [ $? -ne 0 ]; then echo HELLO; break; fi;count=$((count+1)); echo $count; done
Curl reports an error after 600 or so requests:
* Host localhost:9002 was resolved.
* IPv6: ::1
* IPv4: 127.0.0.1
* Trying 127.0.0.1:9002...
* Connected to localhost (127.0.0.1) port 9002
* [HTTP/2] [1] OPENED stream for http://localhost:9002/
* [HTTP/2] [1] [:method: GET]
* [HTTP/2] [1] [:scheme: http]
* [HTTP/2] [1] [:authority: localhost:9002]
* [HTTP/2] [1] [:path: /]
* [HTTP/2] [1] [user-agent: curl/8.6.0]
* [HTTP/2] [1] [accept: */*]
> GET / HTTP/2
> Host: localhost:9002
> User-Agent: curl/8.6.0
> Accept: */*
> Connection: close
>
* Recv failure: Connection reset by peer
* Failed receiving HTTP2 data: 56(Failure when receiving data from the peer)
* Connection #0 to host localhost left intact
HELLO
I am not really sure my test cases are valid but I would think these curl, ab and h2load executions should go on forever or at least finish the total number of requests.
Alright, it seems the h2load -n option is related to keep_alive_max_requests. Not sure why, but h2load completes successfully if these two numbers match.
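For reference, this is roughly what "matching the two numbers" means on the server side (a sketch; assuming keep_alive_max_requests caps the number of requests per kept-alive connection):

from hypercorn.config import Config

config = Config()
config.bind = ["localhost:9002"]
# h2load is run with -n 20000, so allow the same number of requests
# before the server closes a kept-alive connection.
config.keep_alive_max_requests = 20000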
After a lot of googling, I am reasonably confident that this approach causes the event loop to exit cleanly and allows all connections to finish before terminating. But I am not sure how to incorporate it into the hypercorn code. The meat of the functionality is in the _signal_handler coroutine:
import asyncio
import logging
import signal
from typing import Any

from fastapi import FastAPI
from hypercorn.asyncio import serve
from hypercorn.config import Config

app = FastAPI()
logger = logging.getLogger(__name__)


@app.get("/")
def root():
    return {"message": "Hello World"}


def main():
    config = Config()
    config.application_path = "main:app"
    config.bind = ["localhost:9002"]
    config.loglevel = "INFO"
    config.accesslog = "-"
    config.errorlog = "-"

    shutdown_event = asyncio.Event()

    async def _signal_handler(*_: Any) -> None:
        print("HERE!!!!!!!!")
        # Tell serve() to stop accepting new connections and begin shutdown.
        shutdown_event.set()
        # Wait for every other task (i.e. in-flight requests) to finish.
        tasks = [t for t in asyncio.all_tasks() if t is not asyncio.current_task()]
        logger.info(f"Cancelling {len(tasks)} outstanding tasks")
        await asyncio.gather(*tasks, return_exceptions=True)
        logging.info("Flushing metrics")

    loop = asyncio.get_event_loop()
    for s in (signal.SIGHUP, signal.SIGTERM, signal.SIGINT):
        print(f"Handling {s} ...")
        loop.add_signal_handler(s, lambda s=s: asyncio.create_task(_signal_handler(s, loop)))

    loop.run_until_complete(
        serve(app, config, shutdown_trigger=shutdown_event.wait)
    )


if __name__ == "__main__":
    main()
When I start this server like:
$ pipenv run python main.py
And target it with this ab test:
$ ab -n 100000 -c 4 http://localhost:9002/
This is ApacheBench, Version 2.3 <$Revision: 1903618 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking localhost (be patient)
Completed 10000 requests
apr_socket_recv: Connection refused (61)
Total of 18376 requests completed
You can see that when I hit CTRL+C in the pipenv terminal, the ab execution correctly received a connection refused instead of a connection reset by peer. I am not sure how to add this functionality to hypercorn so I can test with multiple processes/workers.
LOL! And right after I posted the potential solution above, I am consistently getting the connection reset by peer error. 🤣
Benchmarking localhost (be patient)
Completed 10000 requests
apr_socket_recv: Connection reset by peer (54)
Total of 14011 requests completed
I give up! please help!!!!
Here is what I "think" might be causing the issue. It looks like something in Python land is sending the reset on shutdown, after the client has already initiated a connection and the server has acknowledged it. I have no idea how to fix it.
Firstly, I am sorry for going down this incorrect path regarding this problem. It turns out the root cause was our Python process was dying because of a segmentation fault.
Secondly, I found the issue by enabling faulthandler. This logged the stacktrace in the Cloud Run logs along with the HTTP 503 error we have been investigating. So, if your Python process is mysteriously disappearing, turn on faulthandler and you will get all the information you need.
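Enabling it is a one-liner from the standard library (nothing hypercorn-specific), for anyone else chasing a disappearing process:

import faulthandler

# Dump the Python traceback of every thread to stderr when the process
# gets a fatal signal (SIGSEGV, SIGFPE, SIGABRT, SIGBUS, SIGILL).
# Equivalent to running with PYTHONFAULTHANDLER=1 or `python -X faulthandler`.
faulthandler.enable()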
Now to the hard part of fixing the problem. 😄
We have been getting HTTP 503 errors in our API hosted on GCP Cloud Run. I have been trying to chase this problem down for a long time but I haven't made any progress. I will share my suspicions here and hopefully we can resolve this issue. I used the following sample code and have been testing with apache-bench and curl. This is currently limited to HTTP/1.1.
After starting hypercorn with FastAPI, I run apache-bench against it. Then I hit CTRL+C in the terminal where hypercorn is running. I am expecting hypercorn to successfully complete all pending connections, return their responses, close them, then exit and refuse any new connections.
Sample code:
Hypercorn server start with -w 0:
When I run curl like this:
I correctly get the following:
Whereas if I start it like this:
I get the following output:
I have a suspicion that the Connection reset by peer error is causing problems for us in GCP Cloud Run, which ends up returning an HTTP 503 error to our customers.
I am hoping that it should not return 503. I think it should successfully complete all existing connections and close them, and not accept any new connections.
Please let me know what you think.