lithops-cloud / lithops

A multi-cloud framework for big data analytics and embarrassingly parallel jobs that provides a universal API for building parallel applications in the cloud ☁️🚀
http://lithops.cloud
Apache License 2.0

Mismatch between `gunicorn --timeout` and GCP Cloud Run `runtime_timeout` #1382

Closed: cisaacstern closed this issue 1 month ago

cisaacstern commented 2 months ago

First of all, just wanted to say how much I appreciate this project! It is truly incredible and a joy to use.

Lately I've been experimenting with the GCP Cloud Run backend and encountered a situation where, despite using the default GCP Cloud Run `runtime_timeout` of 300s, I was seeing function calls being killed by gunicorn at the 30-second mark. From the invoker/client standpoint, this manifests as an HTTP 500 Internal Server Error, and in the Cloud Run logs, it looks like:

...
File "/usr/local/lib/python3.10/site-packages/gunicorn/workers/base.py", line 203, in handle_abort
    sys.exit(1)
SystemExit: 1
Full traceback:

```
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/gunicorn/workers/sync.py", line 135, in handle
    self.handle_request(listener, req, client, addr)
  File "/usr/local/lib/python3.10/site-packages/gunicorn/workers/sync.py", line 178, in handle_request
    respiter = self.wsgi(environ, resp.start_response)
  File "/usr/local/lib/python3.10/site-packages/flask/app.py", line 1498, in __call__
    return self.wsgi_app(environ, start_response)
  File "/usr/local/lib/python3.10/site-packages/flask/app.py", line 1473, in wsgi_app
    response = self.full_dispatch_request()
  File "/usr/local/lib/python3.10/site-packages/flask/app.py", line 880, in full_dispatch_request
    rv = self.dispatch_request()
  File "/usr/local/lib/python3.10/site-packages/flask/app.py", line 865, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)  # type: ignore[no-any-return]
  File "/lithops/lithopsproxy.py", line 58, in run
    function_handler(message)
  File "/lithops/lithops/worker/handler.py", line 83, in function_handler
    python_queue_consumer(0, work_queue, )
  File "/lithops/lithops/worker/handler.py", line 135, in python_queue_consumer
    prepare_and_run_task(task)
  File "/lithops/lithops/worker/handler.py", line 163, in prepare_and_run_task
    run_task(task)
  File "/lithops/lithops/worker/handler.py", line 214, in run_task
    jrp.join(task.execution_timeout)
  File "/usr/local/lib/python3.10/multiprocessing/process.py", line 149, in join
    res = self._popen.wait(timeout)
  File "/usr/local/lib/python3.10/multiprocessing/popen_fork.py", line 40, in wait
    if not wait([self.sentinel], timeout):
  File "/usr/local/lib/python3.10/multiprocessing/connection.py", line 931, in wait
    ready = selector.select(timeout)
  File "/usr/local/lib/python3.10/selectors.py", line 416, in select
    fd_event_list = self._selector.poll(timeout)
  File "/usr/local/lib/python3.10/site-packages/gunicorn/workers/base.py", line 203, in handle_abort
    sys.exit(1)
SystemExit: 1
```

After some head-scratching, I eventually realized that this was because gunicorn was using its default `--timeout 30` and therefore killing workers after 30 seconds. In the custom container I am using, setting `--timeout 300` resolved this issue for me.
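To make the mismatch concrete, here is a minimal sketch of why the shorter of the two limits wins. The function name `effective_request_limit` is hypothetical and not part of Lithops or gunicorn; the only facts it relies on are gunicorn's default `--timeout` of 30 seconds and that a value of 0 disables the worker timeout. It models the sync worker seen in the traceback, where a request that runs longer than the timeout gets the worker aborted.

```python
def effective_request_limit(runtime_timeout: int, gunicorn_timeout: int = 30) -> int:
    """Return the number of seconds a single call can run before being cut off.

    runtime_timeout: the platform-side limit (e.g. Cloud Run's runtime_timeout).
    gunicorn_timeout: the value passed to gunicorn --timeout (its default is 30).
    """
    # gunicorn treats a timeout of 0 as "disabled", so only the platform limit applies.
    if gunicorn_timeout == 0:
        return runtime_timeout
    # Otherwise whichever limit is shorter fires first.
    return min(runtime_timeout, gunicorn_timeout)


# The situation described above: 300s on the platform, gunicorn left at its default.
print(effective_request_limit(300))
# After setting --timeout 300 in the custom container.
print(effective_request_limit(300, 300))
```

Under these assumptions the first call reports the 30-second cutoff and the second the full 300 seconds, which matches the behavior before and after the fix.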

In terms of a possible solution, I did notice that in the Knative backend default image, `--timeout $TIMEOUT` appears to be propagated through to gunicorn, but for GCP Cloud Run, while that variable appears to be set, it is not passed through to gunicorn `--timeout`: https://github.com/lithops-cloud/lithops/blob/41f24cfed6beb996547f1b1546913e7e6116dcde/runtime/gcp_cloudrun/Dockerfile#L50

Would it be correct to guess that passing --timeout $TIMEOUT here would resolve this issue for the default GCP Cloud Run container (on which my custom container is based)?

If so, or if another solution is preferable, I am happy to contribute a PR. Thanks again for all of your work on this! Hopefully I can show my appreciation by making some useful contributions.
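As one possible shape for "another solution", a gunicorn config file could read the same variable instead of relying on a command-line flag. This is only a sketch under stated assumptions: the file name `gunicorn.conf.py`, the helper `resolve_timeout`, and the 300-second fallback are illustrative choices, and it assumes the container exposes `TIMEOUT` as an environment variable, as the Dockerfile appears to. The `timeout` name itself is the gunicorn setting behind `--timeout`.

```python
# gunicorn.conf.py (hypothetical file name; gunicorn config files are plain Python)
import os

# Assumption: fall back to the 300s runtime_timeout default mentioned above.
DEFAULT_TIMEOUT = 300


def resolve_timeout(environ=os.environ, default=DEFAULT_TIMEOUT):
    """Read the worker timeout in seconds from TIMEOUT, falling back to `default`."""
    raw = environ.get("TIMEOUT", "").strip()
    # A non-numeric value raises ValueError, so a bad setting fails at startup.
    return int(raw) if raw else default


# gunicorn picks up known setting names from this module; `timeout` maps to --timeout.
timeout = resolve_timeout()
```

Such a file would be loaded with gunicorn's `-c` option alongside the existing arguments. Passing `--timeout $TIMEOUT` directly in the Dockerfile command, as proposed above, is the smaller change and has the same effect.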

xref https://github.com/lithops-cloud/lithops/issues/1362#issuecomment-2137112180 as (thematically, if not directly) related

JosepSampe commented 2 months ago

Hi @cisaacstern, good catch! Yes, as you stated, the Dockerfile template is missing the `--timeout $TIMEOUT` in the gunicorn command.

In the Lithops default template it is included, so feel free to open a PR and include it in the runtimes/ template.