conan-o-chang opened this issue 4 years ago
What was your write_timeout value? Pods do not die instantly but are configured to stay for a grace period.
I'd like you to show how to reproduce this on Kubernetes, not with an arbitrary docker kill command as that's not necessarily the correct workflow and behaviour.
/set title: Support issue with Kubernetes
@alexellis Thank you so much for your comment. The reason was that I had set the write_timeout of the function too high. After setting write_timeout correctly, those pods close as expected. Sorry for misunderstanding the meaning of write_timeout.
Here is my testing code: https://github.com/conan-o-chang/openfaas-long-running-task
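For context, the write_timeout in question is set through the function's environment in the OpenFaaS stack file. A hedged sketch of such a fragment (the function name, image, and values here are illustrative, not taken from my repository):

```yaml
# Hypothetical stack-file fragment; names, image, and values are illustrative.
functions:
  python3-flask-long-running:
    lang: python3-flask
    handler: ./python3-flask-long-running
    image: conan/python3-flask-long-running:latest
    environment:
      read_timeout: "10m"
      write_timeout: "10m"  # keep this close to the function's real run time
      exec_timeout: "10m"
```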
Hi @alexellis, sorry for bothering you again.
I find that if I set a high write_timeout on my function, the issue still happens: fwatchdog keeps running after fprocess has finished, which leaves the Kubernetes pods stuck in Terminating.
faas template store pull python3-flask
Deploy the function: faas up --skip-push -f python3-flask-long-running.yml
kubectl -n openfaas-fn patch deploy python3-flask-long-running -p "{\"spec\":{\"template\":{\"spec\":{\"terminationGracePeriodSeconds\": 600}}}}"
# We get 5 pods of the function.
Every 2.0s: kubectl get po -n openfaas-fn                    Mon Jul 6 02:05:20 2020

NAME                                          READY   STATUS    RESTARTS   AGE
python3-flask-long-running-5b6684b647-dv92b   1/1     Running   0          16s
python3-flask-long-running-5b6684b647-f445m   1/1     Running   0          11s
python3-flask-long-running-5b6684b647-jtvtl   1/1     Running   0          11s
python3-flask-long-running-5b6684b647-nprh5   1/1     Running   0          11s
python3-flask-long-running-5b6684b647-tk2mt   1/1     Running   0          11s
Scale down the function: kubectl scale deploy/python3-flask-long-running --replicas=1 -n openfaas-fn
# 4 pods are terminating
Every 2.0s: kubectl get po -n openfaas-fn                    Mon Jul 6 02:05:35 2020

NAME                                          READY   STATUS        RESTARTS   AGE
python3-flask-long-running-5b6684b647-dv92b   1/1     Running       0          31s
python3-flask-long-running-5b6684b647-f445m   0/1     Terminating   0          26s
python3-flask-long-running-5b6684b647-jtvtl   0/1     Terminating   0          26s
python3-flask-long-running-5b6684b647-nprh5   1/1     Terminating   0          26s
python3-flask-long-running-5b6684b647-tk2mt   1/1     Terminating   0          26s
Check the running pod: kubectl -n openfaas-fn exec -ti pods/python3-flask-long-running-5b6684b647-dv92b -- sh
# both fwatchdog and gunicorn (the fprocess) are running
~ $ ps
PID   USER   TIME   COMMAND
  1   app    0:00   fwatchdog
 11   app    0:00   {gunicorn} /usr/local/bin/python /home/app/.local/bin/gunicorn index
 15   app    0:00   {gunicorn} /usr/local/bin/python /home/app/.local/bin/gunicorn index
 20   app    0:00   sh
 26   app    0:00   ps
Check the terminating pod: kubectl -n openfaas-fn exec -ti pods/python3-flask-long-running-5b6684b647-f445m -- sh
# fwatchdog keeps running even though gunicorn has finished.
~ $ ps
PID   USER   TIME   COMMAND
  1   app    0:00   fwatchdog
 20   app    0:00   sh
 25   app    0:00   ps
Finally, the pods are gone after 10 minutes (equal to the function's write_timeout).
Every 2.0s: kubectl get po -n openfaas-fn                    Mon Jul 6 02:15:26 2020

NAME                                          READY   STATUS        RESTARTS   AGE
python3-flask-long-running-5b6684b647-dv92b   1/1     Running       0          10m
python3-flask-long-running-5b6684b647-f445m   0/1     Terminating   0          10m
python3-flask-long-running-5b6684b647-jtvtl   0/1     Terminating   0          10m
python3-flask-long-running-5b6684b647-nprh5   0/1     Terminating   0          10m
python3-flask-long-running-5b6684b647-tk2mt   0/1     Terminating   0          10m

Every 2.0s: kubectl get po -n openfaas-fn                    Mon Jul 6 02:15:50 2020

NAME                                          READY   STATUS    RESTARTS   AGE
python3-flask-long-running-5b6684b647-dv92b   1/1     Running   0          10m
To find out what happens to a pod during termination, I deleted the running pod while following its logs: kubectl logs -n openfaas-fn deploy/python3-flask-long-running -f
Forking - gunicorn [index:app -b 127.0.0.1:5000 --log-level debug --timeout=600 --graceful-timeout=600]
2020/07/05 18:05:05 Started logging stderr from function.
2020/07/05 18:05:05 Started logging stdout from function.
2020/07/05 18:05:05 OperationalMode: http
2020/07/05 18:05:05 Timeouts: read: 10m0s, write: 10m0s hard: 10m0s.
2020/07/05 18:05:05 Listening on port: 8080
2020/07/05 18:05:05 Writing lock-file to: /tmp/.lock
2020/07/05 18:05:05 Metrics listening on port: 8081
...
2020/07/05 18:05:05 stderr: [2020-07-05 18:05:05 +0000] [11] [INFO] Starting gunicorn 20.0.4
2020/07/05 18:05:05 stderr: [2020-07-05 18:05:05 +0000] [11] [DEBUG] Arbiter booted
2020/07/05 18:05:05 stderr: [2020-07-05 18:05:05 +0000] [11] [INFO] Listening at: http://127.0.0.1:5000 (11)
2020/07/05 18:05:05 stderr: [2020-07-05 18:05:05 +0000] [11] [INFO] Using worker: sync
2020/07/05 18:05:05 stderr: [2020-07-05 18:05:05 +0000] [15] [INFO] Booting worker with pid: 15
2020/07/05 18:05:05 stderr: [2020-07-05 18:05:05 +0000] [11] [DEBUG] 1 workers
2020/07/05 18:35:12 SIGTERM received.. shutting down server in 10m0s
2020/07/05 18:35:12 Removing lock-file : /tmp/.lock
2020/07/05 18:35:12 stderr: [2020-07-05 18:35:12 +0000] [11] [INFO] Handling signal: term
2020/07/05 18:35:12 stderr: [2020-07-05 18:35:12 +0000] [15] [INFO] Worker exiting (pid: 15)
2020/07/05 18:35:12 stderr: [2020-07-05 18:35:12 +0000] [11] [INFO] Shutting down: Master
2020/07/05 18:45:12 No new connections allowed. Exiting in: 10m0s
rpc error: code = Unknown desc = Error: No such container: ba335b88f0f7862682be186c20d78c0ac46500a3973b2186e93fd76cbba8a80d
Expected Behaviour
When we send a SIGTERM to fwatchdog, it should exit after fprocess has gracefully shut down.
Current Behaviour
When we send a SIGTERM to fwatchdog, it keeps running even after fprocess has gracefully shut down (the exit code of fprocess is 0). In the other case, when fprocess exits with an error (a non-zero exit code), the watchdog logs "Forked function has terminated: ..." and kills itself.
Possible Solution
Exit fwatchdog itself when fprocess exits. It seems to relate to https://github.com/openfaas-incubator/of-watchdog/blob/c796e1b714d703c90bf6e3f392471746e0aeab2d/executor/http_runner.go#L70-L76. One idea is to exit fwatchdog whether or not fprocess died with an error, but I am not sure whether this change would have any side effects.
Steps to Reproduce (for bugs)
I am using the python3-flask template, and fprocess is gunicorn:
faas template store pull python3-flask
faas new python3-flask-gracefully-shutdown --lang python3-flask
echo 'gunicorn' > python3-flask-gracefully-shutdown/requirements.txt
faas build -f python3-flask-gracefully-shutdown.yml
docker run --rm -ti -e fprocess='gunicorn index:app' -p 8080:8080 --name python3-flask-gracefully-shutdown python3-flask-gracefully-shutdown
docker kill --signal=SIGTERM python3-flask-gracefully-shutdown
Context
I am using OpenFaaS for long-running tasks with auto-scaling on Kubernetes. One day I found that many pods had not terminated correctly (stuck in Terminating status), which wastes resources.
Your Environment
docker version: Docker version 19.03.8, build afacb8b