conan-o-chang opened this issue 4 years ago
What was your write_timeout value? Pods do not die instantly but are configured to stay for a grace period.
I'd like you to show how to reproduce this on Kubernetes, not with an arbitrary docker kill command as that's not necessarily the correct workflow and behaviour.
/set title: Support issue with Kubernetes
@alexellis Thank you so much for your comment. The reason was that I had set the write_timeout of the function too high. After setting write_timeout correctly, those pods close as expected. Sorry for misunderstanding the meaning of write_timeout.
Here is my testing code: https://github.com/conan-o-chang/openfaas-long-running-task
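For context, the write_timeout in question is set through the function's environment in the OpenFaaS stack file. A hedged sketch of such a fragment (the function name, image, and values here are illustrative, not taken from my repository):

```yaml
# Hypothetical stack-file fragment; names, image, and values are illustrative.
functions:
  python3-flask-long-running:
    lang: python3-flask
    handler: ./python3-flask-long-running
    image: conan/python3-flask-long-running:latest
    environment:
      read_timeout: "10m"
      write_timeout: "10m"  # keep this close to the function's real run time
      exec_timeout: "10m"
```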
Hi @alexellis, sorry for bothering you again.
I find that if I set a high write_timeout on my function, the issue still happens: fwatchdog keeps running after fprocess has finished, which leaves the Kubernetes pods stuck in Terminating.
faas template store pull python3-flask
Deploy the function: faas up --skip-push -f python3-flask-long-running.yml
kubectl -n openfaas-fn patch deploy python3-flask-long-running -p "{\"spec\":{\"template\":{\"spec\":{\"terminationGracePeriodSeconds\": 600}}}}"
# We get 5 pods of the function.
Every 2.0s: kubectl get po -n openfaas-fn                    Mon Jul 6 02:05:20 2020

NAME                                          READY   STATUS    RESTARTS   AGE
python3-flask-long-running-5b6684b647-dv92b   1/1     Running   0          16s
python3-flask-long-running-5b6684b647-f445m   1/1     Running   0          11s
python3-flask-long-running-5b6684b647-jtvtl   1/1     Running   0          11s
python3-flask-long-running-5b6684b647-nprh5   1/1     Running   0          11s
python3-flask-long-running-5b6684b647-tk2mt   1/1     Running   0          11s
Scale down the function: kubectl scale deploy/python3-flask-long-running --replicas=1 -n openfaas-fn
# 4 pods are terminating
Every 2.0s: kubectl get po -n openfaas-fn                    Mon Jul 6 02:05:35 2020

NAME                                          READY   STATUS        RESTARTS   AGE
python3-flask-long-running-5b6684b647-dv92b   1/1     Running       0          31s
python3-flask-long-running-5b6684b647-f445m   0/1     Terminating   0          26s
python3-flask-long-running-5b6684b647-jtvtl   0/1     Terminating   0          26s
python3-flask-long-running-5b6684b647-nprh5   1/1     Terminating   0          26s
python3-flask-long-running-5b6684b647-tk2mt   1/1     Terminating   0          26s
Check the running pod: kubectl -n openfaas-fn exec -ti pods/python3-flask-long-running-5b6684b647-dv92b -- sh
# both fwatchdog and gunicorn (the fprocess) are running
~ $ ps
PID   USER   TIME   COMMAND
  1   app    0:00   fwatchdog
 11   app    0:00   {gunicorn} /usr/local/bin/python /home/app/.local/bin/gunicorn index
 15   app    0:00   {gunicorn} /usr/local/bin/python /home/app/.local/bin/gunicorn index
 20   app    0:00   sh
 26   app    0:00   ps
Check the terminating pod: kubectl -n openfaas-fn exec -ti pods/python3-flask-long-running-5b6684b647-f445m -- sh
# fwatchdog keeps running even though gunicorn has finished.
~ $ ps
PID   USER   TIME   COMMAND
  1   app    0:00   fwatchdog
 20   app    0:00   sh
 25   app    0:00   ps
Finally, the pods are gone after 10 minutes (equal to the function's write_timeout).
Every 2.0s: kubectl get po -n openfaas-fn                    Mon Jul 6 02:15:26 2020

NAME                                          READY   STATUS        RESTARTS   AGE
python3-flask-long-running-5b6684b647-dv92b   1/1     Running       0          10m
python3-flask-long-running-5b6684b647-f445m   0/1     Terminating   0          10m
python3-flask-long-running-5b6684b647-jtvtl   0/1     Terminating   0          10m
python3-flask-long-running-5b6684b647-nprh5   0/1     Terminating   0          10m
python3-flask-long-running-5b6684b647-tk2mt   0/1     Terminating   0          10m

Every 2.0s: kubectl get po -n openfaas-fn                    Mon Jul 6 02:15:50 2020

NAME                                          READY   STATUS    RESTARTS   AGE
python3-flask-long-running-5b6684b647-dv92b   1/1     Running   0          10m
To find out what happens to a pod during termination, I deleted the running pod while following its logs: kubectl logs -n openfaas-fn deploy/python3-flask-long-running -f
Forking - gunicorn [index:app -b 127.0.0.1:5000 --log-level debug --timeout=600 --graceful-timeout=600]
2020/07/05 18:05:05 Started logging stderr from function.
2020/07/05 18:05:05 Started logging stdout from function.
2020/07/05 18:05:05 OperationalMode: http
2020/07/05 18:05:05 Timeouts: read: 10m0s, write: 10m0s hard: 10m0s.
2020/07/05 18:05:05 Listening on port: 8080
2020/07/05 18:05:05 Writing lock-file to: /tmp/.lock
2020/07/05 18:05:05 Metrics listening on port: 8081
...
2020/07/05 18:05:05 stderr: [2020-07-05 18:05:05 +0000] [11] [INFO] Starting gunicorn 20.0.4
2020/07/05 18:05:05 stderr: [2020-07-05 18:05:05 +0000] [11] [DEBUG] Arbiter booted
2020/07/05 18:05:05 stderr: [2020-07-05 18:05:05 +0000] [11] [INFO] Listening at: http://127.0.0.1:5000 (11)
2020/07/05 18:05:05 stderr: [2020-07-05 18:05:05 +0000] [11] [INFO] Using worker: sync
2020/07/05 18:05:05 stderr: [2020-07-05 18:05:05 +0000] [15] [INFO] Booting worker with pid: 15
2020/07/05 18:05:05 stderr: [2020-07-05 18:05:05 +0000] [11] [DEBUG] 1 workers
2020/07/05 18:35:12 SIGTERM received.. shutting down server in 10m0s
2020/07/05 18:35:12 Removing lock-file : /tmp/.lock
2020/07/05 18:35:12 stderr: [2020-07-05 18:35:12 +0000] [11] [INFO] Handling signal: term
2020/07/05 18:35:12 stderr: [2020-07-05 18:35:12 +0000] [15] [INFO] Worker exiting (pid: 15)
2020/07/05 18:35:12 stderr: [2020-07-05 18:35:12 +0000] [11] [INFO] Shutting down: Master
2020/07/05 18:45:12 No new connections allowed. Exiting in: 10m0s
rpc error: code = Unknown desc = Error: No such container: ba335b88f0f7862682be186c20d78c0ac46500a3973b2186e93fd76cbba8a80d
Expected Behaviour
When we send a SIGTERM to fwatchdog, it should exit after fprocess has gracefully shut down.
Current Behaviour
When we send a SIGTERM to fwatchdog, it keeps running even after fprocess has gracefully shut down (the exit code of fprocess is 0). In the other case, when fprocess exits with an error (a non-zero exit code), the watchdog logs "Forked function has terminated: ..." and kills itself.
Possible Solution
Exit fwatchdog itself when fprocess exits. It seems to relate to https://github.com/openfaas-incubator/of-watchdog/blob/c796e1b714d703c90bf6e3f392471746e0aeab2d/executor/http_runner.go#L70-L76. One idea is to exit fwatchdog whether or not fprocess died with an error, but I am not sure whether this change would have any side effects.
Steps to Reproduce (for bugs)
I am using the python3-flask template, and fprocess is gunicorn:
faas template store pull python3-flask
faas new python3-flask-gracefully-shutdown --lang python3-flask
echo 'gunicorn' > python3-flask-gracefully-shutdown/requirements.txt
faas build -f python3-flask-gracefully-shutdown.yml
docker run --rm -ti -e fprocess='gunicorn index:app' -p 8080:8080 --name python3-flask-gracefully-shutdown python3-flask-gracefully-shutdown
docker kill --signal=SIGTERM python3-flask-gracefully-shutdown
Context
I am using OpenFaaS for long-running tasks with auto-scaling on Kubernetes. One day I found that many pods had not terminated correctly (stuck in Terminating status), which wastes resources.
Your Environment
docker version: Docker version 19.03.8, build afacb8b