openfaas / of-watchdog

Reverse proxy for STDIO and HTTP microservices

Speed up shutdown according to number of connections #125

Closed by alexellis 2 years ago

alexellis commented 2 years ago

Signed-off-by: Alex Ellis (OpenFaaS Ltd) alexellis2@gmail.com

Description

Speed up shutdown according to number of connections

Motivation and Context

Prior to this change, the watchdog would wait for the write_timeout value twice, even if there were no connections in flight.

This adds a new metric for in-flight connections and removes the explicit wait.

This change is a prerequisite for a change in faas-netes: https://github.com/openfaas/faas-netes/issues/853
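
The idea, shown here as a minimal sketch rather than the actual watchdog code: keep a counter of in-flight requests and let the shutdown path return as soon as the counter reaches zero, with write_timeout acting only as an upper bound.

```go
// Minimal sketch of the approach, assuming a package-level in-flight counter
// maintained by an HTTP middleware; illustrative, not the watchdog's source.
package main

import (
	"log"
	"sync/atomic"
	"time"
)

var inFlight int64 // incremented when a request starts, decremented when it ends

// drain returns as soon as no requests are in flight, or once writeTimeout
// has elapsed, whichever happens first.
func drain(writeTimeout time.Duration) {
	deadline := time.Now().Add(writeTimeout)
	for time.Now().Before(deadline) {
		n := atomic.LoadInt64(&inFlight)
		if n == 0 {
			return
		}
		log.Printf("No new connections allowed, draining: %d requests", n)
		time.Sleep(100 * time.Millisecond)
	}
}

func main() {
	atomic.StoreInt64(&inFlight, 2)
	go func() {
		// Simulate the last two requests finishing after one second.
		time.Sleep(time.Second)
		atomic.StoreInt64(&inFlight, 0)
	}()
	drain(10 * time.Second) // returns after roughly 1s instead of the full 10s
	log.Printf("Exiting. Active connections: %d", atomic.LoadInt64(&inFlight))
}
```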

How Has This Been Tested?

Testing by sending SIGTERM with killall -s SIGTERM of-watchdog performs as expected: clients that had already connected complete their work before termination.

go build && exec_timeout=10500ms mode=streaming \
  write_timeout=10500ms healthcheck_interval=1s \
  fprocess="sleep 10" ./of-watchdog

hey -c 10 -n 100 http://127.0.0.1:8080

curl -s localhost:8081/metrics | grep "^http"

http_request_duration_seconds_count{code="200",method="get"} 20
http_requests_in_flight 10
http_requests_total{code="200",method="get"} 20
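
http_requests_in_flight is the new gauge. As an illustration only, and not necessarily how of-watchdog wires it up, the Prometheus Go client can maintain such a gauge with promhttp.InstrumentHandlerInFlight:

```go
// Sketch: exposing an in-flight request gauge with the Prometheus Go client.
// The handler and port are illustrative, not taken from of-watchdog.
package main

import (
	"log"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var inFlight = promauto.NewGauge(prometheus.GaugeOpts{
	Name: "http_requests_in_flight",
	Help: "Number of requests currently being served.",
})

func main() {
	slow := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		time.Sleep(10 * time.Second) // mimics fprocess="sleep 10"
		w.WriteHeader(http.StatusOK)
	})

	// InstrumentHandlerInFlight increments the gauge when a request starts
	// and decrements it when the wrapped handler returns.
	http.Handle("/", promhttp.InstrumentHandlerInFlight(inFlight, slow))
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```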

2021/11/02 11:36:21 Started logging: stderr from function.
2021/11/02 11:36:21 Took 10.005495 secs
2021/11/02 11:36:21 Running sleep
2021/11/02 11:36:21 Started logging: stderr from function.
No new connections allowed, draining: 10 requests
2021/11/02 11:36:31 Took 10.002719 secs
2021/11/02 11:36:31 Took 10.007865 secs
2021/11/02 11:36:31 Took 10.006346 secs
2021/11/02 11:36:31 Took 10.005582 secs
2021/11/02 11:36:31 Took 10.005962 secs
2021/11/02 11:36:31 Took 10.002506 secs
2021/11/02 11:36:31 Took 10.002615 secs
2021/11/02 11:36:31 Took 10.002946 secs
2021/11/02 11:36:31 Took 10.003310 secs
2021/11/02 11:36:31 Took 10.003440 secs
Exiting. Active connections: 0


alexellis commented 2 years ago

Within a container, using HTTP mode and the Go middleware template, I see lots of context cancellations instead of the above behaviour:

Apparently, this should not happen:

Shutdown gracefully shuts down the server without interrupting any active connections. Shutdown works by first closing all open listeners, then closing all idle connections, and then waiting indefinitely for connections to return to idle and then shut down. If the provided context expires before the shutdown is complete, Shutdown returns the context’s error, otherwise it returns any error returned from closing the Server’s underlying Listener(s).

https://pkg.go.dev/net/http#Server.Shutdown
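
For reference, a typical graceful-shutdown pattern looks like the sketch below (generic Go, not the watchdog's code). Note that the context passed to Shutdown only bounds how long the caller is willing to wait; it does not cancel the contexts of requests that are still being served.

```go
// Sketch: graceful shutdown of an http.Server with a bounded drain window.
// The 2m window stands in for write_timeout and is illustrative.
package main

import (
	"context"
	"log"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	srv := &http.Server{Addr: ":8080", Handler: http.DefaultServeMux}

	go func() {
		if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
			log.Fatal(err)
		}
	}()

	sig := make(chan os.Signal, 1)
	signal.Notify(sig, syscall.SIGTERM)
	<-sig

	// Stop accepting new connections and wait for in-flight requests to
	// finish, for at most two minutes.
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Minute)
	defer cancel()
	if err := srv.Shutdown(ctx); err != nil {
		log.Printf("shutdown: %v", err)
	}
}
```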

kubectl logs -n openfaas-fn deploy/go-long -f
Found 2 pods, using pod/go-long-545bf8dcd6-58w7b
2021/11/02 12:44:32 Version: 0.8.1-14-gf01c5fa-dirty-1635857004 SHA: f01c5fa6c764bc29fae990cc6fa9fed1cb635bda
2021/11/02 12:44:32 Forking: ./handler, arguments: []
2021/11/02 12:44:32 Started logging: stderr from function.
2021/11/02 12:44:32 Started logging: stdout from function.
2021/11/02 12:44:32 Watchdog mode: http
2021/11/02 12:44:32 Timeouts: read: 2m0s write: 2m0s hard: 2m0s health: 5s
2021/11/02 12:44:32 Listening on port: 8080
2021/11/02 12:44:32 Writing lock-file to: /tmp/.lock
2021/11/02 12:44:32 Metrics listening on port: 8081
2021/11/02 12:44:40 Start sleep for: 1m0s
2021/11/02 12:44:40 Start sleep for: 1m0s
2021/11/02 12:44:40 Start sleep for: 1m0s
2021/11/02 12:44:40 Start sleep for: 1m0s
2021/11/02 12:44:40 Start sleep for: 1m0s
2021/11/02 12:44:45 SIGTERM: no new connections in 5s
2021/11/02 12:44:45 [entrypoint] SIGTERM: no connections in: 5s
2021/11/02 12:44:45 Removing lock-file : /tmp/.lock
2021/11/02 12:44:50 No new connections allowed, draining: 5 requests
2021/11/02 12:45:01 Upstream HTTP request error: Get "http://127.0.0.1:8082/": context canceled
2021/11/02 12:45:01 Context error:  context canceled
2021/11/02 12:45:01 Upstream HTTP killed due to exec_timeout: 2m0s
2021/11/02 12:45:01 Upstream HTTP request error: Get "http://127.0.0.1:8082/": context canceled
2021/11/02 12:45:01 Context error:  context canceled
2021/11/02 12:45:01 Upstream HTTP killed due to exec_timeout: 2m0s
2021/11/02 12:45:01 Upstream HTTP request error: Get "http://127.0.0.1:8082/": context canceled
2021/11/02 12:45:01 Context error:  context canceled
2021/11/02 12:45:01 Upstream HTTP killed due to exec_timeout: 2m0s
2021/11/02 12:45:01 Upstream HTTP request error: Get "http://127.0.0.1:8082/": context canceled
2021/11/02 12:45:01 Context error:  context canceled
2021/11/02 12:45:01 Upstream HTTP killed due to exec_timeout: 2m0s
2021/11/02 12:45:01 Upstream HTTP request error: Get "http://127.0.0.1:8082/": context canceled
2021/11/02 12:45:01 Context error:  context canceled
2021/11/02 12:45:01 Upstream HTTP killed due to exec_timeout: 2m0s
2021/11/02 12:45:01 Exiting. Active connections: 0
rpc error: code = NotFound desc = an error occurred when try to find container "7c2ac81501158e5edf30ff4611ecf40d21de9c5004a9fc869bb8695a88a760b7": not found

Something appears to be causing the context on the HTTP request to be cancelled instead of waiting until completion.
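
One plausible mechanism, stated as an assumption about the code path rather than a quote from it: if the watchdog forwards the inbound request's context to the upstream call, then anything that cancels the inbound context surfaces as the upstream errors shown above. A generic sketch of that pattern:

```go
// Sketch of the assumed proxying pattern: the upstream request reuses the
// inbound request's context, so a cancelled inbound context cancels the
// upstream call too. The upstream URL matches the logs; everything else is
// illustrative.
package main

import (
	"io"
	"log"
	"net/http"
)

func proxy(w http.ResponseWriter, r *http.Request) {
	req, err := http.NewRequestWithContext(r.Context(), r.Method, "http://127.0.0.1:8082/", r.Body)
	if err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}
	res, err := http.DefaultClient.Do(req)
	if err != nil {
		// Logged when the inbound context is cancelled mid-flight.
		log.Printf("Upstream HTTP request error: %v", err)
		http.Error(w, err.Error(), http.StatusBadGateway)
		return
	}
	defer res.Body.Close()
	w.WriteHeader(res.StatusCode)
	io.Copy(w, res.Body)
}

func main() {
	http.HandleFunc("/", proxy)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```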

alexellis commented 2 years ago

I can reproduce a similar error with the Go template running outside of a container:

(screenshot: context cancellation errors from the local run)

About 12 seconds after the shutdown starts, something causes those contexts to be cancelled.

go build && exec_timeout=2m upstream_url=http://127.0.0.1:8082 mode=http write_timeout=2m10s healthcheck_interval=2s fprocess="./handler" ./of-watchdog

alexellis commented 2 years ago

The cancellations were caused by the default timeout in hey, which was 10s.

The following got around the problem:

hey -t 120 # 120 seconds

Thanks to @LucasRoesler for asking whether the client was causing the cancellation.
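
In other words, the cancellation originated on the client side: once hey gave up on a request and closed the connection, the server saw r.Context() cancelled, and the upstream call made with that context failed with "context canceled". A small generic sketch of the mechanism (illustrative, not of-watchdog code):

```go
// Sketch: a request's context is cancelled when the client disconnects,
// for example because the client's own timeout fires before the handler
// finishes. Port and durations are illustrative.
package main

import (
	"log"
	"net/http"
	"time"
)

func main() {
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		select {
		case <-time.After(60 * time.Second): // long-running work
			w.WriteHeader(http.StatusOK)
		case <-r.Context().Done():
			// Reached when the client times out and closes the connection.
			log.Printf("Context error: %v", r.Context().Err())
		}
	})
	log.Fatal(http.ListenAndServe(":8082", nil))
}
```

Against a handler like this, a client with a short timeout produces the same "context canceled" logs, while raising the client timeout (as with hey -t 120 above) lets the work run to completion.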