Closed by alexellis 2 years ago
Within a container, using HTTP mode and the Go middleware template, I see lots of context cancellations instead of the expected graceful draining behaviour:
According to the Go documentation, this should not happen:
Shutdown gracefully shuts down the server without interrupting any active connections. Shutdown works by first closing all open listeners, then closing all idle connections, and then waiting indefinitely for connections to return to idle and then shut down. If the provided context expires before the shutdown is complete, Shutdown returns the context’s error, otherwise it returns any error returned from closing the Server’s underlying Listener(s).
https://pkg.go.dev/net/http#Server.Shutdown
kubectl logs -n openfaas-fn deploy/go-long -f
Found 2 pods, using pod/go-long-545bf8dcd6-58w7b
2021/11/02 12:44:32 Version: 0.8.1-14-gf01c5fa-dirty-1635857004 SHA: f01c5fa6c764bc29fae990cc6fa9fed1cb635bda
2021/11/02 12:44:32 Forking: ./handler, arguments: []
2021/11/02 12:44:32 Started logging: stderr from function.
2021/11/02 12:44:32 Started logging: stdout from function.
2021/11/02 12:44:32 Watchdog mode: http
2021/11/02 12:44:32 Timeouts: read: 2m0s write: 2m0s hard: 2m0s health: 5s
2021/11/02 12:44:32 Listening on port: 8080
2021/11/02 12:44:32 Writing lock-file to: /tmp/.lock
2021/11/02 12:44:32 Metrics listening on port: 8081
2021/11/02 12:44:40 Start sleep for: 1m0s
2021/11/02 12:44:40 Start sleep for: 1m0s
2021/11/02 12:44:40 Start sleep for: 1m0s
2021/11/02 12:44:40 Start sleep for: 1m0s
2021/11/02 12:44:40 Start sleep for: 1m0s
2021/11/02 12:44:45 SIGTERM: no new connections in 5s
2021/11/02 12:44:45 [entrypoint] SIGTERM: no connections in: 5s
2021/11/02 12:44:45 Removing lock-file : /tmp/.lock
2021/11/02 12:44:50 No new connections allowed, draining: 5 requests
2021/11/02 12:45:01 Upstream HTTP request error: Get "http://127.0.0.1:8082/": context canceled
2021/11/02 12:45:01 Context error: context canceled
2021/11/02 12:45:01 Upstream HTTP killed due to exec_timeout: 2m0s
2021/11/02 12:45:01 Upstream HTTP request error: Get "http://127.0.0.1:8082/": context canceled
2021/11/02 12:45:01 Context error: context canceled
2021/11/02 12:45:01 Upstream HTTP killed due to exec_timeout: 2m0s
2021/11/02 12:45:01 Upstream HTTP request error: Get "http://127.0.0.1:8082/": context canceled
2021/11/02 12:45:01 Context error: context canceled
2021/11/02 12:45:01 Upstream HTTP killed due to exec_timeout: 2m0s
2021/11/02 12:45:01 Upstream HTTP request error: Get "http://127.0.0.1:8082/": context canceled
2021/11/02 12:45:01 Context error: context canceled
2021/11/02 12:45:01 Upstream HTTP killed due to exec_timeout: 2m0s
2021/11/02 12:45:01 Upstream HTTP request error: Get "http://127.0.0.1:8082/": context canceled
2021/11/02 12:45:01 Context error: context canceled
2021/11/02 12:45:01 Upstream HTTP killed due to exec_timeout: 2m0s
2021/11/02 12:45:01 Exiting. Active connections: 0
rpc error: code = NotFound desc = an error occurred when try to find container "7c2ac81501158e5edf30ff4611ecf40d21de9c5004a9fc869bb8695a88a760b7": not found
Something appears to be causing the context on the HTTP request to get cancelled instead of waiting for completion.
I can reproduce a similar error with the Go template running outside of a container:
12 seconds after the shutdown is started, something causes those contexts to get cancelled.
go build && exec_timeout=2m upstream_url=http://127.0.0.1:8082 mode=http write_timeout=2m10s healthcheck_interval=2s fprocess="./handler" ./of-watchdog
The cancellations were caused by the default request timeout in hey, which was 10s.
The following worked around the problem:
hey -t 120 # 120 seconds
Thanks to @LucasRoesler for asking the question of whether the client was causing a cancellation.
Signed-off-by: Alex Ellis (OpenFaaS Ltd) alexellis2@gmail.com
Description
Speed up shutdown based on the number of in-flight connections
Motivation and Context
Prior to this change, the watchdog would wait for the write_timeout value twice, even if there were no connections in flight.
This adds a new metric for in-flight connections and removes the explicit wait.
This change is a prerequisite for a change in faas-netes https://github.com/openfaas/faas-netes/issues/853
How Has This Been Tested?
Testing with killall of-watchdog -s SIGTERM performs as expected, and clients that were already connected complete their work before termination.
Checklist:
git commit -s