redpanda-data / helm-charts

Redpanda Helm Chart
http://redpanda.com
Apache License 2.0

redpanda: make lifecycle hooks debuggable #1560

Closed chrisseto closed 1 month ago

chrisseto commented 1 month ago

Prior to this commit, debugging issues with our lifecycle hooks was next to impossible. This is primarily because Kubernetes provides little to no output about hooks except when they fail. Our hooks are wrapped with ; true to ensure they never fail, which makes the problem worse.

This commit adds a more complex wrapper around the PostStart and PreStop hooks that sends all hook output to the stdout of the redpanda process, so it appears in kubectl logs with a timestamp and a prefix indicating which hook produced it.
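The wrapping idea can be sketched roughly as follows. This is a minimal illustration, not the chart's actual wrapper: the function name run_hook, its argument order, and the choice to recompute the timestamp for every line are all assumptions made for the example.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper (not the chart's actual code): run a hook command,
# prefix every line it prints, and append the result to a target file.
run_hook() {
  local name="$1" target="$2"
  shift 2
  {
    # Merge stderr into stdout so `bash -x` trace lines are captured too.
    "$@" 2>&1 | while IFS= read -r line || [ -n "$line" ]; do
      printf 'lifecycle-hook %s %s: %s\n' "$(date)" "$name" "$line"
    done
  } >> "$target"
  # Like the existing "; true" suffix: never report failure to Kubernetes.
  true
}
```

Inside a pod, a call might look like run_hook pre-stop /proc/1/fd/1 bash -x /var/lifecycle/preStop.sh. Both the /proc/1/fd/1 target (which assumes redpanda runs as PID 1 in the container) and the script path are assumptions here, not details confirmed by this PR.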

Example output from kubectl logs -f on a terminating node:

INFO  2024-10-10 18:23:02,637 [shard 0:main] cluster - members_table.cc:258 - marking node 2 in maintenance state
INFO  2024-10-10 18:23:02,637 [shard 0:main] cluster - drain_manager.cc:54 - Node draining is starting
INFO  2024-10-10 18:23:02,637 [shard 0:main] cluster - drain_manager.cc:150 - Node draining has started
INFO  2024-10-10 18:23:02,637 [shard 0:main] cluster - drain_manager.cc:183 - Node draining has completed on shard 0
lifecycle-hook Thu Oct 10 18:23:02 UTC 2024 pre-stop: + touch /tmp/preStopHookStarted
lifecycle-hook Thu Oct 10 18:23:02 UTC 2024 pre-stop: + source /var/lifecycle/common.sh
lifecycle-hook Thu Oct 10 18:23:02 UTC 2024 pre-stop: ++ CURL_URL=https://redpanda-2.redpanda.default.svc.cluster.local:9644
lifecycle-hook Thu Oct 10 18:23:02 UTC 2024 pre-stop: ++ CURL_NODE_ID_CMD='curl --silent --fail --cacert /etc/tls/certs/default/ca.crt https://redpanda-2.redpanda.default.svc.cluster.local:9644/v1/node_config'
lifecycle-hook Thu Oct 10 18:23:02 UTC 2024 pre-stop: ++ CURL_MAINTENANCE_DELETE_CMD_PREFIX='curl -X DELETE --silent -o /dev/null -w "%{http_code}"'
lifecycle-hook Thu Oct 10 18:23:02 UTC 2024 pre-stop: ++ CURL_MAINTENANCE_PUT_CMD_PREFIX='curl -X PUT --silent -o /dev/null -w "%{http_code}"'
lifecycle-hook Thu Oct 10 18:23:02 UTC 2024 pre-stop: ++ CURL_MAINTENANCE_GET_CMD='curl -X GET --silent --cacert /etc/tls/certs/default/ca.crt https://redpanda-2.redpanda.default.svc.cluster.local:9644/v1/maintenance'
lifecycle-hook Thu Oct 10 18:23:02 UTC 2024 pre-stop: + set -x
lifecycle-hook Thu Oct 10 18:23:02 UTC 2024 pre-stop: + preStopHook
lifecycle-hook Thu Oct 10 18:23:02 UTC 2024 pre-stop: ++ curl --silent --fail --cacert /etc/tls/certs/default/ca.crt https://redpanda-2.redpanda.default.svc.cluster.local:9644/v1/node_config
lifecycle-hook Thu Oct 10 18:23:02 UTC 2024 pre-stop: ++ grep -o '\"node_id\":[^,}]*'
lifecycle-hook Thu Oct 10 18:23:02 UTC 2024 pre-stop: ++ grep -o '[^: ]*$'
lifecycle-hook Thu Oct 10 18:23:02 UTC 2024 pre-stop: + NODE_ID=2
lifecycle-hook Thu Oct 10 18:23:02 UTC 2024 pre-stop: + echo 'Setting maintenance mode on node 2'
lifecycle-hook Thu Oct 10 18:23:02 UTC 2024 pre-stop: Setting maintenance mode on node 2
lifecycle-hook Thu Oct 10 18:23:02 UTC 2024 pre-stop: + CURL_MAINTENANCE_PUT_CMD='curl -X PUT --silent -o /dev/null -w "%{http_code}" --cacert /etc/tls/certs/default/ca.crt https://redpanda-2.redpanda.default.svc.cluster.local:9644/v1/brokers/2/maintenance'
lifecycle-hook Thu Oct 10 18:23:02 UTC 2024 pre-stop: + '[' '' = '"200"' ']'
lifecycle-hook Thu Oct 10 18:23:02 UTC 2024 pre-stop: ++ curl -X PUT --silent -o /dev/null -w '"%{http_code}"' --cacert /etc/tls/certs/default/ca.crt https://redpanda-2.redpanda.default.svc.cluster.local:9644/v1/brokers/2/maintenance
lifecycle-hook Thu Oct 10 18:23:02 UTC 2024 pre-stop: + status='"200"'
lifecycle-hook Thu Oct 10 18:23:02 UTC 2024 pre-stop: + sleep 0.5
lifecycle-hook Thu Oct 10 18:23:02 UTC 2024 pre-stop: + '[' '"200"' = '"200"' ']'
lifecycle-hook Thu Oct 10 18:23:02 UTC 2024 pre-stop: + '[' '' = true ']'
lifecycle-hook Thu Oct 10 18:23:02 UTC 2024 pre-stop: + '[' '' = false ']'
lifecycle-hook Thu Oct 10 18:23:02 UTC 2024 pre-stop: ++ curl -X GET --silent --cacert /etc/tls/certs/default/ca.crt https://redpanda-2.redpanda.default.svc.cluster.local:9644/v1/maintenance
lifecycle-hook Thu Oct 10 18:23:02 UTC 2024 pre-stop: + res='{"draining": true, "finished": true, "errors": false, "partitions": 2, "eligible": 0}'
lifecycle-hook Thu Oct 10 18:23:02 UTC 2024 pre-stop: ++ echo '{"draining":' true, '"finished":' true, '"errors":' false, '"partitions":' 2, '"eligible":' '0}'
lifecycle-hook Thu Oct 10 18:23:02 UTC 2024 pre-stop: ++ grep -o '\"finished\":[^,}]*'
lifecycle-hook Thu Oct 10 18:23:02 UTC 2024 pre-stop: ++ grep -o '[^: ]*$'
lifecycle-hook Thu Oct 10 18:23:02 UTC 2024 pre-stop: + finished=true
lifecycle-hook Thu Oct 10 18:23:02 UTC 2024 pre-stop: ++ echo '{"draining":' true, '"finished":' true, '"errors":' false, '"partitions":' 2, '"eligible":' '0}'
lifecycle-hook Thu Oct 10 18:23:02 UTC 2024 pre-stop: ++ grep -o '\"draining\":[^,}]*'
lifecycle-hook Thu Oct 10 18:23:02 UTC 2024 pre-stop: ++ grep -o '[^: ]*$'
lifecycle-hook Thu Oct 10 18:23:02 UTC 2024 pre-stop: + draining=true
lifecycle-hook Thu Oct 10 18:23:02 UTC 2024 pre-stop: + sleep 0.5
lifecycle-hook Thu Oct 10 18:23:02 UTC 2024 pre-stop: + '[' true = true ']'
lifecycle-hook Thu Oct 10 18:23:02 UTC 2024 pre-stop: + touch /tmp/preStopHookFinished
lifecycle-hook Thu Oct 10 18:23:02 UTC 2024 pre-stop: + true
INFO  2024-10-10 18:23:03,400 [shard 0:main] main - application.cc:466 - Stopping...

chrisseto commented 1 month ago

Ah! Good catch. I thought that logic was handled by the config watcher 🤔

Also, isn't the syntax for {PASSWORD} incorrect?

chrisseto commented 1 month ago

I've removed the referenced part of the PostStart hook and confirmed that SASL still works in both a freshly created cluster and an upgraded one.