splunk / docker-splunk

Splunk Docker GitHub Repository
Using checkstate.sh as liveness probe leads to failing HEC connections #561

Open reg0bs opened 1 year ago

reg0bs commented 1 year ago

We run docker-splunk in k8s and therefore use checkstate.sh as the liveness probe. The problem is that checkstate.sh runs the following command to check whether Splunk is still running: curl --max-time 30 --fail --insecure $scheme://localhost:8089/

So it only checks whether splunkd is still answering on port 8089. But this is probably the endpoint that stays available until the very last second while Splunk is shutting down, so Splunk Web, HECs, receivers, and so on are all already gone while this endpoint still returns 200. As long as it returns 200, the LoadBalancer or something like ingress-nginx will happily keep sending traffic to the endpoint, leading to timeouts and broken connections.
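One way to confirm this suspicion is to poll both endpoints while Splunk shuts down and watch which one disappears first. The following is only a rough sketch: the function names are made up for illustration, and it assumes HEC listens on its default port 8088 over https and exposes the /services/collector/health endpoint.

```shell
# Hypothetical helper (not part of docker-splunk): prints "up" if the URL
# answers without an HTTP error, otherwise "down".
endpoint_state() {
    if curl --max-time 5 --fail --insecure --silent --output /dev/null "$1"; then
        echo up
    else
        echo down
    fi
}

# Hypothetical helper: sample both endpoints once per second.
# $1 = number of samples (default 30).
# Assumption: HEC on port 8088 over https; adjust to your setup.
watch_shutdown() {
    samples="${1:-30}"
    i=0
    while [ "$i" -lt "$samples" ]; do
        echo "$(date +%T) splunkd=$(endpoint_state "https://localhost:8089/") hec=$(endpoint_state "https://localhost:8088/services/collector/health")"
        i=$((i + 1))
        sleep 1
    done
}
```

Running watch_shutdown 60 inside the container and then stopping Splunk should show whether there is a window with splunkd=up and hec=down.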

My proposal to fix this would be to apply the following logic in checkstate.sh:

  1. Check if there are HECs and receivers running
  2. If so, assess the liveness of the container based on the response of these ports and not 8089
  3. If not, stay with the current check and see if 8089 is still available
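The steps above could look roughly like the sketch below. This is not the actual checkstate.sh. It assumes the expected ports are passed in through the hypothetical environment variables HEC_PORT, HEC_SCHEME and RECEIVER_PORT, because during shutdown the script cannot tell a port that is "not configured" from one that is "already gone". It also assumes that nc with -z support is available in the image and that HEC exposes /services/collector/health.

```shell
# Hypothetical helper: succeeds if the URL answers without an HTTP error.
http_ok() {
    curl --max-time 30 --fail --insecure --silent --output /dev/null "$1"
}

# Hypothetical helper: succeeds if localhost:$1 accepts a TCP connection.
# Assumption: the image ships an nc that supports -z and -w.
port_open() {
    nc -z -w 5 localhost "$1" >/dev/null 2>&1
}

# Returns 0 if the container should be considered alive, 1 otherwise.
# HEC_PORT, HEC_SCHEME and RECEIVER_PORT are assumed names, not existing
# docker-splunk settings.
liveness_check() {
    checked=0
    # Steps 1 and 2: if a HEC is expected, judge liveness by its health endpoint.
    if [ -n "${HEC_PORT:-}" ]; then
        checked=1
        http_ok "${HEC_SCHEME:-https}://localhost:${HEC_PORT}/services/collector/health" || return 1
    fi
    # Steps 1 and 2: if a receiver is expected, its port must accept connections.
    if [ -n "${RECEIVER_PORT:-}" ]; then
        checked=1
        port_open "$RECEIVER_PORT" || return 1
    fi
    # Step 3: nothing else to check, so fall back to the current 8089 check.
    if [ "$checked" -eq 0 ]; then
        http_ok "${scheme:-https}://localhost:8089/" || return 1
    fi
    return 0
}
```

With HEC_PORT=8088 and RECEIVER_PORT=9997 set, liveness_check fails as soon as either port stops responding, even if 8089 is still up. With neither set, it behaves like the current check.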

There may be even better ways to achieve this. Maybe someone has an idea?

If we agree on a fix, I would be happy to create an MR to solve this.