We recently (since upgrading from Splunk 8.0 to 8.1.0.1) experienced an issue with the liveness probe failing before the startup Ansible playbook completes on the cluster-master pod.
The liveness probe configuration gives the pod 6 minutes to start. On some of our multisite clusters, the startup playbook triggers 4 Splunk restarts on the cluster-master pod, each taking approximately 50s, so startup was taking about 7 minutes to complete, causing Kubernetes to reschedule the pod. We've patched the operator to extend the initialDelaySeconds to 450 to work around this issue.
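For illustration, a minimal sketch of the kind of liveness probe involved on the cluster-master pod spec; the probe command and the other timing values are assumptions, only the extended initialDelaySeconds of 450 comes from our workaround:

```yaml
# Sketch of the patched liveness probe on the cluster-master pod.
# The command and all values except initialDelaySeconds are assumptions.
livenessProbe:
  exec:
    command:
      - /sbin/checkstate.sh      # assumed health-check script
  initialDelaySeconds: 450       # extended so the probe outlasts the startup playbook
  timeoutSeconds: 30
  periodSeconds: 30
  failureThreshold: 3
```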
Several actions could be considered:
- Extend the initialDelaySeconds of the livenessProbe. Note: the risk this introduces (a pod that stops responding after startup might not be rescheduled for several minutes) could be limited by using a startupProbe instead of extending the delay, but startupProbes are only supported from Kubernetes 1.16, and we still have clusters running older versions (see the startupProbe sketch after this list).
- Review the Ansible playbooks to reduce the number of restarts.
- Make the playbook idempotent, so that it does not rerun all the steps (and trigger all the Splunk restarts) at every startup when the configuration is already in place on persistent storage. At a minimum this would let the playbook complete faster on a second attempt; in our case it reruns the same steps over and over and always takes 7 minutes to complete (a sketch of such a guard follows this list).
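As a sketch of the startupProbe alternative mentioned in the first item (the probe command and thresholds are assumptions, not values taken from the operator): a startupProbe suppresses the liveness probe until it succeeds once, so a slow first boot gets a long budget while an unresponsive pod is still detected quickly afterwards.

```yaml
# Sketch: startupProbe covering a slow first boot (Kubernetes 1.16+).
# The command and numbers below are assumptions for illustration.
startupProbe:
  exec:
    command:
      - /sbin/checkstate.sh
  periodSeconds: 30
  failureThreshold: 20        # up to 20 * 30s = 10 minutes for the playbook to finish
livenessProbe:
  exec:
    command:
      - /sbin/checkstate.sh
  periodSeconds: 30
  failureThreshold: 3         # applies only after the startupProbe has succeeded
```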
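And a minimal sketch of the idempotency idea from the last item, assuming a hypothetical marker file on the persistent volume; the paths, task names and handler are illustrative, not taken from the actual splunk-ansible playbooks:

```yaml
# Sketch (tasks excerpt): skip configuration work, and the restart it notifies,
# when a marker on persistent storage shows it was already done.
# File paths, task names and the "restart splunk" handler are hypothetical.
- name: Check whether multisite configuration was already applied
  stat:
    path: /opt/splunk/etc/.multisite_configured
  register: multisite_marker

- name: Apply multisite configuration
  command: /opt/splunk/bin/splunk edit cluster-config -mode master -multisite true
  when: not multisite_marker.stat.exists
  notify: restart splunk      # handler assumed to be defined elsewhere

- name: Record that multisite configuration is in place
  file:
    path: /opt/splunk/etc/.multisite_configured
    state: touch
  when: not multisite_marker.stat.exists
```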
Hi @romain-bellanger, we have made the livenessProbe and readinessProbe configurable in the latest release. Could you please check and let us know if we can close this issue?
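For reference, a sketch of what that configuration could look like in the ClusterMaster custom resource; the exact spec field names should be confirmed against the release notes, the ones below are assumptions:

```yaml
# Sketch only: apiVersion and probe-related field names are assumptions,
# check the operator release notes for the actual spec keys.
apiVersion: enterprise.splunk.com/v1
kind: ClusterMaster
metadata:
  name: example-cm
spec:
  livenessInitialDelaySeconds: 450     # assumed field: headroom for the startup playbook
  readinessInitialDelaySeconds: 10     # assumed field
```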