We recently (since upgrading from Splunk 8.0 to 8.1.0.1) experienced an issue with the liveness probe failing before the startup Ansible playbook completes on the cluster-master pod.
The liveness probe configuration gives the pod 6 minutes to start. On some of our multisite clusters, the startup playbook triggers 4 Splunk restarts on the cluster-master pod, each taking approximately 50s, so startup was taking about 7 minutes to complete, causing Kubernetes to reschedule the pod. We've patched the operator to extend the initialDelaySeconds to 450 to work around this issue.
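For illustration, a minimal sketch of the kind of liveness probe involved on the cluster-master pod spec; the probe command and the other timing values are assumptions, only the extended initialDelaySeconds of 450 comes from our workaround:

```yaml
# Sketch of the patched liveness probe on the cluster-master pod.
# The command and all values except initialDelaySeconds are assumptions.
livenessProbe:
  exec:
    command:
      - /sbin/checkstate.sh      # assumed health-check script
  initialDelaySeconds: 450       # extended so the probe outlasts the startup playbook
  timeoutSeconds: 30
  periodSeconds: 30
  failureThreshold: 3
```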
Several actions could be considered:
- Extend the initialDelaySeconds of the livenessProbe. Note: the risk this introduces (a pod that stops responding after startup might not be rescheduled for several minutes) could be limited by using a startupProbe instead of extending the delay, but startupProbes are only supported from Kubernetes 1.16, and we still have clusters running older versions (see the startupProbe sketch after this list).
- Review the Ansible playbooks to reduce the number of restarts.
- Make the playbook idempotent, so that it does not rerun all the steps (and trigger all the Splunk restarts) at every startup when the configuration is already in place on persistent storage. At a minimum this would let the playbook complete faster on a second attempt; in our case it reruns the same steps over and over and always takes 7 minutes to complete (a sketch of such a guard follows this list).
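As a sketch of the startupProbe alternative mentioned in the first item (the probe command and thresholds are assumptions, not values taken from the operator): a startupProbe suppresses the liveness probe until it succeeds once, so a slow first boot gets a long budget while an unresponsive pod is still detected quickly afterwards.

```yaml
# Sketch: startupProbe covering a slow first boot (Kubernetes 1.16+).
# The command and numbers below are assumptions for illustration.
startupProbe:
  exec:
    command:
      - /sbin/checkstate.sh
  periodSeconds: 30
  failureThreshold: 20        # up to 20 * 30s = 10 minutes for the playbook to finish
livenessProbe:
  exec:
    command:
      - /sbin/checkstate.sh
  periodSeconds: 30
  failureThreshold: 3         # applies only after the startupProbe has succeeded
```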
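And a minimal sketch of the idempotency idea from the last item, assuming a hypothetical marker file on the persistent volume; the paths, task names and handler are illustrative, not taken from the actual splunk-ansible playbooks:

```yaml
# Sketch (tasks excerpt): skip configuration work, and the restart it notifies,
# when a marker on persistent storage shows it was already done.
# File paths, task names and the "restart splunk" handler are hypothetical.
- name: Check whether multisite configuration was already applied
  stat:
    path: /opt/splunk/etc/.multisite_configured
  register: multisite_marker

- name: Apply multisite configuration
  command: /opt/splunk/bin/splunk edit cluster-config -mode master -multisite true
  when: not multisite_marker.stat.exists
  notify: restart splunk      # handler assumed to be defined elsewhere

- name: Record that multisite configuration is in place
  file:
    path: /opt/splunk/etc/.multisite_configured
    state: touch
  when: not multisite_marker.stat.exists
```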
Hi @romain-bellanger, we have made the livenessProbe and readinessProbe configurable in the latest release. Could you please check and let us know if we can close this issue?
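For reference, a sketch of what that configuration could look like in the ClusterMaster custom resource; the exact spec field names should be confirmed against the release notes, the ones below are assumptions:

```yaml
# Sketch only: apiVersion and probe-related field names are assumptions,
# check the operator release notes for the actual spec keys.
apiVersion: enterprise.splunk.com/v1
kind: ClusterMaster
metadata:
  name: example-cm
spec:
  livenessInitialDelaySeconds: 450     # assumed field: headroom for the startup playbook
  readinessInitialDelaySeconds: 10     # assumed field
```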