Add probes for cache statefulsets

Sabian-A commented 4 weeks ago

Is your feature request related to a problem? Please describe.

I've encountered an operational challenge in efficiently monitoring and ensuring the health of the cache StatefulSets (results, chunks, index, metadata). Currently, there's no straightforward way to configure Kubernetes readiness and liveness probes through the Helm chart. This makes it difficult to automatically manage the health and readiness of our deployed services, which affects our ability to reliably scale and maintain our infrastructure.

Describe the solution you'd like

I would like the Helm chart for the cache StatefulSets to support configuration options that enable the easy integration of Kubernetes probes.
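
For illustration, a values override along these lines would cover my use case. This is only a sketch: the `readinessProbe` key under each cache block is hypothetical (it is the option being requested, not something the chart exposes today), the cache block names are assumed to match the chart's values, and Memcached is assumed to listen on its default port 11211.

```yaml
# Hypothetical values.yaml override.
# `readinessProbe` is the requested option, NOT an existing chart setting.
chunks-cache:
  readinessProbe:
    tcpSocket:
      port: 11211       # Memcached default port (assumed)
    periodSeconds: 10
results-cache:
  readinessProbe:
    tcpSocket:
      port: 11211
    periodSeconds: 10
```

The same block would repeat for the index and metadata caches.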

Describe alternatives you've considered

One alternative was to manually modify the deployment templates after Helm generates them to include these probes, but this approach is neither maintainable nor scalable because it bypasses the advantages of using Helm for deployment configuration. Using pre-hooks in Helm to modify deployments post-deployment was also considered but dismissed for the same reasons.

56quarters commented 3 weeks ago

Which Kubernetes checks specifically do you want the caches to have and why?

In my experience they're not useful for Memcached:

Readiness check: Memcached starts in a fraction of a second.
Startup check: Again not useful because Memcached starts in a fraction of a second.
Liveness check: I don't see how this would be useful for Memcached, and liveness checks in general seem like a huge liability with the potential to make minor issues worse.

Sabian-A commented 3 weeks ago

@56quarters Thank you for getting back to me! I specifically need the Readiness check because, in our GKE setup, the Pod Disruption Budget (PDB) relies on the readiness probe to determine if a pod is healthy and can handle traffic. Without the readiness check, the PDB has limited visibility into the pod’s true health status, only knowing that it’s up and running but not if it’s fully ready to serve requests.

56quarters commented 3 weeks ago

OK, that sounds reasonable. Feel free to open a PR to add a TCP readiness probe to the Memcached instances in the Helm chart. Thanks!
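
For reference, a TCP readiness probe on a Memcached container could look roughly like the fragment below. It is a minimal sketch, not the chart's actual template: the container and port names are illustrative, the timing values are arbitrary, and it assumes Memcached listens on its default port 11211.

```yaml
# Illustrative fragment of a Memcached container spec in a StatefulSet pod template.
# Names and timings are assumptions, not taken from the chart.
containers:
  - name: memcached
    ports:
      - name: client
        containerPort: 11211    # Memcached default port (assumed)
    readinessProbe:
      tcpSocket:
        port: client            # probe succeeds once a TCP connection can be opened
      initialDelaySeconds: 1
      periodSeconds: 5
      timeoutSeconds: 1
      failureThreshold: 3       # marked not ready after 3 consecutive failures
```

A TCP check only confirms that the port accepts connections; it does not verify that Memcached is answering commands.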

Sabian-A commented 4 days ago

@56quarters https://github.com/grafana/mimir/pull/9990

56quarters commented 2 hours ago

Closing this issue for the reasons detailed in this PR, thanks. https://github.com/grafana/mimir/pull/9990#pullrequestreview-2461707896
