Open nullren opened 2 weeks ago
conceptually this could definitely be something that exists in kubernetes directly because the pattern of "allowing zonal disruptions" is not unique to mimir. eg, an elasticsearch cluster that has documents replicated across "zones" would benefit from this same controller...
perhaps this is something the https://github.com/grafana/rollout-operator could manage?
In Mimir 2.14 the team added support for putting ingesters into a read-only mode (docs). The documentation on scaling ingesters down was also updated to describe the mechanics of multi-zone deployments (docs). Would this help with what you outlined in the issue?
> Perhaps this is something the https://github.com/grafana/rollout-operator could manage?
The mechanics outlined in the documentation are, indeed, supported by the rollout-operator. We have it codified in jsonnet. That's what we run internally at Grafana Labs.
Is your feature request related to a problem? Please describe.
When deploying Mimir to K8s, Pod Disruption Budgets (PDBs) are created for some pod types (distributors, ingesters, etc.). However, they tend to be too restrictive; I think they allow only 1 disruption.
Because metrics are replicated across zones, more disruptions than that could be tolerated safely, but there isn't a clear way to define a PDB that expresses this.
Describe the solution you'd like
It would be nice if there was some way to have a "high level PDB" where zones can be disrupted. A "zone" would be "healthy" or "up" if all pods in that zone are healthy/up. So, a disrupted zone would be one where at least 1 pod is unhealthy.
So, what that might enable is something like a "ZDB" (zone disruption budget) with a rule that a majority of zones must stay available/undisrupted. This would allow you to disrupt a single zone (eg, all pods in that zone). It would speed up draining k8s nodes, since with three zones you can safely disrupt 1/3 of all pods at once, which is really important/helpful when running many pods.
This might be accomplished via some sort of controller/operator.
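To make the proposed rule concrete, here is a minimal sketch of the check such a controller might run before allowing an eviction. Everything in it is hypothetical: the `Pod` type, the `can_evict` function, and the zone names are illustrative and not part of Kubernetes, Mimir, or the rollout-operator. It assumes a zone counts as disrupted as soon as one of its pods is unhealthy, as described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pod:
    # Hypothetical, simplified view of a pod: name, zone label, readiness.
    name: str
    zone: str
    healthy: bool


def disrupted_zones(pods):
    """A zone is disrupted if at least one of its pods is unhealthy."""
    return {p.zone for p in pods if not p.healthy}


def can_evict(pods, target_name, min_healthy_zones):
    """Return True if evicting the target pod still leaves at least
    `min_healthy_zones` zones with every pod healthy."""
    by_name = {p.name: p for p in pods}
    target = by_name[target_name]  # KeyError if the pod is unknown
    all_zones = {p.zone for p in pods}
    # Evicting the target disrupts its zone (if it was not already).
    after = disrupted_zones(pods) | {target.zone}
    return len(all_zones) - len(after) >= min_healthy_zones


# Three zones, two pods each; zone-a already has one pod down.
pods = [
    Pod("ingester-zone-a-0", "zone-a", healthy=False),
    Pod("ingester-zone-a-1", "zone-a", healthy=True),
    Pod("ingester-zone-b-0", "zone-b", healthy=True),
    Pod("ingester-zone-b-1", "zone-b", healthy=True),
    Pod("ingester-zone-c-0", "zone-c", healthy=True),
    Pod("ingester-zone-c-1", "zone-c", healthy=True),
]
majority = len({p.zone for p in pods}) // 2 + 1  # 2 of 3 zones

same_zone_ok = can_evict(pods, "ingester-zone-a-1", majority)
other_zone_ok = can_evict(pods, "ingester-zone-b-0", majority)
print(same_zone_ok, other_zone_ok)
```

With zone-a already disrupted, evicting the rest of zone-a is allowed, while touching zone-b is refused because only one zone would remain fully healthy. A real controller would also have to handle races between concurrent evictions, which this sketch ignores.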
For example, we have a cluster with 420 ingester pods. With a PDB that allows only 1 pod to be disrupted, we can drain at most 1 k8s node at a time, when this could be done much more quickly (and safely).
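As rough arithmetic for the 420-pod example, the sketch below compares how many sequential eviction rounds are needed under the two budgets. It assumes three evenly sized zones, which the issue implies ("1/3 total pods") but does not state, and the helper name `eviction_rounds` is made up for illustration.

```python
import math


def eviction_rounds(total_pods, max_parallel):
    """Number of sequential rounds needed to restart every pod when at
    most `max_parallel` pods may be disrupted at the same time."""
    return math.ceil(total_pods / max_parallel)


TOTAL_PODS = 420
ZONES = 3  # assumption: three zones of equal size
pods_per_zone = TOTAL_PODS // ZONES

rounds_pod_level = eviction_rounds(TOTAL_PODS, 1)
rounds_zone_level = eviction_rounds(TOTAL_PODS, pods_per_zone)
print(pods_per_zone, rounds_pod_level, rounds_zone_level)
```

Under these assumptions a one-pod budget needs 420 rounds, while a one-zone budget needs 3 rounds of 140 pods each.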
Describe alternatives you've considered
This might be something we'll have to create ourselves because (ironically) it's very disruptive.