grafana / rollout-operator

Kubernetes Rollout Operator
Apache License 2.0

Allow for graceful delete through a pod label #163

Closed aallawala closed 3 months ago

aallawala commented 4 months ago

Summary

This PR introduces a new pod-level label that any external operator can add in order to gracefully terminate a pod without causing unavailability for Mimir across a set of StatefulSets (ingesters or store-gateways).

I've chosen to check for labeled pods only when no pods require a rotation due to a StatefulSet deploy, so for a given StatefulSet, rollouts always take priority over label-triggered deletions.

I also chose a label rather than the annotation proposed in the original issue so that the existing selector code paths can be reused; in my experience, labels are the more idiomatic way to select Kubernetes resources.

I'm happy to explore annotations if there's a strong desire to go that way; it would just mean listing all the pods and filtering them in code rather than through the Kubernetes API.
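To illustrate the intended behavior, here is a minimal sketch in Go using client-go. It is not the PR's actual code: `podsNeedRollout` is a hypothetical stand-in for the existing rollout check, and only the `grafana.com/graceful-shutdown=delete-pod` label (used in the testing below) is taken from this PR.

```go
package main

import (
	"context"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// podsNeedRollout stands in for the existing check that decides whether a
// StatefulSet still has pods running an outdated revision.
func podsNeedRollout(statefulSet string) bool { return false }

// reconcileGracefulDeletes lists pods carrying the graceful-shutdown label via
// a server-side label selector and deletes them only when no rollout is
// pending, so rollouts keep priority over label-triggered deletions.
func reconcileGracefulDeletes(ctx context.Context, client kubernetes.Interface, namespace, statefulSet string) error {
	if podsNeedRollout(statefulSet) {
		return nil // a rollout is in progress; handle it first
	}

	pods, err := client.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{
		LabelSelector: "grafana.com/graceful-shutdown=delete-pod",
	})
	if err != nil {
		return err
	}

	for _, pod := range pods.Items {
		log.Printf("terminating pod %s", pod.Name)
		if err := client.CoreV1().Pods(namespace).Delete(ctx, pod.Name, metav1.DeleteOptions{}); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Namespace and StatefulSet names match the manual test below.
	if err := reconcileGracefulDeletes(context.Background(), client, "test-2", "ingester-zone-b-test-2"); err != nil {
		log.Fatal(err)
	}
}
```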

I'm opening this PR for feedback. PTAL and let me know what changes I can make. Thank you.

Testing

I added three unit tests for three different behaviors:

  1. should delete pods that are labeled for deletion when all statefulsets are up-to-date
  2. should delete pods that are labeled for deletion for an up-to-date statefulset during rollout
  3. should ignore labeled pods and delete those that are not running up-to-date revision during rollout

I also tested this in our internal infrastructure:

Command issued:

$ kubectl --context mesh-obs-a-ea1-us -n test-2 label pods ingester-zone-b-test-2-0 grafana.com/graceful-shutdown=delete-pod 
pod/ingester-zone-b-test-2-0 labeled

Logs:

level=info ts=2024-08-02T04:42:44.800730171Z msg="reconcile started"
level=debug ts=2024-08-02T04:42:44.801069706Z msg="reconciling StatefulSet" statefulset=ingester-zone-a-test-2
level=debug ts=2024-08-02T04:42:44.80109812Z msg="reconciling StatefulSet" statefulset=ingester-zone-b-test-2
level=debug ts=2024-08-02T04:42:44.801116072Z msg="found 1 pod(s) marked for graceful termination"
level=info ts=2024-08-02T04:42:44.801120419Z msg="terminating pod ingester-zone-b-test-2-0"
level=debug ts=2024-08-02T04:42:44.812272065Z msg="reconciling StatefulSet" statefulset=store-gtwy-zone-a-test-2
level=debug ts=2024-08-02T04:42:44.81230467Z msg="reconciling StatefulSet" statefulset=store-gtwy-zone-b-test-2
level=debug ts=2024-08-02T04:42:44.812323907Z msg="reconciling StatefulSet" statefulset=store-gtwy-zone-c-test-2
level=info ts=2024-08-02T04:42:44.812344139Z msg="reconcile done"
level=info ts=2024-08-02T04:42:49.812925318Z msg="reconcile started"
level=info ts=2024-08-02T04:42:49.8132528Z msg="StatefulSet status is reporting all pods ready, but the rollout operator has found some not-Ready pods" statefulset=ingester-zone-b-test-2 not_ready_pods=ingester-zone-b-test-2-0
level=info ts=2024-08-02T04:42:49.813265063Z msg="a StatefulSet has some not-Ready pods, reconcile it first" statefulset=ingester-zone-b-test-2
level=debug ts=2024-08-02T04:42:49.813267779Z msg="reconciling StatefulSet" statefulset=ingester-zone-b-test-2
level=debug ts=2024-08-02T04:42:49.813291663Z msg="found 1 pod(s) marked for graceful termination"
level=debug ts=2024-08-02T04:42:49.813295854Z msg="waiting for pod ingester-zone-b-test-2-0 to be terminated"
level=debug ts=2024-08-02T04:42:49.813316609Z msg="reconciling StatefulSet" statefulset=store-gtwy-zone-a-test-2
level=debug ts=2024-08-02T04:42:49.81333483Z msg="reconciling StatefulSet" statefulset=store-gtwy-zone-b-test-2
level=debug ts=2024-08-02T04:42:49.81335237Z msg="reconciling StatefulSet" statefulset=store-gtwy-zone-c-test-2
level=info ts=2024-08-02T04:42:49.813369711Z msg="reconcile done"
level=info ts=2024-08-02T04:43:09.816083005Z msg="reconcile started"
level=debug ts=2024-08-02T04:43:09.816436732Z msg="reconciling StatefulSet" statefulset=store-gtwy-zone-a-test-2
level=debug ts=2024-08-02T04:43:09.816476657Z msg="reconciling StatefulSet" statefulset=store-gtwy-zone-b-test-2
level=debug ts=2024-08-02T04:43:09.816496851Z msg="reconciling StatefulSet" statefulset=store-gtwy-zone-c-test-2
level=info ts=2024-08-02T04:43:09.816525269Z msg="a StatefulSet has some not-Ready pods, reconcile it first" statefulset=ingester-zone-b-test-2
level=debug ts=2024-08-02T04:43:09.816527614Z msg="reconciling StatefulSet" statefulset=ingester-zone-b-test-2
level=debug ts=2024-08-02T04:43:09.816543425Z msg="found 1 pod(s) marked for graceful termination"
level=info ts=2024-08-02T04:43:09.81654687Z msg="StatefulSet has some pods to be updated but maxUnavailable pods has been reached" statefulset=ingester-zone-b-test-2 pods_to_update=1 replicas=1 ready_replicas=0 max_unavailable=1
level=info ts=2024-08-02T04:43:09.816553377Z msg="reconcile done"

Fixes

https://github.com/grafana/rollout-operator/issues/148

Reviewers

@pracucci @rishabhkumar92

charleskorn commented 3 months ago

Thanks for the PR @aallawala!

I am wondering if a different approach would solve this problem - not just for your scenario but for all scenarios where a PDB is inadequate to describe what level of disruption a workload can tolerate. For example, we have a similar issue at Grafana Labs when performing Kubernetes node pool upgrades, where the PDB only allows us to stop a single pod from a workload at a time, even if it would be safe to stop multiple pods at once.

My idea is this: instead of using annotations and the rollout-operator's reconciliation loop to delete pods on demand, what if we instead implemented a validating webhook that runs on pod eviction requests for pods managed by rollout-operator? This would check if it is safe for that pod to be evicted, based on the rules that rollout-operator enforces for rollouts, and block the eviction if it is not safe to stop the pod right now.

This would effectively give us a custom PDB that is aware of the rules that rollout-operator enforces, and this means that tooling like kubectl drain will respect our desired behaviour. And, assuming the tooling you describe in https://github.com/grafana/rollout-operator/issues/148 uses the eviction API to drain nodes before recycling them, it would work for your scenario as well.
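To make the idea concrete, a rough, hypothetical sketch of such a webhook handler is below; the handler name and the `isSafeToEvict` check are assumptions, not part of rollout-operator or this PR, and a real admission webhook would also need TLS and proper registration.

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"

	admissionv1 "k8s.io/api/admission/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// isSafeToEvict is a placeholder for the rollout rules, e.g. "no other pod in
// this rollout group is currently unavailable".
func isSafeToEvict(namespace, pod string) bool { return true }

// handleEviction validates requests on the pods/eviction subresource and
// denies them when evicting the pod would violate the rollout rules.
func handleEviction(w http.ResponseWriter, r *http.Request) {
	var review admissionv1.AdmissionReview
	if err := json.NewDecoder(r.Body).Decode(&review); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	if review.Request == nil {
		http.Error(w, "missing admission request", http.StatusBadRequest)
		return
	}

	allowed := isSafeToEvict(review.Request.Namespace, review.Request.Name)
	review.Response = &admissionv1.AdmissionResponse{
		UID:     review.Request.UID,
		Allowed: allowed,
	}
	if !allowed {
		review.Response.Result = &metav1.Status{
			Message: "eviction would exceed the disruption allowed by rollout-operator",
		}
	}
	_ = json.NewEncoder(w).Encode(review)
}

func main() {
	http.HandleFunc("/validate-eviction", handleEviction)
	// Plain HTTP only to keep the sketch short; admission webhooks require TLS.
	log.Fatal(http.ListenAndServe(":8443", nil))
}
```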

(This Stack Overflow answer has a good overview of the difference between pod deletion and eviction requests.)

How does that sound?

aallawala commented 3 months ago

Hi @charleskorn, thanks so much for taking a look!

I'm familiar with eviction requests; that's actually what is causing us issues, since the vanilla Evict API honors the configured PDB.

Our internal k8s infra layer has a ValidatingWebhook that intercepts Evict API calls and will either annotate or label a pod. Once annotated or labeled, it's up to an external control loop to handle the eviction, hence this PR. My approach is to integrate with this internal Evict API receiver and allow another control loop (in this instance, the rollout-operator) to perform the evictions safely for us.

I like your idea a lot, but that will certainly expand the scope of this change and I may not be able to get to it right away. Is this something you can take on? I'd also like to consult with our team internally to determine whether this is feasible for us.

For instance, I'm unsure how this may plug in with an existing controller that intercepts Evict API calls. Would K8s be able to differentiate which controller is responsible for handling the Evict API if there are two interceptors? (I can find this out internally as well, just an open question right now).

charleskorn commented 3 months ago

I like your idea a lot, but that will certainly expand the scope of this change and I may not be able to get to it right away. Is this something you can take on?

Unfortunately this isn't a priority for me at the moment. If you're happy to have a go at it, then I'd be happy to review a PR once you've had a chance to pull it together.

For instance, I'm unsure how this may plug in with an existing controller that intercepts Evict API calls. Would K8s be able to differentiate which controller is responsible for handling the Evict API if there are two interceptors? (I can find this out internally as well, just an open question right now).

I assume the Eviction API behaves the same as any other Kubernetes API with regard to multiple admission webhooks, and that both webhooks would be invoked to mutate or validate the request. It's also possible to set rules on the webhook configuration (docs) that limit when the webhook is invoked, so that might be useful here too.
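For illustration only (the values here are assumptions, not from this PR), the webhook registration could be scoped so it only fires for CREATE requests on the `pods/eviction` subresource, leaving other resources and any other eviction interceptors unaffected:

```go
package main

import (
	"fmt"

	admissionregistrationv1 "k8s.io/api/admissionregistration/v1"
)

func main() {
	// Rule limiting when the validating webhook is invoked: only eviction
	// requests (which are CREATEs on the pods/eviction subresource).
	rule := admissionregistrationv1.RuleWithOperations{
		Operations: []admissionregistrationv1.OperationType{admissionregistrationv1.Create},
		Rule: admissionregistrationv1.Rule{
			APIGroups:   []string{""},
			APIVersions: []string{"v1"},
			Resources:   []string{"pods/eviction"},
		},
	}
	fmt.Printf("%+v\n", rule)
}
```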

aallawala commented 3 months ago

Totally understand, @charleskorn.

Does that preclude this change altogether (where a labeled pod can be picked up by the rollout operator to enact the deletion)?

charleskorn commented 3 months ago

Does that preclude this change altogether (where a labeled pod can be picked up by the rollout operator to enact the deletion)?

I don't think it would make sense to include this change as-is in rollout-operator given it will only solve for your specific use case, while the other idea above would solve the same problem and many other similar problems for all users.

aallawala commented 3 months ago

@charleskorn, I believe what you're asking is different from the original issue, #148.

Perhaps this PR (with additional work, of course) can tackle both problems:

  1. Pod evictions through a validating webhook
  2. Pod evictions when labeled

The code path for both problems will be the same, and both ways can be supported IMO. At the end of the day, the rollout-operator has complete awareness of what a safe eviction looks like for ingesters and store-gateways. So if that path can be ironed out (I'm happy to do that), then only the entrypoints for 1 and 2 will be what differs.

I believe Problem 1 can be expanded upon in a later PR (through the addition of the validating webhook), while I can address Problem 2.

charleskorn commented 3 months ago

Perhaps this PR (with additional work, of course) can tackle both problems:

  1. Pod evictions through a validating webhook
  2. Pod evictions when labeled

I think the two ideas are different to this description:

  1. (my suggestion above) Add a validating webhook to rollout-operator that runs on eviction API requests to allow or block eviction requests for pods belonging to a StatefulSet managed by rollout-operator. The eviction API still does the deletion of the pods itself. This mirrors the behaviour of PDBs.
  2. (this PR) Modify rollout-operator to respond to a label added to pods belonging to a StatefulSet managed by rollout-operator and delete them when it is safe to do so.

My 2c is that option 1 is the preferred option here, as it solves the underlying problem (PDBs can't adequately describe when it is safe to delete pods) for all use cases and all users. Option 2 only solves a problem for users such as yourself that have automation in place to label pods for deletion when desired, and doesn't solve the more general problem.

aallawala commented 3 months ago

I agree that your idea is the better approach. However, our team doesn't have the bandwidth to take it on at this time.

Thanks, @charleskorn, for your insight and consideration! I'll close this PR out in favor of an eventual validating-webhook solution.