Thanks for the PR @aallawala!
I am wondering if a different approach would solve this problem - not just for your scenario but for all scenarios where a PDB is inadequate to describe what level of disruption a workload can tolerate. For example, we have a similar issue at Grafana Labs when performing Kubernetes node pool upgrades, where the PDB only allows us to stop a single pod from a workload at a time, even if it would be safe to stop multiple pods at once.
My idea is this: instead of using annotations and the rollout-operator's reconciliation loop to delete pods on demand, what if we instead implemented a validating webhook that runs on pod eviction requests for pods managed by rollout-operator? This would check if it is safe for that pod to be evicted, based on the rules that rollout-operator enforces for rollouts, and block the eviction if it is not safe to stop the pod right now.
This would effectively give us a custom PDB that is aware of the rules that rollout-operator enforces, and this means that tooling like `kubectl drain` will respect our desired behaviour. And, assuming the tooling you describe in https://github.com/grafana/rollout-operator/issues/148 uses the eviction API to drain nodes before recycling them, it would work for your scenario as well.
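For illustration, here is a minimal sketch of how such a webhook could be registered, using hypothetical names and a hypothetical service path (none of these exist in the repo today). Eviction requests surface to admission webhooks as CREATE operations on the pods/eviction subresource, which is what scopes this to evictions rather than plain deletions:

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: rollout-operator-pod-eviction                 # hypothetical name
webhooks:
  - name: pod-eviction.rollout-operator.grafana.com   # hypothetical name
    admissionReviewVersions: ["v1"]
    sideEffects: None
    # Evictions are CREATE requests against the pods/eviction subresource, so this
    # webhook is only consulted for evictions, never for plain pod deletions.
    rules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["CREATE"]
        resources: ["pods/eviction"]
    clientConfig:
      service:
        name: rollout-operator                         # hypothetical service and path
        namespace: mimir
        path: /admission/pod-eviction
```

The handler behind it would apply the same safety rules the rollout-operator already enforces during rollouts and return an allow or deny response, so `kubectl drain` and anything else using the eviction API would back off whenever stopping the pod isn't safe.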
(This Stack Overflow answer has a good overview of the difference between pod deletion and eviction requests.)
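To make that distinction concrete: an eviction is a request created against the pod's eviction subresource (this is what `kubectl drain` issues), and it is the call that PDBs, or a webhook like the one sketched above, can veto; a plain pod DELETE bypasses those checks. A minimal example of the request body, using a hypothetical pod name and namespace:

```yaml
# Created against /api/v1/namespaces/mimir/pods/ingester-zone-a-0/eviction
# (hypothetical namespace and pod name).
apiVersion: policy/v1
kind: Eviction
metadata:
  name: ingester-zone-a-0
  namespace: mimir
```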
How does that sound?
Hi @charleskorn, thanks so much for taking a look!
I'm familiar with eviction requests. That's actually what's causing us issues, since the vanilla Evict API will honor whatever PDB is set.
Our internal k8s infra layer has a ValidatingWebhook that intercepts Evict API calls and will either annotate or label a pod. Once annotated or labeled, it's up to an external control loop to handle the eviction, hence this PR. My approach is to integrate with this internal Evict API receiver and allow another control loop (in this instance, the rollout-operator) to perform the evictions safely for us.
I like your idea a lot, but that will certainly expand the scope of this change and I may not be able to get to it right away. Is this something you can take on? I'd also like to consult with our team internally to determine whether this is feasible for us.
For instance, I'm unsure how this may plug in with an existing controller that intercepts Evict API calls. Would K8s be able to differentiate which controller is responsible for handling the Evict API if there are two interceptors? (I can find this out internally as well, just an open question right now).
> I like your idea a lot, but that will certainly expand the scope of this change and I may not be able to get to it right away. Is this something you can take on?
Unfortunately this isn't a priority for me at the moment. If you're happy to have a go at it, then I'd be happy to review a PR once you've had a chance to pull it together.
> For instance, I'm unsure how this may plug in with an existing controller that intercepts Evict API calls. Would K8s be able to differentiate which controller is responsible for handling the Evict API if there are two interceptors? (I can find this out internally as well, just an open question right now).
I assume the Eviction API behaves the same as any other Kubernetes API with regards to multiple admission webhooks, and that both webhooks would be invoked to mutate or validate the request. It's also possible to set rules on the webhook configuration (docs) that limit when the webhook is invoked, so that might be useful here too.
Totally understand, @charleskorn.
Does that preclude this change altogether (where a labeled pod can be picked up by the rollout operator to enact the deletion)?
> Does that preclude this change altogether (where a labeled pod can be picked up by the rollout operator to enact the deletion)?
I don't think it would make sense to include this change as-is in rollout-operator given it will only solve for your specific use case, while the other idea above would solve the same problem and many other similar problems for all users.
@charleskorn, I believe what you're asking is different from the original issue, #148.
Perhaps this PR (with additional work, of course) can tackle both problems:

- Pod evictions through a validating webhook
- Pod evictions when labeled
The code path for both problems will be the same, and both ways can be supported IMO. At the end of the day, the rollout-operator has complete awareness of what a safe eviction looks like for ingesters and store-gateways. So if that path can be ironed out (I'm happy to do that), then only the entrypoints for 1 and 2 will be what differs.
I believe Problem 1 can be expanded upon in a later PR (through the addition of the validating webhook), while I can address Problem 2.
> Perhaps this PR (with additional work, of course) can tackle both problems:
>
> - Pod evictions through a validating webhook
> - Pod evictions when labeled
I think the two ideas are different to this description:
My 2c is that option 1 is the preferred option here, as it solves the underlying problem (PDBs can't adequately describe when it is safe to delete pods) for all use cases and all users. Option 2 only solves a problem for users such as yourself that have automation in place to label pods for deletion when desired, and doesn't solve the more general problem.
I agree that your idea is the better approach. However, our team does not have enough bandwidth to complete that at this time.
Thanks, @charleskorn, for your insight and consideration! I'll close this PR out in favor of an eventual solution towards a validating webhook.
Summary
This PR introduces a new pod-level label that can be added by any external operator in order to gracefully terminate a pod without causing unavailability for Mimir across a set of statefulsets (ingesters or store-gateways).
I've chosen to check for pods carrying the delete label only when there are no pods requiring a rotation due to a StatefulSet deploy. This way, rollouts for a given StatefulSet always take priority over label-driven deletions.
I also chose a label rather than the annotation called for in the original issue so that the existing selector code paths can be reused; I've found labels to be more idiomatic when searching for Kubernetes resources.
I'm happy to explore annotations if there's a strong desire to go that way; it would just require pulling down all the pods and filtering them in code rather than via the Kubernetes API.
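For a rough sense of the intended flow (the actual label key is defined in the PR diff, so the key below is only a placeholder): an external system labels the pod, and the rollout-operator later deletes it once doing so is safe.

```yaml
# Placeholder metadata patch showing a pod marked for deletion; the real label
# key lives in this PR's code, "grafana.com/delete-pod" is illustrative only.
metadata:
  labels:
    grafana.com/delete-pod: "true"
```

An external controller (or an operator running `kubectl label`) could apply this, after which the rollout-operator's reconciliation loop picks the pod up.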
I'm opening this PR for feedback. PTAL and let me know what changes I can make. Thank you.
Testing
I added three unit tests for three different behaviors:
I also tested this in our internal infrastructure:
Command issued:
Logs:
Fixes
https://github.com/grafana/rollout-operator/issues/148
Reviewers
@pracucci @rishabhkumar92