bdols opened this issue 1 year ago
Just to make sure that I understand correctly: this is effectively an ask to ignore certain PDBs when evaluating nodes for eviction, because the PDB is in place so that Strimzi, rather than the standard eviction manager, can handle the eviction process when a node eviction is requested?
This definitely seems like an interesting use-case but I'm curious how common this use-case is in the ecosystem.
> does not necessarily correspond to the replicas being in sync
It seems odd to me that Kafka readiness wouldn't be based on the pod actually being fully replicated and being in-sync. Is this something that is being considered upstream in Kafka so that this node drain cleaner component doesn't have to be built on top of the existing eviction manager to handle this case?
> Just to make sure that I understand correctly: this is effectively an ask to ignore certain PDBs when evaluating nodes for eviction, because the PDB is in place so that Strimzi, rather than the standard eviction manager, can handle the eviction process when a node eviction is requested?
yes. I am not speaking for their project or anything.
> This definitely seems like an interesting use-case but I'm curious how common this use-case is in the ecosystem.
I can think of another possible use. Elasticsearch has an option in its shutdown API for removing/replacing nodes versus an in-place restart, which is more of a consideration for local hostPath volumes vs EBS. https://www.elastic.co/guide/en/elasticsearch/reference/current/put-shutdown.html
If Karpenter deprovisioning were configured with ttlSecondsUntilExpired, a 'replace' would be appropriate instead of the 'restart' default, but currently it appears that ECK only allows one setting, via an env var on the running container: https://www.elastic.co/guide/en/cloud-on-k8s/2.8/k8s-prestop.html
There may be other HA data storage services that handle their own orchestration. I had a thought that PodSpec could be enhanced with something like a "disruptableness" signal to indicate whether an application is unable to handle a disruption while still being ready to service requests. Or, to cover the Elastic case as well, "drainableness". For Kafka, I could see a pod marking itself as "drainable" once all the replicas are rebalanced off; a rough sketch of the idea follows.
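To make that concrete, here is a purely hypothetical sketch of such a signal expressed as pod annotations. None of these keys exist in Kubernetes, Karpenter, Strimzi, or ECK today; the annotation names, values, and image are made up for illustration only.

```yaml
# Hypothetical only: a pod-level "drainable" signal that the owning operator
# (Strimzi, ECK, etc.) would keep up to date, and that a deprovisioner could
# consult before requesting eviction.
apiVersion: v1
kind: Pod
metadata:
  name: my-cluster-kafka-0
  annotations:
    # "false" while partitions/replicas still need to be moved off this broker;
    # the operator flips it to "true" once the pod is safe to evict.
    example.io/drainable: "false"
    # Optional hint distinguishing an in-place restart from a node replacement,
    # mirroring the restart/remove/replace types in the Elasticsearch shutdown API.
    example.io/disruption-mode: "replace"
spec:
  containers:
    - name: kafka
      image: example.com/kafka:placeholder  # illustrative image only
```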
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
/reopen
@jonathan-innis: Reopened this issue.
This issue is currently awaiting triage.
If Karpenter contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.
The triage/accepted label can be added by org members by writing /triage accepted in a comment.
/remove-lifecycle rotten
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
Description
What problem are you trying to solve?
For Strimzi's Kafka Operator, achieving HA requires letting its operator drain pods based on its internal readiness. To do this, Strimzi recommends using their Drain Cleaner (Strimzi blog post announcement), which essentially creates a ValidatingWebhook for pod evictions and adds an annotation to the pods (kafka and zookeeper) that it manages. Using the Drain Cleaner therefore requires a PDB with maxUnavailable of 0, which prevents Karpenter from consolidating any node with kafka or zookeeper pods running on it. Karpenter creates a DeprovisioningBlocked event for this: "Cannot deprovision node due to pdb--zookeeper prevents pod evictions"
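For context, the PDB that blocks consolidation has roughly the following shape; the exact name and selector labels depend on the Kafka cluster name Strimzi generates them from, so the ones below are illustrative only.

```yaml
# Illustrative sketch of the PDB in play when the Strimzi Drain Cleaner is used.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-cluster-zookeeper
spec:
  # maxUnavailable: 0 makes every eviction request fail, which lets the
  # Drain Cleaner (rather than the eviction API) drive the rolling restart,
  # but it also blocks Karpenter from deprovisioning the node.
  maxUnavailable: 0
  selector:
    matchLabels:
      strimzi.io/name: my-cluster-zookeeper  # label value is illustrative
```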
Some potential features that could address this: provide a capability for the Provisioner to set a pod annotation that marks a pod for eviction for specific workloads, which might then remove the need for the Strimzi Drain Cleaner. I imagine this would need an accompanying timeout for successful eviction on cordoned nodes. Or, a Provisioner setting to ignore/override PDBs and drain the node anyway, but that seems like it would break the PDB contract.
How important is this feature to you?
The workaround for this is to have dedicated hardware sized for the workloads, along with a dedicated provisioner that sets node taints; but zookeeper, for example, may not require a lot of CPU/memory, so this can be considered idle waste. Ideally, it would be better to have a choice on whether these workloads can run on nodes capable of handling other workloads.
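A rough sketch of that workaround, using the same v1alpha5 Provisioner API that ttlSecondsUntilExpired belongs to; the provisioner name, taint key/value, and resource limits are placeholders, and the Kafka/ZooKeeper pods would additionally need a matching toleration and node selector.

```yaml
# Sketch of the dedicated-provisioner workaround: taint the nodes so that only
# pods tolerating the taint (the Kafka/ZooKeeper pods) can schedule onto them,
# keeping general workloads off the dedicated, right-sized hardware.
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: kafka-dedicated
spec:
  taints:
    - key: dedicated        # placeholder taint key/value
      value: kafka
      effect: NoSchedule
  # Optionally keep Karpenter from repeatedly attempting to consolidate
  # these nodes, since the maxUnavailable: 0 PDB blocks eviction anyway.
  consolidation:
    enabled: false
  limits:
    resources:
      cpu: "32"             # placeholder sizing for the Kafka/ZooKeeper footprint
      memory: 128Gi
```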