runningman84 opened this issue 1 year ago
Can you describe your use-case more? Would it not be possible to pass the annotation using the "spec.template" in the deployment?
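For illustration, a minimal sketch of what that looks like on a Job; the job name and image are placeholders:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: db-upgrade                  # hypothetical name
spec:
  template:
    metadata:
      annotations:
        # Tells Karpenter not to voluntarily evict this Job's pods.
        karpenter.sh/do-not-evict: "true"
    spec:
      restartPolicy: Never
      containers:
        - name: upgrade
          image: example.com/db-upgrade:latest  # placeholder image
```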
If you use Kyverno, there's an existing Kyverno policy to add the do-not-evict annotation to Job/CronJob pods automatically. https://kyverno.io/policies/karpenter/add-karpenter-donot-evict/add-karpenter-donot-evict/
We're considering this request as part of a batch of eviction-based features. @njtran to follow up.
Just ran into a similar issue, but with self-hosted CircleCI pods. They basically act the same as jobs (they should only run once). When Karpenter kills the node, the pods are killed and don't come back. This issue took a long time to figure out because I saw exit code 137 events from containerd, which normally indicates OOM, but in this case the culprit was actually Karpenter.
This is worth a warning in the docs, IMO
We create a PodDisruptionBudget which allows for zero disruptions and has all such jobs in its selector.
@johngmyers , can you post an example of the PDB you're using? I'm a big fan of driving eviction behavior through the PDB API, if at all possible.
I don't have access to it at the moment. It has a large "spec.minAvailable" and a selector for a label that we put on all such jobs. It basically makes the label equivalent to the do-not-evict annotation, except that it is recognized by the voluntary eviction API.
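A minimal sketch of the kind of PDB described here; the label key is an assumption for illustration:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: block-job-eviction
spec:
  # A minAvailable far larger than the number of matching pods means the
  # eviction API never permits a voluntary disruption of them.
  minAvailable: 100000
  selector:
    matchLabels:
      example.com/do-not-evict: "true"   # hypothetical label key
```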
We have the same problem with Tekton Pipeline TaskRun pods. They get evicted after Karpenter has cordoned a node. The Pipeline has the annotation "karpenter.sh/do-not-evict: true" and passes it to the TaskRuns, but some TaskRun pods were still evicted after Karpenter consolidated a node.
We have the same problem with cronjobs. We're running jobs on some dedicated nodes; if we use "karpenter.sh/do-not-evict: true" or a PDB, Karpenter just skips those nodes and does nothing, and the nodes are never deprovisioned. The deprovisioning flow for cronjobs should be something like this: cordon nodes -> wait for cronjobs to finish -> drain nodes.
> We have the same problem with cronjobs. We're running jobs on some dedicated nodes; if we use "karpenter.sh/do-not-evict: true" or a PDB, Karpenter just skips those nodes and does nothing, and the nodes are never deprovisioned. The deprovisioning flow for cronjobs should be something like this: cordon nodes -> wait for cronjobs to finish -> drain nodes.
100% agree with this
My understanding is that a voluntary eviction will succeed on a pod in "Succeeded" or "Failed" state, so a PDB would work as desired on pods from Jobs. Perhaps "karpenter.sh/do-not-evict: true" is incorrectly failing to evict pods in those states?
For the comments in this issue regarding pods with "do-not-evict" being ignored, it's possible you were running into an issue fixed in v0.31.1 (PR: https://github.com/aws/karpenter-core/pull/583). For others who are experiencing continual behavior of job pods terminating and rescheduling, with "do-not-evict" blocking node disruption, we're tracking this effort through the first two issues in the mega issue here: https://github.com/aws/karpenter-core/issues/624
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
We could try to set a label on Jobs (and their Pods) to declare whether or not the Pods are trivially interruptible. That allows people to opt in to the behavior where Job-owned Pods don't block a node replacement.
Eventually, supplant this with a MutatingAdmissionPolicy (these are very new) or other mechanism, so that you get per-namespace control to select which Jobs do or don't get marked as interruptible.
That feels like it'd fit Karpenter / node autoscaling well. With that architecture, Karpenter doesn't have to know about the Job API; it just has to look at the Pods bound to a node and check their labels.
"karpenter.sh/do-not-evict" is not the same as "example.kubernetes.io/interruptible", and it's not even the exact opposite in meaning. It would be weird to set both of them on a Pod, though. If we want that label key defined, Kubernetes can help.
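Purely as a sketch of that idea (the label key below is a placeholder from this discussion, not a defined API):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: batch-task                  # hypothetical name
spec:
  template:
    metadata:
      labels:
        # Opt this Job's pods in to being treated as trivially interruptible.
        example.kubernetes.io/interruptible: "true"
    spec:
      restartPolicy: OnFailure
      containers:
        - name: task
          image: example.com/task:latest  # placeholder image
```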
Curious if a "disruption cost" annotation on the pod could help make progress towards this class of problem. I could see this helping for when consolidation makes an instance replacement for a marginal cost improvement, as well.
/remove-lifecycle stale
> Curious if a "disruption cost" annotation on the pod could help make progress towards this class of problem. I could see this helping for when consolidation makes an instance replacement for a marginal cost improvement, as well.
We have "controller.kubernetes.io/pod-deletion-cost" supported in other parts of the project.
One way Karpenter could support that: have a NodePool-level setting that maps Pod deletion cost values to actual money.
Either a scale factor (complex to teach), or a CEL expression (really complex to teach, if I'm honest). It needs to be NodePool level in case you have on-prem and cloud NodePools in the same cluster.
Related to this, maybe we'd like to add a node deletion cost annotation (which, like Pod deletion cost, would be an abstract integer value).
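As a purely hypothetical sketch, such a mapping might look like the following; the "deletionCost" fields are invented for illustration and are not part of the Karpenter API:

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: on-demand
spec:
  disruption:
    # Invented fields, for illustration only.
    deletionCost:
      # Scale factor mapping abstract pod-deletion-cost units to
      # dollars per hour for this pool's instances.
      scaleFactor: 0.001
```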
What units is cost in for the current use? Isn't the node deletion cost the sum of the pods'? I could potentially see a config value in NodePool like "defaultDeletionCost", potentially with a pod selector (including a field selector to grab owners like Job).
> I could see this helping for when consolidation makes an instance replacement for a marginal cost improvement, as well.
Maybe I'm getting lost in the conversation, but don't we already leverage the pod disruption cost for ordering nodes when deciding which should be considered first for any disruption operation: https://github.com/kubernetes-sigs/karpenter/blob/main/pkg/controllers/disruption/helpers.go#L125?
I'm also skeptical that this solves the problem that was originally raised in this issue. Isn't the proper fix here to ensure that Karpenter taints nodes before it validates which pods have scheduled to them, so that we reduce the chance of race conditions around this checkpoint? Our biggest problem today is that we grab the nodes and the pods and then determine whether there are any "karpenter.sh/do-not-disrupt" pods on the node, but we might have missed a pod scheduling during this time. If we taint ahead of checking this, it seems like we circumvent our leaky behavior with pods today.
@jonathan-innis I am also lost here. Is my original issue fixed by your ideas? I just cannot add the do-not-evict stuff to all jobs without introducing additional complexity like KubeMod.
@runningman84 if you're asking for Karpenter to solve this so you don't need to install a second tool, that game's not worth the candle (it's a lot of effort for Kubernetes contributors that isn't justified by the benefits to end users).
> It is difficult to ensure all pods have the do-not-evict annotation.
You can use a mutating admission webhook to achieve this; doing that is a very well known approach.
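For example, a sketch in the spirit of the Kyverno policy linked earlier, written from memory (it mutates Pods owned by Jobs directly; see the linked policy for the canonical version):

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: add-karpenter-do-not-evict
spec:
  rules:
    - name: add-do-not-evict-to-job-pods
      match:
        any:
          - resources:
              kinds:
                - Pod
      preconditions:
        all:
          # Only mutate pods created by a Job controller.
          - key: "{{ request.object.metadata.ownerReferences[0].kind || '' }}"
            operator: Equals
            value: Job
      mutate:
        patchStrategicMerge:
          metadata:
            annotations:
              karpenter.sh/do-not-evict: "true"
```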
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
We're seeing cases that look like the following with the Actions Runner Controller project to run GitHub Actions. It's reducing our confidence in our testing, and generally the actions are done within about 10 minutes or so. TerminationGracePeriodSeconds on the pods running the actions doesn't seem to help.
One odd thing is that we see the following pattern, which I think is intended to be consolidation:
Events:
  Type     Reason            Age  From               Message
  ----     ------            ---  ----               -------
  Warning  FailedScheduling  83s  default-scheduler  0/9 nodes are available: 1 node(s) had untolerated taint {CriticalAddonsOnly: true}, 8 Insufficient cpu. preemption: 0/9 nodes are available: 1 Preemption is not helpful for scheduling, 8 No preemption victims found for incoming pod.
  Normal   Scheduled         82s  default-scheduler  Successfully assigned arc-runners/arc-runner-set-ckxxm-runner-hfbfc to ip-10-0-1-69.ec2.internal
  Normal   Pulling           81s  kubelet            Pulling image "XXXX"
  Normal   Pulled            81s  kubelet            Successfully pulled image "XXXX" in 226ms (226ms including waiting)
  Normal   Created           81s  kubelet            Created container runner
  Normal   Started           81s  kubelet            Started container runner
  Normal   Nominated         4s   karpenter          Pod should schedule on: nodeclaim/default-tvqd9, node/ip-10-0-2-189.ec2.internal
  Normal   Killing           3s   kubelet            Stopping container runner
Note that the pod was "Scheduled" and then Karpenter later came along and "Nominated" the pod to a different node.
Did the suggestion of a PodDisruptionBudget work / help others? It would be nice to have a common documented FAQ topic for this, as I imagine that using Karpenter to handle bursty "job-type" workloads is probably fairly common.
/remove-lifecycle stale
@evankanderson This sounds like the intended use case for the "karpenter.sh/do-not-disrupt" annotation (docs). Is this occurring when job pods have this annotation? Karpenter shouldn't voluntarily disrupt a node if any bound pods have this annotation. There is a known race condition which can result in "do-not-disrupt" pods scheduling to nodes after the consolidation decision was validated. A fix was merged in v0.36.1 that largely mitigated this, but it can still occur occasionally; an in-flight change will block node termination on these pods, which should completely resolve this issue.
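For reference, the annotation is set in the pod metadata (for Jobs, via the pod template); a minimal example with a placeholder name and image:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: ci-runner                  # hypothetical name
  annotations:
    # Blocks Karpenter's voluntary disruption of the node while this pod runs.
    karpenter.sh/do-not-disrupt: "true"
spec:
  containers:
    - name: runner
      image: example.com/runner:latest  # placeholder image
```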
@jmdeal We have been facing a similar race condition, wherein our k8s Job pods get scheduled right before the tainting of nodes happens, or right after the consolidation decision is made.
"An in-flight change will block node termination on these pods" - what does that mean? How long will it take for this fix to be implemented?
It means that before terminating a node, Karpenter will wait for the node's pods annotated with "karpenter.sh/do-not-disrupt" to finish, causing nodes running them not to disrupt.
Tell us about your request
Right now consolidation replaces nodes with running jobs. This breaks a lot of deployments (doing DB upgrades or other things). One way to handle this is putting do-not-evict annotations on the corresponding pods, but it is difficult to ensure that when you have dozens of teams using a given cluster.
Could this be handled in a more generic way?
All these jobs look like this: Controlled By: Job/xxx-yyy-zzz-28031590
Can Karpenter optionally just apply the do-not-evict behavior to running jobs?
Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?
It is difficult to ensure all pods have the do-not-evict annotation.
Are you currently working around this issue?
Manually setting do-not-evict on all corresponding pods.
Additional Context
No response
Attachments
No response