kubernetes-sigs / karpenter

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.

Do not evict running jobs #701

Open runningman84 opened 1 year ago

runningman84 commented 1 year ago

Tell us about your request

Right now, consolidation replaces nodes that still have running Jobs. This breaks a lot of deployments (doing DB upgrades or other work). One way to handle this is to put the do-not-evict annotation on the corresponding pods, but it is difficult to ensure that when you have dozens of teams using a given cluster.

Could this be handled in a more generic way?

All of these pods look like this: `Controlled By: Job/xxx-yyy-zzz-28031590`

Could Karpenter optionally just apply the do-not-evict behavior to running Job pods?

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?

It is difficult to ensure all pods have the do-not-evict annotation.

Are you currently working around this issue?

Manually setting the do-not-evict annotation on all corresponding pods.


engedaam commented 1 year ago

Can you describe your use case more? Would it not be possible to pass the annotation using `spec.template` in the Deployment?
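
For reference, a minimal sketch of what that placement looks like on a Job (the name and image below are illustrative; `karpenter.sh/do-not-evict` is the annotation discussed in this thread):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: db-migrate                          # illustrative name
spec:
  template:
    metadata:
      annotations:
        # Tells Karpenter not to voluntarily evict this pod while it is running
        karpenter.sh/do-not-evict: "true"
    spec:
      restartPolicy: Never
      containers:
        - name: migrate
          image: example.com/db-migrate:latest   # illustrative image
```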

tzneal commented 1 year ago

If you use Kyverno, there's an existing Kyverno policy to add the do-not-evict annotation to Job/CronJob pods automatically. https://kyverno.io/policies/karpenter/add-karpenter-donot-evict/add-karpenter-donot-evict/
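
For anyone not following the link, a rough sketch of the general shape of such a mutating policy is below; the version at the link is authoritative and may differ in its match conditions and patch (shown here for Jobs only):

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: add-karpenter-do-not-evict          # illustrative name
spec:
  rules:
    - name: add-do-not-evict-to-job-pods
      match:
        any:
          - resources:
              kinds:
                - Job
      mutate:
        # Inject the annotation into the Job's pod template so every pod it
        # creates carries it.
        patchStrategicMerge:
          spec:
            template:
              metadata:
                annotations:
                  karpenter.sh/do-not-evict: "true"
```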

ellistarn commented 1 year ago

We're considering this request as part of a batch of eviction-based features. @njtran to follow up.

Almenon commented 1 year ago

Just ran into a similar issue, but with self-hosted CircleCI pods. They basically act the same as Jobs (they should only run once). When Karpenter kills the node, the pods are killed and don't come back. This issue took a long time to figure out because I saw exit code 137 events from containerd, which normally indicates OOM, but in this case the culprit was actually Karpenter.

This is worth a warning in the docs, IMO

ellistarn commented 1 year ago

Related: https://github.com/aws/karpenter/issues/2391#issuecomment-1513585172

johngmyers commented 1 year ago

We create a PodDisruptionBudget which allows for zero disruptions and has all such jobs in its selector.

ellistarn commented 1 year ago

@johngmyers , can you post an example of the PDB you're using? I'm a big fan of driving eviction behavior through the PDB API, if at all possible.

johngmyers commented 1 year ago

I don't have access to it at the moment. It has a large `spec.minAvailable` and a selector for a label that we put on all such jobs. It basically makes the label equivalent to the do-not-evict annotation, except that it is recognized by the voluntary eviction API.
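
A minimal sketch of that shape, based on the description above (the label key, name, and value of `minAvailable` are illustrative, not what is actually deployed):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: protect-job-pods                    # illustrative name
spec:
  # A value larger than the number of matching pods that will ever be running,
  # so voluntary evictions of Running pods are always denied.
  minAvailable: 1000000
  selector:
    matchLabels:
      example.com/no-voluntary-eviction: "true"   # hypothetical label applied to all such job pods
```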

HansBraun commented 1 year ago

We have the same problem with Tekton Pipeline TaskRun pods. They get evicted after Karpenter has cordoned a node. The Pipeline has the annotation `karpenter.sh/do-not-evict: true` and passes it to the TaskRuns, but some TaskRun pods still get evicted after Karpenter consolidates a node.

lzyli commented 1 year ago

We have the same problem with CronJobs. We're running jobs on some dedicated nodes; if we use `karpenter.sh/do-not-evict: true` or a PDB, Karpenter just skips those nodes and does nothing, and the nodes are never deprovisioned. The deprovisioning flow for CronJobs should be something like this: cordon nodes -> wait for the CronJobs to finish -> drain nodes.

yangwwei commented 1 year ago

> We have the same problem with CronJobs. We're running jobs on some dedicated nodes; if we use `karpenter.sh/do-not-evict: true` or a PDB, Karpenter just skips those nodes and does nothing, and the nodes are never deprovisioned. The deprovisioning flow for CronJobs should be something like this: cordon nodes -> wait for the CronJobs to finish -> drain nodes.

100% agree with this

johngmyers commented 1 year ago

My understanding is that a voluntary eviction will succeed on a pod in the Succeeded or Failed state, so a PDB would work as desired on pods from Jobs.

Perhaps `karpenter.sh/do-not-evict: true` is incorrectly failing to evict pods in those states?

njtran commented 11 months ago

For comments in this issue regarding pods with do-not-evict being ignored, it's possible you were running into an issue fixed in v0.31.1 (PR: https://github.com/aws/karpenter-core/pull/583).

For others who are experiencing continual behavior of job pods terminating and rescheduling with do-not-evict blocking node disruption, we're tracking this effort through the first two issues in the mega issue here: https://github.com/aws/karpenter-core/issues/624

k8s-triage-robot commented 8 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

sftim commented 8 months ago

We could try to set a label on Jobs (and their Pods) to declare whether or not the Pods are trivially interruptible. That allows people to opt in to the behavior where Job-owned Pods don't block a node replacement.

Eventually, supplant this with a MutatingAdmissionPolicy (these are very new) or other mechanism, so that you get per-namespace control to select which Jobs do or don't get marked as interruptible.

That feels like it'd fit Karpenter / node autoscaling well. With that architecture, Karpenter doesn't have to know about the Job API; it just has to look at the Pods bound to a node and check their labels.


`karpenter.sh/do-not-evict` is not the same as `example.kubernetes.io/interruptible`, and it's not even the exact opposite in meaning. It would be weird to set both of them on a Pod, though.
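
To make the idea concrete, an opted-in Job might look something like the sketch below; the `example.kubernetes.io/interruptible` key is the hypothetical label being proposed here, not an existing API, and the name and image are illustrative:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: batch-report                        # illustrative name
spec:
  template:
    metadata:
      labels:
        # Hypothetical opt-in: this pod is trivially interruptible and
        # should not block a node replacement.
        example.kubernetes.io/interruptible: "true"
    spec:
      restartPolicy: OnFailure
      containers:
        - name: report
          image: example.com/report:latest  # illustrative image
```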

sftim commented 8 months ago

If we want that label key defined, Kubernetes can help.

ellistarn commented 8 months ago

Curious if a "disruption cost" annotation on the pod could help make progress towards this class of problem. I could see this helping for when consolidation makes an instance replacement for a marginal cost improvement, as well.

Bryce-Soghigian commented 8 months ago

/remove-lifecycle stale

sftim commented 8 months ago

> Curious if a "disruption cost" annotation on the pod could help make progress towards this class of problem. I could see this helping for when consolidation makes an instance replacement for a marginal cost improvement, as well.

We have `controller.kubernetes.io/pod-deletion-cost` supported in other parts of the project.
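
For context, that existing annotation is set on individual Pods as an integer string; pods with lower values are preferred for deletion when a ReplicaSet scales down. A minimal example (pod name, image, and value are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: worker-0                            # illustrative name
  annotations:
    # Existing Kubernetes annotation: relative cost of deleting this pod
    # compared to its peers; lower values are deleted first on scale-down.
    controller.kubernetes.io/pod-deletion-cost: "10000"
spec:
  containers:
    - name: worker
      image: example.com/worker:latest      # illustrative image
```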

sftim commented 8 months ago

One way Karpenter could support that: have a NodePool-level setting that maps Pod deletion cost values to actual money.

Either a scale factor (complex to teach), or a CEL expression (really complex to teach, if I'm honest). It needs to be NodePool level in case you have on-prem and cloud NodePools in the same cluster.


Related to this, maybe we'd like to add a node deletion cost annotation (which, like Pod deletion cost, would be an abstract integer value).
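
Purely to illustrate the shape being discussed, a hypothetical sketch follows; none of these fields exist in any Karpenter release, and the field names are invented for illustration only:

```yaml
# Hypothetical sketch only: these disruption fields are not part of the NodePool API.
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: cloud-pool                          # illustrative name
spec:
  disruption:
    # Hypothetical scale factor: converts abstract pod deletion cost units into
    # currency so consolidation can weigh savings against the cost of disruption.
    deletionCostPerUnit: "0.0001"
    # Hypothetical CEL alternative for more complex mappings:
    # deletionCostExpression: "podDeletionCost * 0.0001"
```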

ellistarn commented 8 months ago

What units is cost in for the current use? Isn't the node deletion cost the sum of its pods' costs? I could potentially see a config value in NodePool like `defaultDeletionCost`, potentially with a pod selector (including a field selector to grab owners like Job).

jonathan-innis commented 8 months ago

> I could see this helping for when consolidation makes an instance replacement for a marginal cost improvement, as well.

Maybe I'm getting lost in the conversation, but don't we already leverage the pod disruption cost to order the nodes when deciding which ones should be considered first for any disruption operation? (See https://github.com/kubernetes-sigs/karpenter/blob/main/pkg/controllers/disruption/helpers.go#L125.)

I'm also skeptical that this solves the problem that was originally raised in this issue. Isn't the proper fix here to ensure that Karpenter taints nodes before it validates which pods have scheduled to them, so that we can reduce the chance of race conditions around this checkpoint? Our biggest problem today is that we grab the nodes and the pods and then determine if there are any `karpenter.sh/do-not-disrupt` pods on the node, but we might have missed a pod scheduling during this time.

If we taint ahead of checking this, it seems like we just circumvent our leaky behavior with pods today.

runningman84 commented 8 months ago

@jonathan-innis I am also lost here. Is my original issue fixed with your ideas? I just cannot add the do-not-evict annotation to all Jobs without introducing additional complexity like kubemod.

sftim commented 8 months ago

@runningman84 if you're asking for Karpenter to solve this so you don't need to install a second tool, that game's not worth the candle (it's a lot of effort for Kubernetes contributors that isn't justified by the benefits to end users).

> It is difficult to ensure all pods have the do-not-evict annotation.

You can use a mutating admission webhook to achieve this; that is a very well-known approach.

k8s-triage-robot commented 5 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

evankanderson commented 4 months ago

We're seeing cases that look like the following with the Actions Runner Controller project for running GitHub Actions. It's reducing our confidence in our testing, and the actions generally finish within about 10 minutes. Setting `terminationGracePeriodSeconds` on the pods running the actions doesn't seem to help.

One odd thing is that we see the following pattern, which I think is intended to be consolidation:

```
Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  83s   default-scheduler  0/9 nodes are available: 1 node(s) had untolerated taint {CriticalAddonsOnly: true}, 8 Insufficient cpu. preemption: 0/9 nodes are available: 1 Preemption is not helpful for scheduling, 8 No preemption victims found for incoming pod.
  Normal   Scheduled         82s   default-scheduler  Successfully assigned arc-runners/arc-runner-set-ckxxm-runner-hfbfc to ip-10-0-1-69.ec2.internal
  Normal   Pulling           81s   kubelet            Pulling image "XXXX"
  Normal   Pulled            81s   kubelet            Successfully pulled image "XXXX" in 226ms (226ms including waiting)
  Normal   Created           81s   kubelet            Created container runner
  Normal   Started           81s   kubelet            Started container runner
  Normal   Nominated         4s    karpenter          Pod should schedule on: nodeclaim/default-tvqd9, node/ip-10-0-2-189.ec2.internal
  Normal   Killing           3s    kubelet            Stopping container runner
```

Note that the pod was Scheduled, and then Karpenter later came along and Nominated the pod to a different node.

Did the suggestion of a PodDisruptionBudget work / help others? It would be nice to have a common documented FAQ topic for this, as I imagine that using Karpenter to handle bursty "job-type" workloads is probably fairly common.

evankanderson commented 4 months ago

/remove-lifecycle stale

jmdeal commented 4 months ago

@evankanderson This sounds like the intended use case for the `karpenter.sh/do-not-disrupt` annotation (docs); is this occurring when job pods have this annotation? Karpenter shouldn't voluntarily disrupt a node if any bound pods have this annotation. There is a known race condition which can result in do-not-disrupt pods scheduling to nodes after the consolidation decision was validated. A fix was merged in v0.36.1 that largely mitigated this, but it can still occur occasionally; an in-flight change will block node termination on these pods, which should completely resolve this issue.
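
For anyone landing here, a minimal sketch of the annotation placement on a Job's pod template; this mirrors the earlier do-not-evict example but uses the current `karpenter.sh/do-not-disrupt` key (name and image are illustrative):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: ci-runner                           # illustrative name
spec:
  template:
    metadata:
      annotations:
        # Current annotation key: blocks voluntary disruption of the node
        # while this pod is running.
        karpenter.sh/do-not-disrupt: "true"
    spec:
      restartPolicy: Never
      containers:
        - name: runner
          image: example.com/runner:latest  # illustrative image
```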

Hafeez-Aqfer commented 1 month ago

@jmdeal We have been facing a similar race condition problem, wherein our k8s Job pods get scheduled right before the tainting of nodes happens, or right after the consolidation decision is made.

"An in-flight change will block node termination on these pods": what does that mean? How long will it take to get this fix implemented?