kubernetes-sigs / karpenter

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.
Apache License 2.0

Taint nodes with a NoSchedule for Consolidation before Validation begins #651

Open njtran opened 1 year ago

njtran commented 1 year ago

Description

What problem are you trying to solve? Karpenter adds a karpenter.sh/disruption:NoSchedule=disrupting taint for Consolidation actions only after a 15s validation period. There is a narrow window in which pods that block eviction can schedule onto a node just before Karpenter taints it, and those pods then end up being evicted during termination. This was validated to be a race of < 300ms here.

While https://github.com/aws/karpenter-core/issues/624 includes PreferNoSchedule as an option, it may make more sense to use only NoSchedule for Consolidation, which would remove this race condition completely.
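
Roughly, the proposal is to apply the NoSchedule taint as soon as a node becomes a consolidation candidate, before the validation wait, so the kube-scheduler cannot place new pods on it in the meantime. A minimal client-go sketch of that tainting step (illustrative only, not the actual Karpenter implementation; the helper name and clientset wiring are assumed):

```go
package example

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// taintCandidate adds the disruption taint so the kube-scheduler stops placing
// new pods on the node while the consolidation decision is being validated.
func taintCandidate(ctx context.Context, kube kubernetes.Interface, nodeName string) error {
	node, err := kube.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	for _, t := range node.Spec.Taints {
		if t.Key == "karpenter.sh/disruption" && t.Effect == corev1.TaintEffectNoSchedule {
			return nil // already tainted
		}
	}
	node.Spec.Taints = append(node.Spec.Taints, corev1.Taint{
		Key:    "karpenter.sh/disruption",
		Value:  "disrupting",
		Effect: corev1.TaintEffectNoSchedule,
	})
	_, err = kube.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{})
	return err
}
```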

billrayburn commented 10 months ago

/assign @njtran

ljosyula commented 10 months ago

/assign @ljosyula

jmdeal commented 9 months ago

/assign jmdeal

wmgroot commented 9 months ago

I'm working on implementing a patch to deal with this issue at my company. We're running into this constantly when using Karpenter for a cluster with almost nothing but ephemeral CI jobs, which seems like a recipe for constant disruption evaluation because the pod profile is highly volatile.

Can I ask that we prioritize an air-tight fix for the issue instead of another "reduce the chance of it occurring" option like the proposed PreferNoSchedule implementation? Right now the do-not-disrupt annotation does not do what it says on the tin.

njtran commented 9 months ago

@wmgroot Can you share how often you're hitting this? From my understanding the race window is very small, so I'm curious exactly how often it occurs. Are you seeing a do-not-disrupt pod added/scheduled to a node, only for that node to be consolidated soon after?

wmgroot commented 9 months ago

We've seen almost 2300 occurrences in the last week with Gitlab CI job pods. Every pod has the do-not-disrupt annotation. This cluster has about 25 NodeClaims which are primarily used for extremely volatile CI jobs. (Screenshot of the occurrence counts attached, dated 2024-02-14.)

Here is the patch I've applied to our clusters to address the issue, which verifies there are no do-not-disrupt pods after a simple 30s sleep. We've seen 0 occurrences of the issue since I applied the patch. https://github.com/wmgroot/karpenter/pull/1/files
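
The core idea is sketched with client-go below; the function and variable names are illustrative, not the exact ones in my branch:

```go
package example

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// validateNoDoNotDisruptPods waits for the scheduler to observe the taint, then
// re-lists the node's pods and aborts the disruption if any still-running pod
// carries the do-not-disrupt annotation.
func validateNoDoNotDisruptPods(ctx context.Context, kube kubernetes.Interface, nodeName string, delay time.Duration) error {
	time.Sleep(delay) // e.g. 30s in the patch linked above

	pods, err := kube.CoreV1().Pods("").List(ctx, metav1.ListOptions{
		FieldSelector: "spec.nodeName=" + nodeName,
	})
	if err != nil {
		return err
	}
	for _, pod := range pods.Items {
		// Pods that already finished can be ignored.
		if pod.Status.Phase == corev1.PodSucceeded || pod.Status.Phase == corev1.PodFailed {
			continue
		}
		if pod.Annotations["karpenter.sh/do-not-disrupt"] == "true" {
			return fmt.Errorf("pod %s/%s has karpenter.sh/do-not-disrupt, aborting consolidation of node %s",
				pod.Namespace, pod.Name, nodeName)
		}
	}
	return nil
}
```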

wmgroot commented 9 months ago

Here are some notes I've taken as I've explored this particular problem space. Hopefully this is helpful as background to anyone else trying to understand the problem.

I'd ultimately prefer we didn't use this annotation at all because it has node-level impact (a single pod can keep a node of arbitrary size around indefinitely). The limitations of Gitlab CI are currently making it necessary to use the annotation to complete a transition from cluster-autoscaler to Karpenter, which is still worth the tradeoff.

  1. Job-type workloads want a way to delay or avoid unnecessary disruption, since this results in lost work that must be performed again.
  2. There are multiple cases where pod shutdown is desired, but each situation might have a different preferred outcome. For example:
     a. Initial shutdown of a node due to efficiency consolidation or a regular maintenance update.
     b. Node lifetime timeout due to compliance requirements.
     c. User-initiated cancellation of a CI pipeline.
  3. The do-not-disrupt annotation is Karpenter-specific, making it not portable between different cluster architectures (which may or may not include cluster node autoscaling).
  4. PDBs are architecture agnostic, but are not a great fit for job-type workloads; they're designed for redundant pod patterns. In our clusters we have several OPA policies that forbid the creation of PDBs that do not tolerate voluntary disruption, blocking their use for single-pod cases.
  5. PreStop hooks are architecture agnostic, but make it difficult to support the desired outcomes for the cases described in point 2. They are not aware of "why" the pod is being requested to terminate.

Point 5 is the current reason we can't just use a preStop hook to address the disruption of CI jobs. The hook doesn't know the difference between Karpenter terminating the pod for node consolidation (where we want to delay disruption) and a user in the Gitlab UI cancelling a CI pipeline (where we want the pod to terminate immediately, but still run any post-job cleanup, which forced pod termination would skip). Adding the preStop hook, as Gitlab CI is currently implemented, would block termination of the job pod if a user attempted to cancel the job through the UI.

In summary I believe this issue is bigger than both Karpenter and Gitlab CI. The tooling we have in K8s is either not flexible enough to support the high disruption patterns we'd like to use in more efficient clusters, or it's not clear enough how to use the existing tooling.

mikkeloscar commented 9 months ago

We are also still seeing this issue from time to time, though less often than before the mitigation in https://github.com/kubernetes-sigs/karpenter/pull/583

I observed the following events:

{"eventNamespace":"default","involvedObject":{"kind":"Node","name":"ip-172-31-6-136.eu-central-1.compute.internal","uid":"473d160c-783a-427e-a75f-5c57c57e29df","apiVersion":"v1"},"reason":"DisruptionBlocked","eventMessage":"Cannot disrupt Node: Pod \"default/cdp-default-2ekrbha7wakskum3uuw492ts4y\" has \"karpenter.sh/do-not-disrupt\" annotation","source":{"component":"karpenter"},"firstTimestamp":"2024-02-26T16:41:54Z","lastTimestamp":"2024-02-26T16:41:54Z","count":1,"type":"Normal","reportingComponent":"karpenter"}

{"eventNamespace":"default","involvedObject":{"kind":"NodeClaim","name":"default-karpenter-fh9zb","uid":"42c74ed6-77b9-41e9-ae6f-b27b4e29feab","apiVersion":"karpenter.sh/v1beta1"},"reason":"DisruptionBlocked","eventMessage":"Cannot disrupt NodeClaim: Pod \"default/cdp-default-2ekrbha7wakskum3uuw492ts4y\" has \"karpenter.sh/do-not-disrupt\" annotation","source":{"component":"karpenter"},"firstTimestamp":"2024-02-26T16:41:54Z","lastTimestamp":"2024-02-26T16:41:54Z","count":1,"type":"Normal","reportingComponent":"karpenter"}

{"eventNamespace":"default","involvedObject":{"kind":"Node","name":"ip-172-31-6-136.eu-central-1.compute.internal","uid":"473d160c-783a-427e-a75f-5c57c57e29df","apiVersion":"v1"},"reason":"DisruptionTerminating","eventMessage":"Disrupting Node: Consolidation/Delete","source":{"component":"karpenter"},"firstTimestamp":"2024-02-26T16:41:54Z","lastTimestamp":"2024-02-26T16:41:54Z","count":1,"type":"Normal","reportingComponent":"karpenter"}

{"eventNamespace":"default","involvedObject":{"kind":"NodeClaim","name":"default-karpenter-fh9zb","uid":"42c74ed6-77b9-41e9-ae6f-b27b4e29feab","apiVersion":"karpenter.sh/v1beta1"},"reason":"DisruptionTerminating","eventMessage":"Disrupting NodeClaim: Consolidation/Delete","source":{"component":"karpenter"},"firstTimestamp":"2024-02-26T16:41:54Z","lastTimestamp":"2024-02-26T16:41:54Z","count":1,"type":"Normal","reportingComponent":"karpenter"}

{"eventNamespace":"default","involvedObject":{"kind":"Node","name":"ip-172-31-6-136.eu-central-1.compute.internal","uid":"473d160c-783a-427e-a75f-5c57c57e29df","apiVersion":"v1"},"reason":"FailedDraining","eventMessage":"Failed to drain node, 10 pods are waiting to be evicted","source":{"component":"karpenter"},"firstTimestamp":"2024-02-26T16:41:54Z","lastTimestamp":"2024-02-26T16:41:54Z","count":1,"type":"Warning","reportingComponent":"karpenter"}

{"eventNamespace":"default","involvedObject":{"kind":"Pod","namespace":"default","name":"cdp-default-2ekrbha7wakskum3uuw492ts4y","apiVersion":"v1"},"reason":"Evicted","eventMessage":"Evicted pod","source":{"component":"karpenter"},"firstTimestamp":"2024-02-26T16:41:55Z","lastTimestamp":"2024-02-26T16:41:55Z","count":1,"type":"Normal","reportingComponent":"karpenter"}

The first two events show that the pod cdp-default-2ekrbha7wakskum3uuw492ts4y is blocking the disruption, but the very next events, one second later, show the node being disrupted.

This pod is used to run e2e tests, and evicting it basically breaks our e2e run, so it's a similar disruption to the one @wmgroot described.

jonathan-innis commented 8 months ago

@mikkeloscar That's interesting. I wouldn't expect to see the behavior that you're describing. I could definitely see this race occurring when there is a pod that is just coming up and we don't see the binding to the new node yet, but it's odd that we would have a node that is blocked by a pod and then, one second later, see that that same pod is no longer blocking (with no bindings having changed).

Did the state of that pod change in any way? Did it shift to a new node or did it change its phase or status conditions to indicate that it finished?

mikkeloscar commented 8 months ago

Did the state of that pod change in any way? Did it shift to a new node or did it change its phase or status conditions to indicate that it finished?

I'm not 100% sure, as we rely on events shipped to our long-term storage, but among all the events for that pod I can only find the one about Karpenter evicting it; before that it was in a running state.

I looked at earlier events and the pod had only started 6s before the first disruption attempt, so that may also be relevant to consider:

{"eventNamespace":"default","involvedObject":{"kind":"Pod","namespace":"default","name":"cdp-default-2ekrbha7wakskum3uuw492ts4y","uid":"1e9e7393-68db-466b-9126-956d9c9830f7","apiVersion":"v1","fieldPath":"spec.containers{main}"},"reason":"Started","eventMessage":"Started container main","source":{"component":"kubelet","host":"ip-172-31-6-136.eu-central-1.compute.internal"},"firstTimestamp":"2024-02-26T16:41:48Z","lastTimestamp":"2024-02-26T16:41:48Z","count":1,"type":"Normal","reportingComponent":"kubelet","reportingInstance":"ip-172-31-6-136.eu-central-1.compute.internal"}

DanielQ-CV commented 8 months ago

I have the same problem with CI jobs/pods. The pods have do-not-disrupt & do-not-evict annotations. A pod is started a few seconds before Karpenter tries to consolidate the node ("message":"disrupting via consolidation delete, terminating 1 candidates"), and the pod then gets evicted.

JacobHenner commented 8 months ago

I wouldn't expect to see the behavior that you're describing. I could definitely see this race occurring when there is a pod that is just coming up and we don't see the binding to the new node yet, but it's odd that we would have a node that is blocked by a pod and then one second later we see that that same pod is no longer blocking (but no bindings have changed).

We're also seeing frequent issues under almost the same circumstances as https://github.com/kubernetes-sigs/karpenter/issues/651#issuecomment-1944397047 - GitLab CI executor pods being evicted frequently even though they have the do-not-disrupt annotation specified at pod creation time. Anecdotally (I don't have before/after metrics), the frequency of this issue increased after upgrading from v0.33.x to 0.35.0.

alex-berger commented 8 months ago

Same here, also with GitLab CI Job execution pods. Just chiming in to underline that this is a serious bug seriously affecting (karpenter) users in real life :-).

ospiegel91 commented 7 months ago

My team (at a big global company) is affected by this while using Argo Workflows (Job templates behind the scenes). Pods get an eviction notice before even entering the init stage; the do-not-disrupt annotation doesn't help.

jmdeal commented 7 months ago

I'm going to wager we're seeing this occur more frequently as users migrate to versions of Karpenter with parallel disruption. More rapid, parallel consolidation == more occurrences of this race condition. We've been punting on this issue in favor of a full, airtight taint strategy redesign, but given the increased prevalence we should take a temporary measure that may or may not be amended when we do the taint redesign addressing #624.

Realistically, I think @wmgroot's solution is a good route to go within our existing tainting behavior. I am curious whether we really need a full 30 second waiting period; given that the pod list is eventually consistent, we will need some sufficiently large waiting period. CAS implements a similar wait that defaults to 5 seconds but is configurable via the node-delete-delay-after-taint flag, though I haven't seen any guidance on when a user might need to tune this.
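
For illustration only, a hypothetical flag along the lines of the CAS one (Karpenter does not expose this today; the name and default are assumptions):

```go
package example

import (
	"flag"
	"time"
)

// nodeDeleteDelayAfterTaint mirrors cluster-autoscaler's --node-delete-delay-after-taint:
// how long to wait after tainting a node before re-checking it for do-not-disrupt pods.
var nodeDeleteDelayAfterTaint = flag.Duration(
	"node-delete-delay-after-taint",
	5*time.Second,
	"delay between tainting a consolidation candidate and validating that no do-not-disrupt pods scheduled onto it",
)
```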

wmgroot commented 7 months ago

Making it configurable just covers the case where someone might need to increase it in the future. I agree that 30s is likely much longer than it needs to be.

gnuletik commented 7 months ago

@wmgroot I tested your branch and it seems to have greatly improved the situation (I just made small changes: rebased onto the latest published tag, restored the PDB exclusion, and excluded failed/succeeded pods when comparing annotations).

However, I'm still seeing a few nodes being evicted when Karpenter is starting, here are the logs:

Starting workers
tainted node
deleted node
deleted nodeclaim

During the startup, I don't see the following message that was introduced in the PR:

waiting for 30 seconds to check for do-not-disrupt or do-not-consolidate pods that may have scheduled...

I can see it when the disruption controller is attempting to disrupt a node though.

Is there another place where nodes are being disrupted and we need to call the ValidateNoScheduleTaint() func?

Thanks!

jmdeal commented 7 months ago

These log lines should be coming from the termination controller which handles the eventual deletion of nodes instead of the disruption controller. If the disruption controller was responsible for deleting the nodes (rather than a manual user action) they should have already been tainted and validated by the disruption controller. I imagine what happened is when Karpenter was restarted a disruption decision had been made, candidates were tainted and validated, and then nodes had their deletion timestamps set. Karpenter restarted before the termination controller processed all of the nodes and it continued after it came back up. Does this line up with what you saw @gnuletik? Or were there instances of nodes with do-not-disrupt pods being disrupted?

gnuletik commented 7 months ago

Thanks for the feedback @jmdeal. Thanks also for the explanation of the termination controller.

There was no manual deletion of the nodes. There were instances of nodes with do-not-disrupt pods that were disrupted. I checked the logs from before Karpenter restarted and I didn't see any logs from the disruption controller deleting the node.

However, before Karpenter was restarted, there were multiple occurrences of:

Is it possible that Karpenter being restarted during the process of creating a new node leads to a pod eviction?

Also, Karpenter was restarted because there was a panic while trying to renew the lock (the apiserver was slow to answer because of heavy usage during this period):

panic: leader election lost

JacobHenner commented 7 months ago

As a temporary workaround I've implemented a validating webhook that rejects eviction requests from Karpenter if the pod is appropriately annotated. This seems to work, but is detrimental because (as I understand it) eviction will now be blocked for pods that are interrupted (e.g. spot reclaim), which is undesirable (but acceptable short-term in one particular case I'm dealing with).
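
Roughly, the decision logic of that webhook looks like the sketch below (simplified: HTTP/TLS serving and webhook registration are omitted, the Karpenter identity match is just a substring check on the username, and it fails open on lookup errors):

```go
package example

import (
	"context"
	"strings"

	admissionv1 "k8s.io/api/admission/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// reviewEviction denies pod eviction requests issued by Karpenter when the
// target pod carries the karpenter.sh/do-not-disrupt annotation.
func reviewEviction(ctx context.Context, kube kubernetes.Interface, req *admissionv1.AdmissionRequest) *admissionv1.AdmissionResponse {
	allow := &admissionv1.AdmissionResponse{UID: req.UID, Allowed: true}

	if req.SubResource != "eviction" || !strings.Contains(req.UserInfo.Username, "karpenter") {
		return allow
	}
	pod, err := kube.CoreV1().Pods(req.Namespace).Get(ctx, req.Name, metav1.GetOptions{})
	if err != nil {
		return allow // fail open so unrelated evictions are not blocked
	}
	if pod.Annotations["karpenter.sh/do-not-disrupt"] == "true" {
		return &admissionv1.AdmissionResponse{
			UID:     req.UID,
			Allowed: false,
			Result:  &metav1.Status{Message: "pod has karpenter.sh/do-not-disrupt; eviction rejected"},
		}
	}
	return allow
}
```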

@jmdeal do you have an estimate for when a short-term fix and a long-term fix will be available? We're looking to rapidly expand our use of Karpenter within our infrastructure and this issue is making it difficult to do so. Having an idea of when a short-term/long-term fix will be available will help us determine what our next steps should be.

DanielQ-CV commented 6 months ago

Hello, thank you for providing the fix. We installed release 0.36.1 on Friday and over the whole weekend there were no crashes in our CI/CD agents. The previous weekend we had an interruption almost every hour. The fix seems to be working well. Thank you very much!

armondressler commented 6 months ago

We've upgraded to 0.36.1 around 6 days ago, the issue hasn't reoccurred as of yet. Appreciate your work, thanks guys.

JacobHenner commented 6 months ago

I've also not received any reports of unexpected consolidation disruptions since upgrading to 0.36.1. However, I've seen drift disruptions on nodes with the do-not-disrupt annotation on one or more assigned pods, which I believe is unexpected. I've worked around this for now by disabling drift disruption.

jmdeal commented 6 months ago

Definitely not expected, how frequently are you seeing this occur? Do you have logs you can share? Maybe worth opening a separate issue?

armondressler commented 6 months ago

We've upgraded to 0.36.1 around 6 days ago, the issue hasn't reoccurred as of yet. Appreciate your work, thanks guys.

Looks like we're still seeing pods getting evicted (despite do-not-disrupt annotation) immediately after scheduling due to node consolidation, even when running v0.36.1.

Karpenter log entries for the node as follows:

{"level":"INFO","time":"2024-05-24T11:26:25.504Z","logger":"controller.nodeclaim.lifecycle","message":"registered nodeclaim","commit":"fb4d75f","nodeclaim":"application-amd64-bottlerocket-on-demand-v1-bzkbj","provider-id":"aws:///eu-central-1c/i-0c349508d7a941d34","node":"ip-10-48-37-168.eu-central-1.compute.internal"}
{"level":"INFO","time":"2024-05-24T11:26:55.171Z","logger":"controller.nodeclaim.lifecycle","message":"initialized nodeclaim","commit":"fb4d75f","nodeclaim":"application-amd64-bottlerocket-on-demand-v1-bzkbj","provider-id":"aws:///eu-central-1c/i-0c349508d7a941d34","node":"ip-10-48-37-168.eu-central-1.compute.internal","allocatable":{"cpu":"15890m","ephemeral-storage":"60711501722","hugepages-1Gi":"0","hugepages-2Mi":"0","memory":"28604942517","pods":"234"}}
{"level":"INFO","time":"2024-05-24T11:30:59.945Z","logger":"controller.disruption","message":"disrupting via consolidation delete, terminating 1 nodes (0 pods) ip-10-48-37-168.eu-central-1.compute.internal/c5a.4xlarge/on-demand","commit":"fb4d75f","command-id":"e2b9463c-d676-4c3e-9df8-1c00b5101946"}
{"level":"INFO","time":"2024-05-24T11:31:00.412Z","logger":"controller.node.termination","message":"tainted node","commit":"fb4d75f","node":"ip-10-48-37-168.eu-central-1.compute.internal"}
{"level":"INFO","time":"2024-05-24T11:31:06.694Z","logger":"controller.node.termination","message":"deleted node","commit":"fb4d75f","node":"ip-10-48-37-168.eu-central-1.compute.internal"}
{"level":"INFO","time":"2024-05-24T11:31:07.213Z","logger":"controller.nodeclaim.termination","message":"deleted nodeclaim","commit":"fb4d75f","nodeclaim":"application-amd64-bottlerocket-on-demand-v1-bzkbj","node":"ip-10-48-37-168.eu-central-1.compute.internal","provider-id":"aws:///eu-central-1c/i-0c349508d7a941d34"}

The k8s event log regarding the pod in question is rather straightforward:

10m         Normal    Scheduled                    pod/runner-svhlvxier-project-30216564-concurrent-0-r6wflsak                           Successfully assigned gitlab-runner/runner-svhlvxier-project-30216564-concurrent-0-r6wflsak to ip-10-48-37-168.eu-central-1.compute.internal
10m         Normal    Evicted                      pod/runner-svhlvxier-project-30216564-concurrent-0-r6wflsak   

I'd estimate this occurs once every 50 CI jobs. Do you require any other logs or loglevels to better analyze the issue?

jmdeal commented 5 months ago

Hmm, that seems pretty clear cut. There are some inflight changes that will mitigate the impact of these slipping through; #916 includes a change that will prevent pods with the do-not-disrupt annotation from being evicted during our drain process (with TerminationGracePeriod providing a mechanism for cluster operators to set an upper bound on node lifetime). Would this change be sufficient for your use case? The other option left for us is to have the overlapping taint / validation period, which will come at the cost of increased disruption latency since we would need to wait a few seconds to ensure the kube-scheduler has seen the taint.
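
Conceptually, the drain-side change amounts to leaving do-not-disrupt pods out of the eviction set, something like this sketch (names are illustrative, not the actual code in the PR):

```go
package example

import corev1 "k8s.io/api/core/v1"

// evictablePods returns the subset of a node's pods that a drain pass may evict;
// do-not-disrupt pods are skipped and allowed to run to completion, with the
// NodePool's terminationGracePeriod acting as the eventual upper bound.
func evictablePods(pods []corev1.Pod) []corev1.Pod {
	var out []corev1.Pod
	for _, pod := range pods {
		if pod.Annotations["karpenter.sh/do-not-disrupt"] == "true" {
			continue
		}
		out = append(out, pod)
	}
	return out
}
```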

armondressler commented 5 months ago

916 includes a change that will prevent pods with the do-not-disrupt annotation from being evicted during our drain process

The do-not-disrupt annotation being respected during drain should fix our issue. Let's see how it behaves when https://github.com/kubernetes-sigs/karpenter/pull/916 gets merged. Thanks for your hard work :)

Nuru commented 4 months ago

@jmdeal I'm wondering if we might be subject to this bug, or a similar bug, or if Karpenter properly handles this use case. It is difficult to catch happening, and I have not seen it yet (we only just started using this mechanism), but I think it is likely that Karpenter does not handle this properly, and I want to get it addressed. I don't see an easy way to monitor for it, and I do not want to put the effort into catching the bug if you can confirm from reviewing the code that the bug exists. Please let me know your thoughts.

Update: this is happening

I have now seen this series of events, with short enough intervals that I would call this a race condition:

  1. Pod deployed to Node X without annotation
  2. Pod adds annotation "karpenter.sh/do-not-disrupt": "true" to itself
  3. Karpenter (v0.37.0) decides Node X is underutilized, and logs "disrupting via consolidation delete" despite the annotated Pod
  4. Karpenter taints the Node with NoSchedule
  5. Something sends SIGTERM to the Pod while it is still running

This happens often enough that we cannot use WhenUnderutilized for NodePools running these Pods.

End of Update

We are running ephemeral GitHub Action Runners. These Pods take jobs off a queue, run the job, then exit. They take a while to start up (anywhere from 10 seconds to 3 minutes depending on what else is going on with the Node), so we keep a warm pool of idle runners ready to accept jobs, and then launch new Pods when those idle Pods start doing real work.

Idle Pods can be disrupted, so we schedule and deploy Pods with no annotations. We want the idle Pods to be disrupted if it will allow consolidation, because we can scale from a few to a few hundred and back, and do not want idle Pods scattered across big instances to prevent them from being consolidated to a single small instance.

The catch is, once a Pod takes a job, we want the job to run to completion and do not want it disrupted. GitHub Action Runners do not have an automatic retry mechanism (in part because jobs are long and complex and not likely idempotent), so if they are killed, it requires manual intervention to decide what to do about the incomplete job. So to avoid having them interrupted, the first thing the Pod does after taking a job is add the "do-not-disrupt" annotation to itself, as sketched below.
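
For reference, the self-annotation step is essentially the following client-go sketch (POD_NAMESPACE and POD_NAME injected via the downward API, and RBAC permission to patch pods, are assumed):

```go
package example

import (
	"context"
	"os"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// markSelfDoNotDisrupt patches the running pod with the do-not-disrupt
// annotation as soon as it has picked up a job.
func markSelfDoNotDisrupt(ctx context.Context) error {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		return err
	}
	kube, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		return err
	}
	patch := []byte(`{"metadata":{"annotations":{"karpenter.sh/do-not-disrupt":"true"}}}`)
	_, err = kube.CoreV1().Pods(os.Getenv("POD_NAMESPACE")).Patch(
		ctx, os.Getenv("POD_NAME"), types.StrategicMergePatchType, patch, metav1.PatchOptions{})
	return err
}
```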

Now in this case, the Pod is already on the Node, unaffected by any NoSchedule taints, and the Node looks like a consolidation candidate until the Pod adds the annotation. So my question is, in this case, will adding the annotation stop a consolidation in progress from going forward? If not, is that a bug?

If this use case is not already properly handled, how can I help get it handled (without writing code)?

javidaslan commented 2 months ago

Confirming @Nuru's comment, we are seeing the same behaviour with our Airflow jobs on v0.37.0.