Cordon node with a do-not-evict pod when ttlSecondsUntilExpired is met

cest-pas-faux commented 1 year ago

Tell us about your request

Regarding the do-not-evict annotation, it currrently prevents some nodes to be deprovisioned on our clusters.

I know, that's the goal and i'm fine with it, but would it be possible to cordon the node at least when the ttlSecondsUntilExpired is met ?

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?

We provision expensive nodes as a failback for some workloads with a ttl of 3 hours, and we saw some still running with a 6 hours uptime. The annotation was preventing the drain/deletion as cronjobs kept being scheduled on it, but a manual cordon fixed it within minutes.

Are you currently working around this issue?

No

Additional Context

No response

Attachments

No response

Community Note

Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
If you are interested in working on this issue or have submitted a pull request, please leave a comment

ellistarn commented 1 year ago

This seems fair to me. @njtran @jonathan-innis , thoughts?

cest-pas-faux commented 1 year ago

Actually it could be " Always cordon the node when the ttl is met, then proceed with the usual logic (to drain or not to drain)"

ellistarn commented 1 year ago

We've debated this logic. Right now, cordon/drain happens when a kubectl delete node is triggered. This causes the node to cross a point of no return. Once that cordon happens, it will eventually be deleted. I think this behavior is probably fine for expiration.

JTCunning commented 1 year ago

Hello, after upgrading from 0.16 to 0.19, we have also experienced an unexpected change in behavior between the ttl-based deprovisioner and the do-not-evict annotation. On 0.16, the deprovisioner would issue a node deletion request, cordon the node, and only drain the node once all do-not-evict-annotated pods were gone. Now observing 0.19, nodes with do-not-evict-annotated pods running will not receive the same deletion request, and will only receive the request once all pods are removed. For better or for worse, we've relied on this previous behavior and the loss of it currently blocks us from upgrading.

What would work for our team going forward is any sort of indication that Karpenter intends to delete a node that has do-not-evict-annotated pods, which was previously just checking for a node deletion timestamp. Whether that's restoring the previous behavior, cordoning the intended node, or some sort of karpenter.sh/evict-them-yourself annotation on the node that would signify Karpenter's intentions.

Please let me know if this is too far off-topic for the original issue and I will create a new one.

ellistarn commented 1 year ago

This is tricky, since there are a number of termination cases, and the desired behavior isn't always clear. One of the ideas we've come to is thinking in terms of voluntary and involuntary disruptions.

Voluntary Disruptions:

Drift (AMI is out of date, requirements no longer match provisioner, etc)
Consolidation (emptiness, replacement, etc)

Involuntary Disruption:

Spot Interruption
Node Health
Expiration (this one is debatably voluntary)

Potential termination behaviors:

[Opportunistic] Don't cordon the node until we think we can successfully drain all of its pods
[Eventual] Cordon the node, but don't drain it until we're sure we can drain all of them
[Graceful] Cordon the node, drain it, respect PDBs/do-not-evict
[Forceful] Cordon the node, drain it, don't respect PDBs/do-not-evict

We want to "do the right thing" wherever possible, and have this be something that users simply don't think about.

Is it possible for us to pick one behavior for each type of disruption?
Do users need to be able to configure different behaviors for different types of disruption?
If configurable, should it be per provisioner?

If I were to guess at the right set of defaults, it might be something like:

Drift -> Opportunistic
Consolidation -> Opportunistic
Spot Interruption -> Forceful
Node Health -> Forceful
Expiration -> ???

JTCunning commented 1 year ago

@ellistarn: I agree on all counts, and agree the hardest with your choices on the defaults for the termination behaviors.

I'd definitely consider overriding those defaults as "Advanced Usage", especially since overriding them would make the most sense on a per-provisioner basis. That being said, I can easily admit that @planetscale's dependence on some of these behaviors as creep towards "advanced".

The behavior you've outlined would be exactly what we're personally looking for. I can also see the ability to override these behaviors as a great way out of future issues where someone suggests your default choice is the wrong choice 🙃.

ellistarn commented 1 year ago

@JTCunning, @njtran reminds me that we actually do Eventual not Graceful as described. WDYT about that as the default behavior?

njtran commented 1 year ago

Expiration is currently Opportunistic, the same as upcoming Drift and current Consolidation. This was originally Eventual by relying only on the Termination logic which doesn't drain a node if it notices that any of the pods have a do-not-evict annotation.

While expiration used to be Eventual, we made it Opportunistic since a lot of users have expressed some concern with the speed of Expiration. Users who create multiple nodes at the same time with a scale-up event also saw their nodes all terminate at the same time with Eventual deprovisioning. Opportunistic implements some rate-limiting here.

I can see Expiration making sense as Opportunistic, Eventual, or Graceful depending on how users rely on Expiration.

Opportunistic: using Expiration for best-effort periodic syncs/recycling with requirements
Eventual: using Expiration for soft requirements on instance up-time
Graceful: using Expiration for hard requirements on instance up-time
Forceful: not viable in my opinion using Expiration for hard requirements - tolerating disrupting workloads

sftim commented 1 year ago

Related to this, I think I'd like to see a counter metric for amount of time that a node has exceeded it's TTL. Whilst that's climbing, there's a problem. If it stays climbing, there's a bigger problem.

I'm not sure if I'd want a series per provisioner or per (provisioner,node name) tuple.

Overall, that lets me as an operator manually intervene to implement Forceful, so I don't end up with a very stale node by accident.

njtran commented 1 year ago

Added a new issue @sftim for your use-case https://github.com/aws/karpenter-core/issues/724

jukie commented 10 months ago

Has there been any more discussions on this? Would PRs adding this behavior be accepted?

jonathan-innis commented 10 months ago

Has there been any more discussions on this

I think we should consider discussing this in the next working group. I'd like to see a design on this one considering the impact that this could have to the cluster, given there could be cases if we're not careful where all nodes on the cluster get tainted all at once.

As an example, if we have a set of nodes that are expired all at once and we cordon them all at once, is that a reasonable stance to take, knowing that those nodes will get eventually disrupted? Probably, assuming that our disruption logic can handle the disruption quick enough that have underutilization on that node isn't a problem, but what if that node has a fully-blocking PDB? What if that node has a do-not-evict pod and that pod never goes away? Do we need some forceful termination that happens after a node has been cordoned with a blocking pod for some amount of time? There might be some interactions here that we should consider with #743

jukie commented 10 months ago

Agreed there are some risks and also some relation to other proposed features under https://github.com/aws/karpenter/issues/1738 but I don't think that should limit from implementing this through a feature gate or by disabling as default. I think #752 could be handled in parallel if it's determined the scaling risks should be accounted for.

How do I see the schedule for bi-weekly's? Looks like the notes haven't been updated in awhile so it's unclear whether it's this coming week or next.

njtran commented 10 months ago

@jukie sorry for the unfortunate timing, as I just came off vacation. Just noticed you asked here, but we just had a WG an hour ago. You can check when and where these meetings are happening here: https://github.com/kubernetes-sigs/karpenter#community-discussion-contribution-and-support. You can message me or @jonathan-innis in the kubernetes slack as well if you want to discuss more.

jukie commented 10 months ago

Thanks, I'll reach out and can discuss further there!

k8s-triage-robot commented 7 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle stale
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

cest-pas-faux commented 7 months ago

/remove-lifecycle stale

jmdeal commented 6 months ago

/assign

Bryce-Soghigian commented 6 months ago

/lifecycle frozen

mseiwald commented 3 months ago

Is there any update on this? We are interested in this feature specifically for K8s upgrade situations where we do not want new pods to be scheduled to outdated nodes even if the node contains pods with the do-not-disrupt annotation.

sftim commented 3 months ago

Is there any update on this?

So far as I know, the issue history tracks progress here and isn't missing any important details.

ugurgural commented 1 week ago

Hi, any updates on here? Last time I heard on this from @jonathan-innis that it is still in debate.

kubernetes-sigs / karpenter