kubernetes-sigs / karpenter

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.
Apache License 2.0

Custom Deprovisioning Trigger Mechanism #688

Open sidewinder12s opened 11 months ago

sidewinder12s commented 11 months ago

Description

Allow users to trigger node Drift

What problem are you trying to solve?

In Karpenter pre-v0.28, I had started using the drift annotation karpenter.sh/voluntary-disruption=drifted as a way to force nodes to get replaced in an orderly fashion when I changed configuration that was not supported by Karpenter's drift detection.

In v0.28 this behavior was removed; Karpenter now simply strips the annotation.

I found the ability to trigger drift useful for testing and for filling in the gaps in drift support. I'd also assume that, long term, there will be corner cases where users want to trigger replacement that drift cannot detect, or cannot detect easily.

Perhaps just another annotation indicating user-requested drift, so that Karpenter can replace nodes in an orderly manner while respecting deprovisioning controls.
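
For illustration only, here is a rough client-go sketch of how such a marker might be applied in bulk; the annotation key karpenter.sh/user-requested-disruption is an assumption made up for this sketch, not an existing Karpenter API:

```go
// Hedged sketch: bulk-applying a hypothetical "user-requested disruption" annotation
// with client-go. The key "karpenter.sh/user-requested-disruption" is illustrative only.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Mark every node so a (hypothetical) deprovisioner could replace them serially
	// while still honoring deprovisioning controls.
	nodes, err := client.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	patch := []byte(`{"metadata":{"annotations":{"karpenter.sh/user-requested-disruption":"true"}}}`)
	for _, n := range nodes.Items {
		if _, err := client.CoreV1().Nodes().Patch(context.TODO(), n.Name,
			types.StrategicMergePatchType, patch, metav1.PatchOptions{}); err != nil {
			panic(err)
		}
		fmt.Println("annotated", n.Name)
	}
}
```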

How important is this feature to you?

njtran commented 11 months ago

While you could just do kubectl delete node, it sounds like you want your own custom way to mark nodes for Karpenter to eventually deprovision them. We discussed doing this when migrating all deprovisioning into the deprovisioning controller as it is now, but we haven't seen the signal to implement something like this yet.

In the near future, we'll release some additional work needed for the full machine migration, which moves these disruption decisions to the Machine Status conditions here: https://github.com/aws/karpenter-core/pull/319.

In all, I think it's reasonable to allow a custom controller to annotate nodes as needing deprovisioning, rather than letting users mess with the Expiration/Drift annotations.

May I ask what kind of custom signals you use to choose nodes as being deprovisioned? Is there a gap in the deprovisioning logic that we need to implement, or is it simply just a matter of filling in the gaps where Drift isn't implemented yet?

If you're also willing to design and implement this (both should be relatively simple), I'm happy to walk through this with you and get you ramped up on it.

sidewinder12s commented 11 months ago

I mean, right now there are a ton of gaps. It was quite useful that I could just annotate them all at once and Karpenter would do a serial replacement to minimize disruption in the cluster. It generally hasn't been clear to me how disruptive it'd be to just delete a pile of nodes wholesale.

Long term, I was sorta expecting there are always going to be gaps (is karpenter going to be able to drift on userdata changes for example?).

My current migration is dockerd -> containerd, which drift does not detect.

I don't think I have slack probably this quarter to get into this.

njtran commented 11 months ago

> Long term, I was sorta expecting there are always going to be gaps (is karpenter going to be able to drift on userdata changes for example?).

Yeah, we should be able to drift for all provisioning configurations in the future. This work is in-flight: documentation here, and the first PR of a couple here.

njtran commented 10 months ago

I updated the title to be more general based off our discussion.

This could look like users adding an annotation/label/taint for Karpenter to discover, which we'd then implement as another deprovisioner in the deprovisioning loop.
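
As a rough sketch of the deprovisioner side (names below are assumptions for illustration, not karpenter-core code), candidate selection could be as simple as checking for that marker; everything else (ranking, budgets, serial rollout) would be reused from the existing loop:

```go
// Rough sketch only: how an annotation-driven deprovisioner could pick candidates
// inside the disruption loop. The annotation key and Candidate type are illustrative.
package disruption

import corev1 "k8s.io/api/core/v1"

// Hypothetical annotation a user (or custom controller) would set on a node.
const UserRequestedAnnotation = "karpenter.sh/user-requested-disruption"

// Candidate stands in for whatever node wrapper the disruption loop uses.
type Candidate struct {
	Node *corev1.Node
}

// shouldDisrupt reports whether a candidate was explicitly marked by a user, so the
// normal orderly replacement machinery can handle it like any other disruption method.
func shouldDisrupt(c Candidate) bool {
	return c.Node != nil && c.Node.Annotations[UserRequestedAnnotation] == "true"
}
```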

sidewinder12s commented 10 months ago

One workflow that has come up that might be applicable for this is how to eventually force all pods that are using karpenter.sh/do-not-evict: "true" to get consolidated/drifted in a timely manner.

We'd found that these nodes end up getting left around, since Karpenter also doesn't cordon the nodes these pods are on to let them eventually drain out of the cluster.

Another thought would be to modify how this annotation is treated if maintenance-window support were added to Karpenter (ref: aws/karpenter-core#753). Operationally, allowing a maintenance window to force node replacements may be desirable for a lot of organizations.

dblackdblack commented 9 months ago

https://github.com/aws/karpenter-core/issues/688

> It generally hasn't been clear to me how disruptive it'd be to just delete a pile of nodes wholesale.

Exactly this. If this is actually a safe operation, I'd be happy to do that, but it just isn't obvious right now whether kubectl delete no is "safe" when I have a routine need to replace all nodes. As an example, there was just a 12h network problem in us-west-2 and I want to replace all the nodes in the cluster just to be safe.

If kubectl delete no is safe, could you update the docs? If it isn't safe, please also update the docs πŸ˜„

This is essentially the equivalent of an ASG's instance refresh feature, which has been notably missing since I switched to Karpenter.

yuvalavidor commented 9 months ago

> One workflow that has come up that might be applicable for this is how to eventually force all pods that are using karpenter.sh/do-not-evict: "true" to get consolidated/drifted in a timely manner.

I think the obvious use case here is a cluster upgrade where you want to upgrade your nodes. kubectl delete node is a way of doing that, but I would like to see a way that spins up a new node of the same size before the old node is deleted, to reduce the time pods spend waiting to be scheduled; with a StatefulSet it's even more time consuming.

And also, I'm actually asking: what would be the best way of doing this in that situation right now (0.30.0)?

sidewinder12s commented 9 months ago

> One workflow that has come up that might be applicable for this is how to eventually force all pods that are using karpenter.sh/do-not-evict: "true" to get consolidated/drifted in a timely manner.

> I think the obvious use case here is a cluster upgrade where you want to upgrade your nodes. kubectl delete node is a way of doing that, but I would like to see a way that spins up a new node of the same size before the old node is deleted, to reduce the time pods spend waiting to be scheduled; with a StatefulSet it's even more time consuming.

> And also, I'm actually asking: what would be the best way of doing this in that situation right now (0.30.0)?

I haven't really found any better way to deal with it than draining or deleting the node. I'm actually not sure whether Karpenter, on delete, would now launch a new node before draining the old one.

garvinp-stripe commented 8 months ago

Wanted to add more to this. Currently we have an internal system that marks nodes for termination for reasons Kubernetes has no idea about (SOX/compliance or CVE-related). We are building a controller to read these signals and annotate our nodes for termination. We realize that having two systems that kill nodes makes it difficult, if not impossible, to understand the termination budget of the fleet, so if we can get Karpenter to reap the nodes instead, we can maintain a realistic termination budget.
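
A minimal sketch of what such a controller could look like with controller-runtime, assuming a hypothetical internal Flagger feed and the same made-up karpenter.sh/user-requested-disruption key mentioned earlier in this thread:

```go
// Illustrative only: a controller that translates an external "must replace" signal
// (CVE, SOX, etc.) into a node annotation that Karpenter could then act on.
package compliance

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

const markForReplacement = "karpenter.sh/user-requested-disruption" // hypothetical key

// Flagger is whatever internal system decides a node must be rotated;
// it is assumed here and is not part of Karpenter.
type Flagger interface {
	MustReplace(nodeName string) bool
}

type Reconciler struct {
	client.Client
	Flagger Flagger
}

func (r *Reconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	node := &corev1.Node{}
	if err := r.Get(ctx, req.NamespacedName, node); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}
	if !r.Flagger.MustReplace(node.Name) {
		return ctrl.Result{}, nil
	}
	// Annotate instead of deleting, so Karpenter can enforce a single termination budget.
	patch := client.MergeFrom(node.DeepCopy())
	if node.Annotations == nil {
		node.Annotations = map[string]string{}
	}
	node.Annotations[markForReplacement] = "true"
	return ctrl.Result{}, r.Patch(ctx, node, patch)
}
```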

njtran commented 8 months ago

@garvinp-stripe or @sidewinder12s, let me know if you want to help implement this. I can help guide through this, and it should be fairly straightforward.

tasdikrahman commented 8 months ago

/assign

njtran commented 8 months ago

Posting thoughts here for you on how to approach this:

Design

Implementation

Currently, we iterate through the disruption methods, one at a time.

There's a Method interface that consists of
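
The list after "consists of" is cut off above, so the following is only a guessed-at sketch of the shape such a Method interface might take and where a user-requested method could slot in; the real karpenter-core names and signatures may differ:

```go
// Guessed-at sketch of the disruption Method interface; names and signatures
// are assumptions, not the actual karpenter-core definitions.
package disruption

import "context"

// Command stands in for the disruption decision a method produces.
type Command struct {
	CandidateNodeNames []string
}

// Method approximates what each disruption mechanism (expiration, drift,
// consolidation, ...) implements; the controller evaluates them in order.
type Method interface {
	ShouldDisrupt(ctx context.Context, nodeName string) bool
	ComputeCommand(ctx context.Context, candidates ...string) (Command, error)
}

// A user-requested method would be appended to that ordered list so the
// existing budgets and orchestration apply unchanged.
type userRequested struct{}

func (userRequested) ShouldDisrupt(_ context.Context, _ string) bool {
	// Would check the node for the hypothetical user-requested annotation.
	return false
}

func (userRequested) ComputeCommand(_ context.Context, candidates ...string) (Command, error) {
	return Command{CandidateNodeNames: candidates}, nil
}
```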

jeesmon commented 7 months ago

@tasdikrahman Any update on this? We are looking for a similar feature to bulk recycle nodes on-demand. Thanks!

garvinp-stripe commented 7 months ago

Sorry @njtran, I was out of town for a bit. Let me check with my team to see if I can spend some time on this.

tasdikrahman commented 7 months ago

Hey folks sorry for the delay, I am starting work on this, this week. Will keep you posted here.

jonathan-innis commented 7 months ago

@tasdikrahman Feel free to reach out if you need any assistance here. I know that @njtran is on PTO for a bit so he may be slower to respond.

garvinp-stripe commented 2 days ago

@njtran With forceful expiration coming back, do you think we can simply mark nodes as "expired" for this scenario?