kubernetes-sigs / karpenter

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.
Apache License 2.0

feat(disruption): add node notready controller #1755

Closed mariuskimmina closed 2 weeks ago

mariuskimmina commented 1 month ago

Fixes #1659

Description

We would like Karpenter to be able to terminate nodes that have been unreachable for too long. This has happened to us in the past and, as far as I can tell, spotio, for example, already handles this case. In our case, the node became unreachable when the kubelet on the node died.

This PR introduces a new NodePool field, unreachableTimeout, which can be set to, e.g., 10 minutes, so that Karpenter actively terminates a node once it has been unreachable for more than 10 minutes.

We called it the notready controller because NotReady is the state nodes are in when they become unreachable, but there might be a better name.

How was this change tested?

We added a test suite for this case and we also tested it on one of our EKS test clusters where we simulated a node becoming unreachable and had Karpenter mark the nodeclaim for deletion.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

linux-foundation-easycla[bot] commented 1 month ago

CLA Signed

The committers listed above are authorized under a signed CLA.

k8s-ci-robot commented 1 month ago

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: mariuskimmina

Once this PR has been reviewed and has the lgtm label, please assign ellistarn for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

- **[OWNERS](https://github.com/kubernetes-sigs/karpenter/blob/main/OWNERS)**

Approvers can indicate their approval by writing `/approve` in a comment.
Approvers can cancel approval by writing `/approve cancel` in a comment.

k8s-ci-robot commented 1 month ago

Welcome @mariuskimmina!

It looks like this is your first PR to kubernetes-sigs/karpenter 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/karpenter has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. :smiley:

k8s-ci-robot commented 1 month ago

Hi @mariuskimmina. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.

mariuskimmina commented 1 month ago

I think this does count as a corporate contribution. It's the first time our company has done this, though, so bear with me while I figure out the CLA stuff.

njtran commented 1 month ago

@mariuskimmina FYI, in case you haven't seen it, @engedaam opened an RFC that seems to tackle the same set of issues :) https://github.com/kubernetes-sigs/karpenter/pull/1768

mariuskimmina commented 1 month ago

> @mariuskimmina FYI, in case you haven't seen it, @engedaam opened an RFC that seems to tackle the same set of issues :) #1768

@njtran thanks for the heads up. His approach does seem better thought out. I am not sure how I should proceed from here.

engedaam commented 1 month ago

Hey @mariuskimmina, I'm currently planning on handling the implementation. This is a problem space we are trying to move quickly on to help solve for users. We can close this PR out. If you have the time, I would appreciate any and all feedback you can provide on both the RFC and the implementation.

k8s-ci-robot commented 3 weeks ago

PR needs rebase.

mariuskimmina commented 2 weeks ago

Closing in favor of https://github.com/kubernetes-sigs/karpenter/pull/1793