kubernetes-sigs / karpenter

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.
Apache License 2.0

Node Repair #750

Open jbouricius opened 2 years ago

jbouricius commented 2 years ago

Tell us about your request Allow a configurable expiration of NotReady nodes.

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard? I am observing some behavior in my cluster where occasionally nodes fail to join the cluster, due to some transient error in the kubelet bootstrapping process. These nodes stay in NotReady status. Karpenter continues to assign pods to these nodes, but the k8s scheduler won't schedule to them, leaving pods in limbo for extended periods of time. I would like to be able to configure Karpenter with a TTL for nodes that failed to become Ready. The existing configuration spec.provider.ttlSecondsUntilExpiration doesn't really work for my use case because it will terminate healthy nodes.

Are you currently working around this issue? Manually deleting stuck nodes.
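
For anyone automating that workaround in the meantime, below is a minimal sketch of a cleanup loop using client-go; the 15-minute threshold and the delete-the-Node approach are illustrative choices, not Karpenter behavior (deleting the Node object leaves instance termination to Karpenter's finalizer):

package main

import (
    "context"
    "fmt"
    "time"

    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/rest"
)

// notReadyTTL is an illustrative threshold, not a Karpenter setting.
const notReadyTTL = 15 * time.Minute

func main() {
    cfg, err := rest.InClusterConfig()
    if err != nil {
        panic(err)
    }
    client := kubernetes.NewForConfigOrDie(cfg)

    nodes, err := client.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
    if err != nil {
        panic(err)
    }
    for _, node := range nodes.Items {
        // Give new nodes time to bootstrap before judging them.
        if time.Since(node.CreationTimestamp.Time) < notReadyTTL {
            continue
        }
        for _, cond := range node.Status.Conditions {
            // Delete nodes whose Ready condition still isn't True past the TTL.
            if cond.Type == corev1.NodeReady && cond.Status != corev1.ConditionTrue {
                fmt.Printf("deleting stuck node %s\n", node.Name)
                _ = client.CoreV1().Nodes().Delete(context.TODO(), node.Name, metav1.DeleteOptions{})
            }
        }
    }
}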

Additional context Not sure if this is useful context, but I observed this error on one such stuck node. From /var/log/userdata.log:

Job for sandbox-image.service failed because the control process exited with error code. See "systemctl status sandbox-image.service" and "journalctl -xe" for details.

and then systemctl status sandbox-image.service:

  sandbox-image.service - pull sandbox image defined in containerd config.toml
   Loaded: loaded (/etc/systemd/system/sandbox-image.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Tue 2022-06-28 18:47:42 UTC; 2h 9min ago
  Process: 4091 ExecStart=/etc/eks/containerd/pull-sandbox-image.sh (code=exited, status=2)
 Main PID: 4091 (code=exited, status=2)

From reading other issues, it looks like this AMI script failed, possibly in the call to ECR: https://github.com/awslabs/amazon-eks-ami/blob/master/files/pull-sandbox-image.sh


ellistarn commented 2 years ago

related: https://kubernetes.slack.com/archives/C02SFFZSA2K/p1656533825820419?thread_ts=1656532759.059699&cid=C02SFFZSA2K

htoo97 commented 2 years ago

We recently started using Karpenter for some batch jobs and are running into this as well: nodes that get stuck in NotReady cause pods to never get scheduled. The underlying cause turned out to be full subnets, which kept the CNI pods from ever coming up, but regardless of the cause, big +1 to having a configurable way in Karpenter to ensure bad nodes get terminated automatically if they never become Ready within some TTL.

ellistarn commented 2 years ago

I'd love to get this prioritized. It should be straightforward to implement in the node controller.

htoo97 commented 2 years ago

If it's not being worked on internally yet, I can take a stab at this!

ellistarn commented 1 year ago

Related:

korenyoni commented 1 year ago

@tzneal makes a good point here that this auto-repair feature can potentially get out of hand if every node provisioned becomes NotReady, for example because of bad userdata configured at the Provisioner / NodeTemplate level.

This could possibly also be an issue with EC2 service outages.

Maybe you would have to implement some sort of exponential backoff per Provisioner to prevent this endless cycle of provisioning nodes that will always come up as NotReady.
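
As a rough sketch of that idea (nothing like this exists in Karpenter today; the type, method names, and 30-minute cap below are invented for illustration), a per-Provisioner backoff could be as simple as counting consecutive failed launches and pausing provisioning for a doubling interval:

package main

import (
    "fmt"
    "time"
)

// launchBackoff tracks consecutive NotReady launches per Provisioner and
// returns how long to pause provisioning for it. Purely illustrative.
type launchBackoff struct {
    failures map[string]int
}

func (b *launchBackoff) RecordFailure(provisioner string) {
    if b.failures == nil {
        b.failures = map[string]int{}
    }
    b.failures[provisioner]++
}

func (b *launchBackoff) RecordSuccess(provisioner string) {
    delete(b.failures, provisioner)
}

// Delay doubles with each consecutive failure and is capped at 30 minutes.
func (b *launchBackoff) Delay(provisioner string) time.Duration {
    n := b.failures[provisioner]
    if n == 0 {
        return 0
    }
    if n > 6 {
        n = 6 // cap the exponent so the delay maxes out around 30m
    }
    d := time.Duration(1<<uint(n-1)) * time.Minute // 1m, 2m, 4m, ...
    if d > 30*time.Minute {
        d = 30 * time.Minute
    }
    return d
}

func main() {
    var b launchBackoff
    b.RecordFailure("default")
    b.RecordFailure("default")
    fmt.Println(b.Delay("default")) // 2m0s after two consecutive failures
}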

wkaczynski commented 1 year ago

:+1: We're occasionally seeing cases where a node has been launched but never properly initialized (it never gets karpenter.sh/initialized=true). Because these nodes are treated as capacity that is already arranged to be available in the future, they can prevent cluster expansion and cause pods to be permanently stuck, constantly nominated by Karpenter to run on a node that will never complete initialization.

ellistarn commented 1 year ago

@wkaczynski, it's a bit tricky.

dschaaff commented 1 year ago

We currently have a support ticket open with AWS for occasional bottlerocket boot failures on our Kubernetes nodes. The failure rate is very low and it's important that we are able to get logs off a node and potentially take a snapshot of the volumes. In this scenario it's vital that we can opt out of Karpenter auto removing the node. I'd be in favor of this at least being configurable so users can decide.

ellistarn commented 1 year ago

@njtran re: the behaviors API.

wkaczynski commented 1 year ago

it's vital that we can opt out of Karpenter auto removing the node

If we delete nodes that fail to initialize, users will have a hard time debugging

I also think that if we do decide to delete nodes that failed to initialize, there should be an option to opt out so we can debug (or, if we don't delete by default, an opt-in option to enable the cleanup).

The cleanup does not even need to be a provisioner config - initially, until there is a better way to address this issue, it could be enabled via a Helm chart value and exposed as either a ConfigMap setting, a command-line option, or an env var.

Another thing: are these nodes considered in-flight indefinitely? If so, is there currently an option for at least an in-flight status timeout? If there isn't an option for these nodes to stop being considered in-flight, do I understand correctly that this can effectively block cluster expansion even after a one-off node initialization failure (which we sometimes experience with AWS)?

If we ignore nodes that fail to initialize, you can get runaway scaling.

There are cases in which runaway scaling is preferable to a service interruption; it would be good to have an opt-in cleanup option.

ellistarn commented 1 year ago

Another thing is - are these nodes considered as in-flight indefinitely ?

I like to think about this as ttlAfterNotReady. @wkaczynski, do you think this is reasonable? You could repurpose the same mechanism to cover cases where nodes fail to connect, or eventually disconnect. We'd need to be careful to not kill nodes that have any pods on them (unless they're stuck deleting), since we don't want to burn down the fleet during an AZ outage. I'd also like this to fall under our maintenance windows controls as discussed in aws/karpenter#1738.
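
Sketching what such a ttlAfterNotReady guard might look like (the function name, its parameters, and the "only already-terminating pods" rule are assumptions drawn from this comment, not Karpenter code):

package repair

import (
    "time"

    corev1 "k8s.io/api/core/v1"
)

// shouldRepair returns true only when the node has been NotReady for longer
// than ttl and carries no pods other than ones already stuck terminating, so
// that an AZ-wide outage does not cause the whole fleet to be torn down.
func shouldRepair(node *corev1.Node, pods []corev1.Pod, ttl time.Duration) bool {
    notReadyLongEnough := false
    for _, cond := range node.Status.Conditions {
        if cond.Type == corev1.NodeReady &&
            cond.Status != corev1.ConditionTrue &&
            time.Since(cond.LastTransitionTime.Time) >= ttl {
            notReadyLongEnough = true
        }
    }
    if !notReadyLongEnough {
        return false
    }
    for _, p := range pods {
        // A pod that is not already terminating anchors the node; leave it alone.
        if p.DeletionTimestamp == nil {
            return false
        }
    }
    return true
}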

jalaziz commented 1 year ago

We recently ran into an issue where an EC2 node failed to become "Ready". We reached out to the AWS support team and they mentioned it was an OS level issue where the instance failed to talk to IMDS.

The end result was a bunch of pods waiting to be scheduled because Karpenter thought the node would eventually become "Ready". It was stuck that way for 5 hours before we manually terminated the node.

wkaczynski commented 1 year ago

Another thing is - are these nodes considered as in-flight indefinitely ?

I like to think about this as ttlAfterNotReady. @wkaczynski, do you think this is reasonable?

This seems reasonable but I guess the case where nodes fail to initialize is more clear cut and probably a lot easier to address - these nodes are generally safe to be deleted as there is nothing running on them yet.

For clusters with spiky workloads, the lack of any solution for nodes that fail to initialize (combined with considering them in-flight) can be a regular cause of outages, as the cluster will not be expanded until the nodes are manually removed.

I understand that in the worst case - if the node startup issues are permanent (for instance due to misconfiguration):

You could repurpose the same mechanism to cover cases where nodes fail to connect, or eventually disconnect. We'd need to be careful to not kill nodes that have any pods on them (unless they're stuck deleting), since we don't want to burn down the fleet during an AZ outage.

It makes sense to (eventually) have a unified way of addressing this, but I fear that if we aim to start with this unified solution, it will take a lot more time to address (as the Ready -> NotReady cases are not as clear-cut), and the failed-initialization cases tend to occur more frequently (at least in our case, where we have high node churn).

billrayburn commented 1 year ago

The current plan is that Node Ownership will reduce the occurrence of nodes not registering, and introducing ttlAfterNotRegistered in https://github.com/aws/karpenter-core/pull/191 will also help with the node repair issue. You can track work for Node Ownership and enabling ttlAfterNotRegistered in https://github.com/aws/karpenter-core/pull/176 .

sidewinder12s commented 1 year ago

Another option that might help with debugging would be to introduce an annotation for nodes that blocks Karpenter from doing anything with them. We've added that to our internal autoscaler, and I've used termination protection in the past to block OSS Cluster Autoscaler actions so we can keep a node around for debugging.
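
A gate like that could be a single annotation check in front of any repair action; the annotation key below is invented for illustration and is not one Karpenter recognizes:

package repair

import corev1 "k8s.io/api/core/v1"

// skipRepairAnnotation is a made-up key for illustration only.
const skipRepairAnnotation = "example.com/skip-auto-repair"

// repairAllowed reports whether auto-repair may touch this node, letting
// operators pin a broken node for debugging before it gets recycled.
func repairAllowed(node *corev1.Node) bool {
    return node.Annotations[skipRepairAnnotation] != "true"
}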

DaspawnW commented 1 year ago

I would like to mention that aws/karpenter#3428 also brought up the topic of unhealthy nodes, not only during initialization but also during normal operation.

So far we have usually run our workloads via autoscaling groups and a LoadBalancer with some kind of health check, or via autoscaling groups that run a custom health-check script; if the check fails, the ASG terminates the instance and replaces it with a new node. I currently see no way for Karpenter to handle this, right?

runningman84 commented 1 year ago

Can we already use this setting?

ttlAfterNotRegistered: 15m

We did not find it in the current karpenter documentation...

korenyoni commented 1 year ago

Can we already use this setting?

ttlAfterNotRegistered: 15m

We did not find it in the current karpenter documentation...

https://github.com/aws/karpenter-core/pull/176 was only merged on March 8th, a day after 0.27.0 was released.

So I think you need to wait for 0.28.0.

@jonathan-innis can you please confirm?

korenyoni commented 1 year ago

For example, you can see that 0.27.0 still handles only the Node resource and not both v1alpha5.Machine and Node:

https://github.com/aws/karpenter-core/blob/7d58c3cee0fa997750d09a43b2037c69437857e3/pkg/controllers/deprovisioning/consolidation.go#L189-L273

Cross-reference it with the equivalent unreleased code:

https://github.com/aws/karpenter-core/blob/62fe4ac537b8e381bbb11bd344bb2f05850cb619/pkg/controllers/deprovisioning/consolidation.go#L106-L190

jonathan-innis commented 1 year ago

Can we already use this settings?

ttlAfterNotRegistered: 15m

We did not find it in the current karpenter documentation...

You won't be able to use this setting yet. The release of this setting is tied to the release of the *v1alpha5.Machine which captures the work that is needed to fix a lot of the issues that are tied to Node Ownership. Once the changes that utilize the *v1alpha5.Machine go in, this feature will be released and updates will be made to the documentation that show how you can use it. For now, it's mentioned in the core chart only as a preparatory measure before the release of the Machine.

For now, we don't have an estimate on when this work will get done but it is being actively worked on and progress should be made here relatively soon. I'll make sure to update this thread when the feature gets formally released.

maximethebault commented 1 year ago

Looks like the Machine Migration PR went in! Fantastic work!

Does that mean we will be able to use the new setting as soon as the next release of Karpenter?

jonathan-innis commented 1 year ago

@maximethebault Yes! The new timeout mechanism will be 15m for a node that doesn't register to the cluster and should be in the next minor version release.

jonathan-innis commented 1 year ago

We've also been discussing extending this mechanism more broadly to surface a ttlAfterNotReady value in the Provisioner, with a reasonable default value, and then tearing down a Machine if it hasn't been ready for that period of time and is empty (according to Karpenter's definition of emptiness). This should solve a lot of the auto-repair issues that users are hitting, whether with nodes that never become initialized or ready, or with nodes that sit in a NotReady state for an extended period of time.

maximethebault commented 1 year ago

We've also been discussing extending this mechanism more broadly to surface a ttlAfterNotReady value in the Provisioner, with a reasonable default value, and then tearing down a Machine if it hasn't been ready for that period of time and is empty (according to Karpenter's definition of emptiness). This should solve a lot of the auto-repair issues that users are hitting, whether with nodes that never become initialized or ready, or with nodes that sit in a NotReady state for an extended period of time.

Sounds good!

What would happen in the scenario described in this issue though? As the pods are in a terminating state but still there, I suppose the node will not be considered empty?

jonathan-innis commented 1 year ago

As the pods are in a terminating state but still there, I suppose the node will not be considered empty

If the pods are terminating, then they will be considered for emptiness. You can see our pod filtering logic here
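
For readers without the linked code handy, the idea is roughly that pods which are already terminating, already finished, or owned by a DaemonSet don't keep a node from being considered empty; the sketch below paraphrases that idea and is not the actual Karpenter filter:

package emptiness

import (
    corev1 "k8s.io/api/core/v1"
)

// reschedulable returns only the pods that should block emptiness: pods that
// are terminating, finished, or owned by a DaemonSet are filtered out.
func reschedulable(pods []corev1.Pod) []corev1.Pod {
    var out []corev1.Pod
    for _, p := range pods {
        if p.DeletionTimestamp != nil {
            continue // already terminating
        }
        if p.Status.Phase == corev1.PodSucceeded || p.Status.Phase == corev1.PodFailed {
            continue // already finished
        }
        if ownedByDaemonSet(&p) {
            continue // recreated on any node anyway
        }
        out = append(out, p)
    }
    return out
}

func ownedByDaemonSet(p *corev1.Pod) bool {
    for _, ref := range p.OwnerReferences {
        if ref.Kind == "DaemonSet" {
            return true
        }
    }
    return false
}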

marksumm commented 1 year ago

@jonathan-innis Does the newly released v0.28.0 actually contain the 15m timeout? It didn't seem to be mentioned anywhere.

gaussye commented 1 year ago

Hello there, we are also hitting this problem and want to know when the ttlAfterNotRegistered configuration will be ready.

nirroz93 commented 1 year ago

I also searched for this config, and it seems that this PR https://github.com/aws/karpenter-core/pull/250/files changed it to a hardcoded value of 15 minutes, which appears in https://github.com/aws/karpenter-core/blob/main/pkg/controllers/machine/lifecycle/liveness.go#L38 (fine for my use case; just noting it for others searching for this).
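
In other words, the registration-liveness behavior boils down to a check along these lines (a simplified sketch; the real controller works off the Machine's status conditions rather than a boolean):

package liveness

import (
    "time"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// registrationTTL mirrors the hardcoded 15-minute value referenced above.
const registrationTTL = 15 * time.Minute

// expired reports whether a machine that has not yet registered a node should
// be terminated: created is the machine's creation timestamp and registered
// says whether its node has joined the cluster.
func expired(created metav1.Time, registered bool) bool {
    return !registered && time.Since(created.Time) > registrationTTL
}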

marksumm commented 1 year ago

@nirroz93 Thanks for pointing it out. If I am reading the change history correctly, then the default 15m timeout should already have been included since v0.27.1 of Karpenter (which uses the same version of karpenter-core). This is bad news for us, since the last example of a stuck node post-dates an upgrade to v0.27.3 by almost a month.

nirroz93 commented 1 year ago

@marksumm - yes, but this is part of the Machine controller mechanism (not the Node controller), which was only included in 0.28.

nirroz93 commented 1 year ago

BTW, for the Karpenter maintainers: it would be nice if the release notes linked to the upgrade notes in the docs (https://karpenter.sh/docs/upgrade-guide/#upgrading-to-v0280, for example).

marksumm commented 1 year ago

I did some testing with v0.28.1 and the default node registration TTL seems to be working as expected:

2023-06-28T15:09:37.185Z DEBUG controller.machine.lifecycle terminating machine due to registration ttl {"commit": "30fa8f3-dirty", "machine": "xxxxxxxx-k4xm5", "provisioner": "xxxxxxxx", "ttl": "15m0s"}

LolloneS commented 1 year ago

For my understanding, since I am facing a similar issue with 0.29.2:

Correct? Or am I missing something?

njtran commented 1 year ago

Almost. We hard-code ttlAfterNotRegistered as 15 minutes. ttlAfterNotReady is the topic of discussion in this issue and hasn't been implemented.

sylr commented 10 months ago

I'm very much interested in ttlAfterNotReady.

I have a use case where it is a big problem when a node becomes NotReady and then becomes Ready again after the scheduler has already decided to reschedule the pods that were on the unreachable node.

I could really use ttlAfterNotReady in order to remove unreachable nodes swiftly.

Thank you.

hitsub2 commented 9 months ago

In my case, when providing kubelet args (currently not supported by Karpenter), some nodes (2 out of 400) never become ready and Karpenter cannot disrupt them, leaving them around forever. After changing AMIFamily to Custom, this issue did not happen again.

tip-dteller commented 9 months ago

Hi, I've been referred to this ticket and am adding the case we've hit.

Description

Observed Behavior: Background: a Windows application deployed on Karpenter Windows nodes - AMI family 2019, on-demand nodes. When the application becomes unresponsive and enters CrashLoopBackOff, it breaks containerd 1.6.18. The given error is:

rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing open //./pipe/containerd-containerd: The system cannot find the file specified."

Why does it matter for this case, and how is it relevant to Karpenter?

The node was created and was functional: it was in the "Ready" state as noted by Karpenter, and the Windows pods were successfully scheduled onto it. So far so good. When the application unexpectedly broke containerd, and subsequently the kubelet, the node entered the "NotReady" state.

During this cycle, the node is not deprovisioned, and Karpenter reports this in the events:

karpenter  Cannot deprovision Node: Nominated for a pending pod

Summary of events:

  1. Deploy the Windows application.
  2. Application pods are in the Pending state.
  3. Karpenter provisions a functional Windows node.
  4. Application loads and enters the Running state (it's functional).
  5. Application breaks containerd after some time.
  6. Node becomes unresponsive.
  7. Karpenter cannot deprovision the node because the old pods have terminated and the new pods are Pending.

Expected Behavior:

Detect that the node is unresponsive and roll it (i.e., create a replacement node).

Reproduction Steps (Please include YAML): Provisioner:

---
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: windows-provisioner
spec:
  consolidation:
    enabled: false
  limits:
    resources:
      cpu: 200
  labels:
    app: workflow
  ttlSecondsAfterEmpty: 300
  taints:
    - key: "company.io/workflow"
      value: "true"
      effect: "NoSchedule"
  requirements:
    - key: "karpenter.k8s.aws/instance-category"
      operator: In
      values: ["c", "m", "r"]
    - key: "karpenter.k8s.aws/instance-cpu"
      operator: In
      values: ["4", "8", "16", "32", "48", "64"]
    - key: "karpenter.k8s.aws/instance-generation"
      operator: Gt
      values: ["4"]
    - key: "topology.kubernetes.io/zone"
      operator: In
      values: ["us-east-1a", "us-east-1b"]
    - key: "kubernetes.io/arch"
      operator: In
      values: ["amd64"]
    - key: "karpenter.sh/capacity-type" 
      operator: In
      values: ["on-demand"] 
    - key: kubernetes.io/os
      operator: In
      values: ["windows"]

  providerRef:
    name: windows

I cannot provide the windows application as it entails business logic.

I couldn't find anything in the Karpenter documentation that states this is normal behavior, and I hope for some clarity here.

Versions:

k8s-triage-robot commented 6 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

sylr commented 6 months ago

/remove-lifecycle stale

Bryce-Soghigian commented 6 months ago

Maybe you would have to implement some sort of exponential backoff per Provisioner to prevent this endless cycle of provisioning nodes that will always come up as NotReady.

Cluster Autoscaler has the concepts of IsClusterHealthy and IsNodegroupHealthy, alongside the ok-total-unready-count and max-unready-percentage flags, to control the threshold at which IsClusterHealthy should be triggered.

IsClusterHealthy blocks autoscaling until the health issue resolves. I'm not convinced that karpenter-core is the right place to solve provisioning failures with this type of IsClusterHealthy concept, but it is worth mentioning. CAS has historically dealt with a lot of bug reports for this very blocking behavior, and Karpenter NodePools are not of a single type, so broken GPU provisioning shouldn't block all other instance types.

Instead, it might make sense for provisioning backoff to live inside the cloud provider and leverage the unavailable-offerings cache inside of Karpenter. If a given SKU and node image has failed x times for a nodeclaim, we add it to the unavailable-offerings cache, then let the entry expire and retry that permutation later. (This pattern would work with Azure; I would have to read through the AWS pattern on this.)

It would be much better not to block all provisioning for a given NodePool, and instead to do it per instance type, as in the sketch below.
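
A sketch of that per-offering backoff idea, keyed by instance type and image with a TTL-based expiry (the type, method names, and keying are illustrative, not the actual unavailable-offerings cache):

package offerings

import (
    "sync"
    "time"
)

// unavailableOfferings records instance type + image combinations that
// recently failed to become Ready; they are skipped until the entry expires,
// instead of pausing the whole NodePool.
type unavailableOfferings struct {
    mu      sync.Mutex
    ttl     time.Duration
    entries map[string]time.Time // key -> time the entry expires
}

func newUnavailableOfferings(ttl time.Duration) *unavailableOfferings {
    return &unavailableOfferings{ttl: ttl, entries: map[string]time.Time{}}
}

func (u *unavailableOfferings) MarkUnavailable(instanceType, image string) {
    u.mu.Lock()
    defer u.mu.Unlock()
    u.entries[instanceType+"/"+image] = time.Now().Add(u.ttl)
}

func (u *unavailableOfferings) IsUnavailable(instanceType, image string) bool {
    u.mu.Lock()
    defer u.mu.Unlock()
    expiry, ok := u.entries[instanceType+"/"+image]
    if !ok {
        return false
    }
    if time.Now().After(expiry) {
        delete(u.entries, instanceType+"/"+image) // entry expired; retry allowed
        return false
    }
    return true
}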

Node Auto Repair General Notes From my AKS Experience

I was the engineer who built the AKS node auto-repair framework. Some notes based on that experience:

The expectation generally is that Cluster Autoscaler garbage-collects unready nodes after 20 minutes (the max-total-unready-time flag).

Separately from CAS Lifecycle, AKS will attempt 3 autohealing actions on the node that is not ready.

  1. Restart the VM: useful for rebooting the kubelet, etc.
  2. Reimage the VM: solves corrupted states, etc.
  3. Redeploy the VM: solves problems caused by a host-level error.

These actions fix many customer nodes each day, but it would be good to unify the autoscaler's repair attempts with the remediator.

While I am all for moving node lifecycle actions from other places into Karpenter, it would have to be solved via cloud-provider APIs. The remediation actions defined by one cloud provider may not have an equivalent in another cloud provider, so we would have to design that relationship carefully.

1ms-ms commented 4 months ago

@Bryce-Soghigian I'm not an expert, but these three actions you mentioned

  1. Restart the VM: useful for rebooting the kubelet, etc.
  2. Reimage the VM: solves corrupted states, etc.
  3. Redeploy the VM: solves problems caused by a host-level error.

seem possible to implement via the SDKs of all major cloud providers. From the thread I can't tell what the obstacle is right now, especially since rebooting/terminating would solve most of the problems with an unresponsive kubelet.
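
For example, with the AWS SDK for Go v2 the reboot and replace actions map onto two EC2 calls; EC2 has no direct analogue of an Azure reimage, so terminating and re-provisioning stands in for it here (a sketch under those assumptions, not Karpenter code):

package repair

import (
    "context"

    "github.com/aws/aws-sdk-go-v2/config"
    "github.com/aws/aws-sdk-go-v2/service/ec2"
)

// rebootOrReplace either reboots a wedged instance or terminates it so the
// node lifecycle controller provisions a substitute.
func rebootOrReplace(ctx context.Context, instanceID string, replace bool) error {
    cfg, err := config.LoadDefaultConfig(ctx)
    if err != nil {
        return err
    }
    client := ec2.NewFromConfig(cfg)

    if replace {
        // Terminate and let a replacement be provisioned.
        _, err = client.TerminateInstances(ctx, &ec2.TerminateInstancesInput{
            InstanceIds: []string{instanceID},
        })
        return err
    }
    // A reboot is often enough to recover a wedged kubelet or containerd.
    _, err = client.RebootInstances(ctx, &ec2.RebootInstancesInput{
        InstanceIds: []string{instanceID},
    })
    return err
}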

k8s-triage-robot commented 1 month ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

jessebye commented 1 month ago

/remove-lifecycle stale

ibalat commented 1 week ago

Hi, I guess this issue is related to this topic. I can provide any logs needed for debugging: https://github.com/kubernetes-sigs/karpenter/issues/1573

tculp commented 1 week ago

Another use case is to recover when a node runs out of memory and goes down, never to come up again without manual intervention.