kubernetes-sigs / karpenter

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.
Apache License 2.0

Race condition with node startupTaints being restored #1772

Open dpiddock opened 2 days ago

dpiddock commented 2 days ago

Description

Observed Behavior: Karpenter restores the startupTaints on a node if they are removed too quickly during node startup. This leaves the node unusable. The node also never reaches a ready state, so Karpenter refuses to remove it: "Cannot disrupt Node: state node isn't initialized"

From AWS CloudWatch Logs Insights:

Expected Behavior: Karpenter updates the existing taints on a node to remove karpenter.sh/unregistered:NoExecute, without restoring startup taints that other controllers have already removed.

Reproduction Steps (Please include YAML): This is an unpredictable race condition that is nearly impossible to reproduce on demand. It might be related to this code: https://github.com/rschalo/karpenter/blob/a652a4aa95dbe92159bb273a3b64ff8837d92660/pkg/controllers/nodeclaim/lifecycle/registration.go#L87
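To illustrate the suspected sequence, here is a minimal, self-contained Go sketch. It is not Karpenter's code: the Taint type, the withoutKey helper, and the example.com/agent-not-ready startup taint are hypothetical stand-ins, and the assumption is that the registration step computes the node's new taints from an earlier snapshot rather than from the taints currently on the node.

```go
package main

import "fmt"

// Taint is a simplified, hypothetical stand-in for a Kubernetes node taint.
type Taint struct {
	Key    string
	Effect string
}

// unregisteredKey is the taint Karpenter is expected to remove at registration.
const unregisteredKey = "karpenter.sh/unregistered"

// withoutKey returns a copy of taints with every taint matching key removed.
// Which list is passed in decides whether the race shows up.
func withoutKey(taints []Taint, key string) []Taint {
	out := make([]Taint, 0, len(taints))
	for _, t := range taints {
		if t.Key != key {
			out = append(out, t)
		}
	}
	return out
}

func main() {
	// Snapshot taken early in node startup (assumed): both taints are present.
	snapshot := []Taint{
		{Key: unregisteredKey, Effect: "NoExecute"},
		{Key: "example.com/agent-not-ready", Effect: "NoSchedule"}, // hypothetical startup taint
	}

	// Meanwhile another controller has already removed its startup taint,
	// so the live node only carries the unregistered taint.
	live := []Taint{
		{Key: unregisteredKey, Effect: "NoExecute"},
	}

	// Suspected behavior: the full taint list is rebuilt from the stale
	// snapshot and written back, which re-adds the startup taint.
	stale := withoutKey(snapshot, unregisteredKey)

	// Expected behavior: only the unregistered taint is dropped from the
	// taints that are on the node right now.
	fresh := withoutKey(live, unregisteredKey)

	fmt.Println("written from stale snapshot:", stale)
	fmt.Println("written from live taints:   ", fresh)
}
```

In this sketch the stale write still contains the startup taint while the live write is empty, which matches the observed symptom: nothing will remove the startup taint a second time, so the node stays uninitialized.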

Versions:

k8s-ci-robot commented 2 days ago

This issue is currently awaiting triage.

If Karpenter contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.