kubernetes-sigs / karpenter

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.
Apache License 2.0

Karpenter stops creating new nodes b/c of broken nodeclaims #1767

Closed sergii-auctane closed 1 month ago

sergii-auctane commented 1 month ago

Description

We have 5 different node pools, 2 of which are dedicated to jobs; we run almost all of our jobs in those node pools. When there are no job pods, those node pools have no nodes at all, so nodes in them rotate continuously. Karpenter can create up to 20 nodes, then terminate them, create another 10 nodes within 5 minutes and terminate 5 of them, then create another 10, and so on.

Observed Behavior: Sometimes, Karpenter gets stuck and stops creating nodes because of broken NodeClaims (screenshot attached):

{"level":"ERROR","time":"2024-10-22T12:54:57.842Z","logger":"controller","caller":"controller/controller.go:261","message":"Reconciler error","commit":"5bdf9c3","controller":"nodeclaim.termination","controllerGroup":"karpenter.sh","controllerKind":"NodeClaim","NodeClaim":{"name":"eks-stage-linux-jobs-amd64-crfm6"},"namespace":"","name":"eks-stage-linux-jobs-amd64-crfm6","reconcileID":"cf1f7f5e-ea69-4397-9e61-8e180c468e64","error":"removing termination finalizer, NodeClaim.karpenter.sh \"eks-stage-linux-jobs-amd64-crfm6\" is invalid: spec: Invalid value: \"object\": spec is immutable"}
{"level":"ERROR","time":"2024-10-22T12:55:57.852Z","logger":"controller","caller":"controller/controller.go:261","message":"Reconciler error","commit":"5bdf9c3","controller":"nodeclaim.termination","controllerGroup":"karpenter.sh","controllerKind":"NodeClaim","NodeClaim":{"name":"eks-stage-linux-jobs-amd64-crfm6"},"namespace":"","name":"eks-stage-linux-jobs-amd64-crfm6","reconcileID":"2f7d0952-2db3-483f-bf0b-a378e37fd567","error":"removing termination finalizer, NodeClaim.karpenter.sh \"eks-stage-linux-jobs-amd64-crfm6\" is invalid: spec: Invalid value: \"object\": spec is immutable"}
{"level":"ERROR","time":"2024-10-22T12:56:57.862Z","logger":"controller","caller":"controller/controller.go:261","message":"Reconciler error","commit":"5bdf9c3","controller":"nodeclaim.termination","controllerGroup":"karpenter.sh","controllerKind":"NodeClaim","NodeClaim":{"name":"eks-stage-linux-jobs-amd64-crfm6"},"namespace":"","name":"eks-stage-linux-jobs-amd64-crfm6","reconcileID":"1fa64afe-bd32-44ae-870b-ef97ecdfb14b","error":"removing termination finalizer, NodeClaim.karpenter.sh \"eks-stage-linux-jobs-amd64-crfm6\" is invalid: spec: Invalid value: \"object\": spec is immutable"}

It starts creating new nodes again after I remove the finalizers on the broken NodeClaim manually. I run Karpenter in 5 clusters, and this issue appears randomly in some of them. I hoped that the fix mentioned in this issue would help, but it didn't.
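
For reference, the manual workaround is just a merge patch that empties the finalizer list, e.g. for the NodeClaim from the logs above:

kubectl patch nodeclaim eks-stage-linux-jobs-amd64-crfm6 \
  --type=merge -p '{"metadata":{"finalizers":[]}}'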

As a temporary fix, I have a job that removes the finalizers from the broken NodeClaims, but this job deliberately skips NodeClaims with a null value in the NODE field, since new NodeClaims don't have a node name there yet. So I might need to extend it to also remove finalizers from NodeClaims that are older than 10 minutes and still have no node assigned (a rough sketch of that extension follows the manifest below).

resources:
- apiVersion: batch/v1
  kind: CronJob
  metadata:
    name: nodeclaim-cleaner
  spec:
    schedule: "*/10 * * * *" # Every 10 minutes
    jobTemplate:
      spec:
        template:
          spec:
            serviceAccountName: nodeclaim-job-sa
            containers:
              - name: nodeclaim-cleaner
                image: bitnami/kubectl:1.29.9
                command:
                  - /bin/bash
                  - -c
                  - |
                    #!/bin/bash

                    # Get all NodeClaims and their associated node names
                    nodeclaims_and_nodes=$(kubectl get nodeclaim -o json | jq -r '.items[] | "\(.metadata.name) \(.status.nodeName)"')
                    echo "ALL PAIRS OF NODECLAIMS AND NODES:"
                    echo "$nodeclaims_and_nodes"  # one "nodeclaim node" pair per line
                    # Get all nodes' names
                    nodes=$(kubectl get nodes -o json | jq -r '.items[].metadata.name')

                    # Filter NodeClaims that are not associated with any existing nodes and do not have a null nodeName
                    filtered_nodeclaims=$(echo "$nodeclaims_and_nodes" | grep -vFf <(echo "$nodes") | grep -v null | awk '{print $1}')
                    echo "FILTERED NODECLAIMS:"
                    echo "$filtered_nodeclaims"

                    # Remove the finalizers from the filtered NodeClaims
                    echo "REMOVING FINALIZERS FROM NODECLAIMS:"
                    for nodeclaim in $filtered_nodeclaims; do
                      kubectl patch nodeclaim "$nodeclaim" -p '{"metadata":{"finalizers":[]}}' --type=merge
                    done
            restartPolicy: OnFailure
- apiVersion: rbac.authorization.k8s.io/v1
  kind: ClusterRole
  metadata:
    name: nodeclaim-manager
  rules:
    - apiGroups: [ "karpenter.sh" ]
      resources: [ "nodeclaims" ]
      verbs: [ "get", "patch", "list" ]
    - apiGroups: [""]
      resources: ["nodes"]
      verbs: ["get", "list"]
- apiVersion: rbac.authorization.k8s.io/v1
  kind: ClusterRoleBinding
  metadata:
    name: nodeclaim-manager-binding
  subjects:
    - kind: ServiceAccount
      name: nodeclaim-job-sa
      namespace: {{ .Release.Namespace }}
  roleRef:
    apiGroup: rbac.authorization.k8s.io
    kind: ClusterRole
    name: nodeclaim-manager
- apiVersion: v1
  kind: ServiceAccount
  metadata:
    name: nodeclaim-job-sa
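
The age-based extension would look roughly like this (the 10-minute cutoff is arbitrary, and it assumes GNU date is available in the image, which the Debian-based bitnami/kubectl image should provide):

# Hypothetical extension: also clear finalizers on NodeClaims that are older
# than 10 minutes and still have no node assigned (nodeName is null).
cutoff=$(date -u -d '-10 minutes' +%Y-%m-%dT%H:%M:%SZ)
stale_unassigned=$(kubectl get nodeclaim -o json | jq -r --arg cutoff "$cutoff" \
  '.items[] | select(.status.nodeName == null and .metadata.creationTimestamp < $cutoff) | .metadata.name')
for nodeclaim in $stale_unassigned; do
  kubectl patch nodeclaim "$nodeclaim" -p '{"metadata":{"finalizers":[]}}' --type=merge
done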

Expected Behavior: I expected that I would not need a job that removes finalizers from broken NodeClaims, and that Karpenter would do it itself. Or, at the very least, that it would keep creating new nodes when the issue happens.

Reproduction Steps (Please include YAML): I don't know how to reproduce it; the most obvious way is to create a cluster with a NodePool and run several CronJobs that bring up 2k-9k pods at minutes 0, 15, and 30 to force node rotation (a rough sketch follows).
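
A rough sketch of such a load generator (the NodePool selector, pod count, image, and resource requests below are placeholders, not a verified reproduction):

apiVersion: batch/v1
kind: CronJob
metadata:
  name: burst-load
spec:
  schedule: "0,15,30 * * * *"      # bursts at minutes 0, 15 and 30
  jobTemplate:
    spec:
      parallelism: 2000            # size the burst large enough to force scale-up
      completions: 2000
      template:
        spec:
          nodeSelector:
            karpenter.sh/nodepool: jobs-nodepool   # placeholder NodePool name
          containers:
            - name: sleep
              image: public.ecr.aws/docker/library/busybox:1.36
              command: ["sleep", "120"]
              resources:
                requests:
                  cpu: "1"
          restartPolicy: Never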

Versions:

k8s-ci-robot commented 1 month ago

This issue is currently awaiting triage.

If Karpenter contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.
jmdeal commented 1 month ago

> Chart Version: 1.0.0, but karpenter image version is 1.0.6

The commit in your logs, 5bdf9c3, indicates that you're still running v1.0.0, not v1.0.6. How did you go about updating the controller?
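
One quick way to check what the controller pods are actually running (the deployment name and namespace here are assumptions, adjust them to your install):

kubectl -n karpenter get deployment karpenter \
  -o jsonpath='{.spec.template.spec.containers[0].image}'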

sergii-auctane commented 1 month ago

@jmdeal By setting the image tag to 1.0.6:

spec:
...
  containers:
  - env:
     ...
    - name: NEW_RELIC_METADATA_KUBERNETES_CONTAINER_IMAGE_NAME
      value: public.ecr.aws/karpenter/controller:1.0.6@sha256:1eb1073b9f4ed804634aabf320e4d6e822bb61c0f5ecfd9c3a88f05f1ca4c5c5
    ...
    image: public.ecr.aws/karpenter/controller:1.0.6@sha256:1eb1073b9f4ed804634aabf320e4d6e822bb61c0f5ecfd9c3a88f05f1ca4c5c5
    imagePullPolicy: IfNotPresent

UPD: I just realized that a digest, sha256:1eb1073b9f4ed804634aabf320e4d6e822bb61c0f5ecfd9c3a88f05f1ca4c5c5, is pinned as well, and it hasn't changed.
I set controller.image.digest to null. I assume we can close this then.
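
For reference, the relevant chart values look roughly like this (field names as in the Karpenter chart); when a digest is set, it effectively wins over the tag, so it has to be cleared or updated together with the tag:

controller:
  image:
    repository: public.ecr.aws/karpenter/controller
    tag: "1.0.6"
    digest: null   # clear the pinned digest so the tag above is what actually runs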

sergii-auctane commented 1 month ago

Thanks!

jmdeal commented 1 month ago

I would recommend upgrading the chart and not just the image. The versions are coupled and there can be changes to the chart on patch versions (we've made a few updates on 1.0 w.r.t. the conversion webhooks). This may work fine, but I'd upgrade the entire chart as a best practice.
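
For example, upgrading the chart itself rather than overriding the image (release name, namespace, and values file below are assumptions; the chart location is the one from the Karpenter docs):

helm upgrade --install karpenter oci://public.ecr.aws/karpenter/karpenter \
  --version 1.0.6 \
  --namespace kube-system \
  -f values.yaml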

sergii-auctane commented 1 month ago

Thanks. I wasn't able to find it initially, but then realised it was released not under the 1.0.6 tag or the main branch, but in the release-v1.0.6 branch.