kubernetes-sigs / karpenter

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.

Failure to delete NodeClaim (Reconciler error) #1578

Open artem-nefedov opened 4 weeks ago

artem-nefedov commented 4 weeks ago

Description

Observed Behavior:

While deleting NodePools, one of the NodeClaims got stuck and couldn't be deleted. The Node itself does get deleted and no longer appears in get nodes output, and the EC2 instance is terminated.

Controller logs show an error:

{"level":"ERROR","time":"2024-08-16T19:21:14.386Z","logger":"controller","message":"Reconciler error","commit":"5bdf9c3","controller":"nodeclaim.termination","controllerGroup":"karpenter.sh","controllerKind":"NodeClaim","NodeClaim":{"name":"system-6ss2w"},"namespace":"","name":"system-6ss2w","reconcileID":"ea1cb048-d9a1-41b9-95c6-2c89fca2402e","error":"removing termination finalizer, NodeClaim.karpenter.sh \"system-6ss2w\" is invalid: spec: Invalid value: \"object\": spec is immutable"}

Besides this one particular NodeClaim, other NodeClaims were deleted successfully.

Note: Karpenter is installed with the webhook disabled.

Expected Behavior:

NodeClaim is deleted.

Reproduction Steps (Please include YAML):

  1. Create the following nodeclass and 2 nodepools:
---
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: myorg-default
spec:
  amiSelectorTerms:
  - alias: bottlerocket@latest
  role: KarpenterNodeRole-${CLUSTER_NAME}
  subnetSelectorTerms:
  - tags:
      karpenter.sh/discovery: ${CLUSTER_NAME}
      kubernetes.io/role/internal-elb: "1"
  securityGroupSelectorTerms:
  - tags:
      aws:eks:cluster-name: ${CLUSTER_NAME}
  tags:
    karpenter.sh/discovery: ${CLUSTER_NAME}
---
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: apps
spec:
  disruption:
    consolidateAfter: 30s
    consolidationPolicy: WhenEmptyOrUnderutilized
  template:
    spec:
      expireAfter: 720h
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: myorg-default
      requirements:
      - key: kubernetes.io/os
        operator: In
        values:
        - linux
      - key: kubernetes.io/arch
        operator: In
        values:
        - arm64
        - amd64
      - key: karpenter.k8s.aws/instance-memory
        operator: Gt
        values:
        - "1024"
      - key: karpenter.k8s.aws/instance-generation
        operator: Gt
        values:
        - "2"
---
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: system
spec:
  disruption:
    consolidateAfter: 30s
    consolidationPolicy: WhenEmptyOrUnderutilized
  template:
    metadata:
      labels:
        myorg.com/system: shared
    spec:
      expireAfter: 720h
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: myorg-default
      requirements:
      - key: kubernetes.io/os
        operator: In
        values:
        - linux
      - key: kubernetes.io/arch
        operator: In
        values:
        - arm64
        - amd64
      - key: karpenter.k8s.aws/instance-memory
        operator: Gt
        values:
        - "1024"
      - key: karpenter.k8s.aws/instance-generation
        operator: Gt
        values:
        - "2"
      - key: karpenter.k8s.aws/instance-memory
        operator: Gt
        values:
        - "2048"
      - key: karpenter.k8s.aws/instance-cpu
        operator: Gt
        values:
        - "1"
      taints:
      - effect: NoSchedule
        key: myorg.com/system
        value: "true"
  2. Deploy some apps that use both nodepools (including daemonsets).
  3. Delete the nodepools and nodeclass (see the sketch below for spotting claims that get stuck).
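
A minimal sketch for spotting claims stuck this way (any NodeClaim whose deletionTimestamp persists while its Node is gone is suspect):

kubectl get nodeclaim -o json | jq -r '.items[] | select(.metadata.deletionTimestamp != null) | .metadata.name'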

Versions:

k8s-ci-robot commented 4 weeks ago

This issue is currently awaiting triage.

If Karpenter contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.
jigisha620 commented 4 weeks ago

Hi @artem-nefedov, were you able to check the status conditions on the NodeClaim that did not get deleted? Can you share the NodeClaim that was not deleted?

artem-nefedov commented 4 weeks ago

@jigisha620 Sure, I still have it in this state for now.

apiVersion: karpenter.sh/v1
kind: NodeClaim
metadata:
  annotations:
    compatibility.karpenter.k8s.aws/cluster-name-tagged: "true"
    compatibility.karpenter.k8s.aws/kubelet-drift-hash: "15379597991425564585"
    karpenter.k8s.aws/ec2nodeclass-hash: "14860904998044214408"
    karpenter.k8s.aws/ec2nodeclass-hash-version: v3
    karpenter.k8s.aws/tagged: "true"
    karpenter.sh/nodepool-hash: "2979110132923797022"
    karpenter.sh/nodepool-hash-version: v3
  creationTimestamp: "2024-08-16T16:15:45Z"
  deletionGracePeriodSeconds: 0
  deletionTimestamp: "2024-08-16T19:17:08Z"
  finalizers:
  - karpenter.sh/termination
  generateName: system-
  generation: 2
  labels:
    myorg.com/system: shared
    karpenter.k8s.aws/instance-category: t
    karpenter.k8s.aws/instance-cpu: "2"
    karpenter.k8s.aws/instance-cpu-manufacturer: amd
    karpenter.k8s.aws/instance-ebs-bandwidth: "2085"
    karpenter.k8s.aws/instance-encryption-in-transit-supported: "false"
    karpenter.k8s.aws/instance-family: t3a
    karpenter.k8s.aws/instance-generation: "3"
    karpenter.k8s.aws/instance-hypervisor: nitro
    karpenter.k8s.aws/instance-memory: "4096"
    karpenter.k8s.aws/instance-network-bandwidth: "256"
    karpenter.k8s.aws/instance-size: medium
    karpenter.sh/capacity-type: spot
    karpenter.sh/nodepool: system
    kubernetes.io/arch: amd64
    kubernetes.io/os: linux
    node.kubernetes.io/instance-type: t3a.medium
    topology.k8s.aws/zone-id: euc1-az3
    topology.kubernetes.io/region: eu-central-1
    topology.kubernetes.io/zone: eu-central-1b
  name: system-6ss2w
  ownerReferences:
  - apiVersion: karpenter.sh/v1
    blockOwnerDeletion: true
    kind: NodePool
    name: system
    uid: 89ecb70a-b2c1-43a7-9025-9f8dbc32f0fd
  resourceVersion: "2305076"
  uid: c176449e-6aea-40ca-9e82-0eee021f86a5
spec:
  expireAfter: 720h
  nodeClassRef:
    kind: EC2NodeClass
    name: myorg-default
  requirements:
  - key: karpenter.k8s.aws/instance-generation
    operator: Gt
    values:
    - "2"
  - key: node.kubernetes.io/instance-type
    operator: In
    values:
    - c3.large
    - c4.large
    - c5.large
    - c5a.large
    - c5ad.large
    - c5d.large
    - c5n.large
    - c6a.large
    - c6g.large
    - c6g.xlarge
    - c6gd.large
    - c6gn.large
    - c6i.large
    - c6id.large
    - c6in.large
    - c7a.large
    - c7g.large
    - c7gd.large
    - c7i.large
    - i3.large
    - m3.large
    - m4.large
    - m5.large
    - m5a.large
    - m5ad.large
    - m5d.large
    - m5dn.large
    - m5n.large
    - m6a.large
    - m6g.large
    - m6g.xlarge
    - m6gd.large
    - m6i.large
    - m6id.large
    - m6in.large
    - m7a.large
    - m7g.large
    - m7gd.large
    - m7i-flex.large
    - m7i.large
    - r4.large
    - r5.large
    - r5a.large
    - r5ad.large
    - r5d.large
    - r5n.large
    - r6a.large
    - r6g.large
    - r6gd.large
    - r6i.large
    - r6in.large
    - r7g.large
    - r7gd.large
    - r8g.large
    - t3.large
    - t3.medium
    - t3a.large
    - t3a.medium
    - t4g.large
    - t4g.medium
  - key: karpenter.k8s.aws/instance-cpu
    operator: Gt
    values:
    - "1"
  - key: myorg.com/system
    operator: In
    values:
    - shared
  - key: karpenter.sh/nodepool
    operator: In
    values:
    - system
  - key: kubernetes.io/os
    operator: In
    values:
    - linux
  - key: kubernetes.io/arch
    operator: In
    values:
    - amd64
    - arm64
  - key: karpenter.k8s.aws/instance-memory
    operator: Gt
    values:
    - "2048"
  resources:
    requests:
      cpu: 350m
      memory: 500Mi
      pods: "4"
  taints:
  - effect: NoSchedule
    key: myorg.com/system
    value: "true"
status:
  allocatable:
    cpu: 1930m
    ephemeral-storage: 17Gi
    memory: 3246Mi
    pods: "17"
  capacity:
    cpu: "2"
    ephemeral-storage: 20Gi
    memory: 3788Mi
    pods: "17"
  conditions:
  - lastTransitionTime: "2024-08-16T16:25:48Z"
    message: ""
    reason: ConsistentStateFound
    status: "True"
    type: ConsistentStateFound
  - lastTransitionTime: "2024-08-16T16:17:18Z"
    message: ""
    reason: Initialized
    status: "True"
    type: Initialized
  - lastTransitionTime: "2024-08-16T19:17:12Z"
    message: ""
    reason: InstanceTerminating
    status: "True"
    type: InstanceTerminating
  - lastTransitionTime: "2024-08-16T16:15:48Z"
    message: ""
    reason: Launched
    status: "True"
    type: Launched
  - lastTransitionTime: "2024-08-16T16:17:18Z"
    message: ""
    reason: Ready
    status: "True"
    type: Ready
  - lastTransitionTime: "2024-08-16T16:16:09Z"
    message: ""
    reason: Registered
    status: "True"
    type: Registered
  imageID: ami-0db56ae3f82f8a96b
  lastPodEventTime: "2024-08-16T19:17:09Z"
  nodeName: ip-192-168-140-0.eu-central-1.compute.internal
  providerID: aws:///eu-central-1b/i-0c4d08973be5147dd

The ec2nodeclass is also stuck and not deleted yet (but all nodepools are gone).

artem-nefedov commented 4 weeks ago

I tried the whole procedure again, and this time I got 4 out of 6 NodeClaims stuck in this state (while all actual Nodes were deleted). On the previous attempt, it was 1 out of 6. It looks like the problem happens very often with v1. I never saw this problem on Karpenter 0.37.0 with the v1beta1 API.

Update: I noticed that on the second attempt, all NodeClaims have the v1beta1 apiVersion, despite the NodePool being v1. This was not the case on the first attempt, as seen above. This is very confusing. My guess is that it may be related to Karpenter being installed with the conversion webhook disabled.

jigisha620 commented 3 weeks ago

Did you apply the v1 Karpenter CRDs when you migrated to v1.0.0?

artem-nefedov commented 3 weeks ago

Did you apply the v1 Karpenter CRDs when you migrated to v1.0.0?

Yes. It's a clean install of 1.0.0 helm chart on a new cluster, not an upgrade.

jigisha620 commented 3 weeks ago

Can you check the storage version for nodeClaims using kubectl get crd nodeclaims.karpenter.sh -o jsonpath='{.spec.versions[?(@.storage==true)].name}'?

artem-nefedov commented 3 weeks ago

@jigisha620 Sorry, the clusters where this happened are gone now. I've tried to reproduce the issue from scratch on new clusters, but no success so far.

virtualdom commented 3 weeks ago

@jigisha620 I'm running into this issue as well. In our case, we have over 800 NodeClaims failing to delete.

When I check the storage version for nodeClaims, I'm seeing this

❯ kubectl get crd nodeclaims.karpenter.sh -o jsonpath='{.spec.versions[?(@.storage==true)].name}'
v1%
jigisha620 commented 3 weeks ago

Hi @virtualdom, are you seeing similar logs that say "error":"removing termination finalizer", which is preventing nodeClaim termination? Did you upgrade from v0.37.0 to v1.0.0 or was it a clean install?

virtualdom commented 3 weeks ago

Hi @jigisha620 yes, I'm seeing logs like that. Here's a sample

removing termination finalizer, NodeClaim.karpenter.sh "general-4bzbb" is invalid: spec: Invalid value: "object": spec is immutable

We upgraded from v0.35.6 to v1.0.0.

We're also seeing that abandoned NodeClaims are having this added to their events.

Normal DisruptionBlocked 48s (x5 over 9m8s) karpenter Cannot disrupt NodeClaim: state node doesn't contain both a node and a nodeclaim

quercus-carsten commented 3 weeks ago

Hi, we also ran into this issue. I migrated from 0.37.0 to 1.0.0.

We also took the steps documented in the upgrade guide. We're using ArgoCD for deployment, so we first updated the CRDs, then continued with Karpenter, then pushed the updated resources according to the documentation.

Now I am also stuck with nodeclaims showing event msg: "Cannot disrupt NodeClaim: state node doesn't contain both a node and a nodeclaim"

What I also see is that all affected nodeclaims have invalid specs.

"Required value, spec.nodeClassRef.kind: Required value, <nil>: Invalid value: \"null\": some validation rules were not checked because the object was invalid; correct the existing errors to complete validation]"

Since nodeclaims are now immutable, I can neither patch the resources to make them valid nor remove the finalizer to fix the situation. I checked in the AWS console and the EC2 instances no longer exist.

Last but not least: the post-install hook migrates resources for you, but ArgoCD goes ahead and tries to reverse these changes. Maybe there should be a way to choose whether I want/need this migration.

simon-wessel commented 3 weeks ago

We mitigated the problem by disabling ArgoCD auto sync, editing the CRD and temporarily removing the "self == oldSelf" immutability rule. After that we manually added the nodeClassRef.group and nodeClassRef.kind values. The NodeClaims then disappeared, as they were already scheduled for deletion.
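
Roughly, for anyone repeating this (a sketch only; the exact location of the rule inside the CRD schema may differ between chart versions, so inspect before editing):

# show every x-kubernetes-validations block in the NodeClaim CRD
kubectl get crd nodeclaims.karpenter.sh -o json | jq '[.. | objects | select(has("x-kubernetes-validations")) | .["x-kubernetes-validations"]]'

# then delete the 'self == oldSelf' entry by hand
kubectl edit crd nodeclaims.karpenter.sh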

virtualdom commented 3 weeks ago

@jigisha620 are there any adverse side-effects to disabling immutability on NodeClaim specs? When I manually edit the Helm chart to remove the self == oldSelf requirement, the NodeClaim deletions work as expected.

virtualdom commented 2 weeks ago

In my case, spec.nodeClassRef seems to be set properly and isn't an issue for me.

One thing I've tried doing is removing the immutability requirement from spec and adding the requirement to each property of the spec instead. In my case, resources is the offending property.

I deployed a custom Karpenter image with a few Printlns added, and I'm seeing the following

  1. If I Get the exact NodeClaim that's about to have its finalizer removed and print it, its spec looks identical to the payload that's being passed into the Update call
  2. When I remove the immutability requirement altogether, the Update call is successful, and when I print the output of the Update call, I see that the memory object differs like so
original payload : memory:{{4674144000 0} {<nil>}          DecimalSI}
`Update` output  : memory:{{4674144 3}    {<nil>} 4674144k DecimalSI}

Could this difference be causing the resources property to be rejected by the immutability rule?
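
For what it's worth, the two renderings are numerically identical; only the serialized form differs. The DecimalSI "k" suffix means x1000, so a trivial sanity check in the shell:

# "4674144k" in DecimalSI is 4674144 x 1000:
echo $((4674144 * 1000))   # prints 4674144000, the original payload's value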

jigisha620 commented 2 weeks ago

Hi @virtualdom, What was the memory request on the workload that you were trying to run? I am trying to reproduce this issue on my end.

woehrl01 commented 1 week ago

This issue has been quite disruptive, especially with scaling the cluster. I’m more than happy to help however I can - if there’s any additional info or context you’d need from my side, please feel free to let me know!

Meanwhile, would modifying the CRD to temporarily remove the validation be a good workaround, or do you have other recommendations?

kamilaz commented 1 week ago

I have the same problem with node termination: "error":"removing termination finalizer, NodeClaim.karpenter.sh \"default-qg6vn\" is invalid: spec: Invalid value: \"object\": spec is immutable"}. I migrated from 0.37.1 to 1.0.1, and this error appears on NodeClaims that existed before the migration.

kubectl get crd nodeclaims.karpenter.sh -o jsonpath='{.spec.versions[?(@.storage==true)].name}'
v1%

virtualdom commented 1 week ago

@jigisha620 sorry for the delay -- here's one example of a c6a.xlarge node in my K8s 1.28 cluster where this is happening

Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests         Limits
  --------           --------         ------
  cpu                725m (18%)       2600m (66%)
  memory             4662250Ki (67%)  5817322Ki (84%)
  ephemeral-storage  0 (0%)           0 (0%)
  hugepages-1Gi      0 (0%)           0 (0%)
  hugepages-2Mi      0 (0%)           0 (0%)
jigisha620 commented 1 week ago

@virtualdom

are there any adverse side-effects to disabling immutability on NodeClaim specs? When I manually edit the Helm chart to remove the self == oldSelf requirement, the NodeClaim deletions work as expected.

There are no adverse effects of doing this. self == oldSelf basically enforces that the nodeClaim spec is immutable; Karpenter would not react to any change in it anyway.

youwalther65 commented 1 week ago

I just wanted to add here that the event on a K8s node

$ kubectl describe node ip-<redacted>.eu-west-1.compute.internal
...
Events:
  Cannot disrupt NodeClaim: state node doesn't contain both a node and a nodeclaim

is visible for me for non-Karpenter-managed nodes (I have to correct myself, because my previous comment was wrong). I am running v1.0.1, upgraded from a clean v1.0.0 install, with no modification of the CRDs. The message is raised in the source code in statenode.go.

woehrl01 commented 1 week ago

@youwalther65 I see this message for nodes which are not managed by Karpenter. Could this be the case for you as well? It's still strange to have this message at all for non-Karpenter-managed nodes.

youwalther65 commented 1 week ago

@woehrl01 Sorry, my mistake, I described the wrong node. But you are right, good point - I can confirm as well that all other non-Karpenter-managed nodes (Managed Node Group based) see this event a lot, for example (x1283 over 42h).

virtualdom commented 1 week ago

@jigisha620 another observation, if it helps -- when I go through some of my NodeClaims and specifically look at the memory value, I see this

❯ kubectl get nodeclaim -o custom-columns=:.metadata.name,:.spec.resources.requests.memory,:".metadata.labels.node\.kubernetes\.io/instance-type"

general-25jzg                3774144000     c6a.xlarge
general-27d5b                8068500Ki      m5a.xlarge
general-2bknn                7115375Ki      m5a.xlarge
general-2kntm                4662250Ki      c6a.xlarge
...about 300 more

When I use kubectl delete node to delete a node with trailing zeroes in the memory value (e.g. general-25jzg), the node is deleted but the nodeclaim ends up in a similar state where Karpenter can't remove the finalizer and says the spec is immutable. When I delete a node whose memory value includes a unit suffix (e.g. general-27d5b), everything is deleted cleanly, including the nodeclaim.

Could there be issues with Karpenter failing to standardize .spec.resources.requests values on NodeClaim creation?
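
In case it helps others triage, a sketch for listing NodeClaims whose memory request is a bare number with no unit suffix (the ones that appear to get stuck here):

kubectl get nodeclaim -o json | jq -r '.items[] | select((.spec.resources.requests.memory // "") | tostring | test("^[0-9]+$")) | .metadata.name'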

kamilaz commented 1 week ago

In my case there was a misconfiguration in the nodeclass. Before the migration I had AL2, and during the migration I applied AL2 but with an AMI ID. After updating to al2@latest, nodes were successfully deleted by Karpenter.

sergii-auctane commented 1 week ago

The problem is that Karpenter sometimes doesn't remove nodeclaims after the node is terminated by AWS, for example. I would not care too much, but sometimes it stops creating new nodeclaims and nodes, and starts again only when I remove those nodeclaims manually by removing finalizers. Here is the description from the broken nodeclaim.

Normal  DisruptionBlocked  4m31s (x632 over 39h)  karpenter  Cannot disrupt NodeClaim: state node doesn't contain both a node and a nodeclaim

It started happening after the update to v1.

You can run the command below to find "broken" nodeclaims:

kubectl get nodeclaim -o json | jq -r '.items[] | "\(.metadata.name) \(.status.nodeName)"' | grep -vFf <(kubectl get nodes -o json | jq -r '.items[].metadata.name')

Error from karpenter log:

{"level":"ERROR","time":"2024-09-06T19:24:49.495Z","logger":"controller","caller":"controller/controller.go:261","message":"Reconciler error","commit":"5bdf9c3","controller":"nodeclaim.termination","controllerGroup":"karpenter.sh","controllerKind":"NodeClaim","NodeClaim":{"name":"eks-dev-linux-amd64-rkrj6"},"namespace":"","name":"eks-dev-linux-amd64-rkrj6","reconcileID":"7789c548-8a34-4e80-ac63-9c9a7b859358","error":"removing termination finalizer, NodeClaim.karpenter.sh \"eks-dev-linux-amd64-rkrj6\" is invalid: spec: Invalid value: \"object\": spec is immutable"}
artem-nefedov commented 1 week ago

It looks like this is purely an upgrade issue and doesn't reproduce on clean v1.0.1 installs, so I'm fine with closing this.

woehrl01 commented 1 week ago

@artem-nefedov I highly disagree. This is still a very serious issue. It happens for fresh nodes: as described above, the memory can be represented in a different decimal format, which makes the comparison fail.

artem-nefedov commented 1 week ago

@woehrl01 I'm not closing it yet, but technically there's nothing to "fix" here. The problem does happen for fresh nodes, but only if Karpenter was upgraded from v0.x.x (even if the old version was uninstalled first, because CRDs are not removed). It does not happen on a new cluster with a freshly installed v1.0.1. Maybe a documentation update describing a workaround for the upgrade can be considered a solution. I'm not sure; the team can decide.

engedaam commented 5 days ago

@woehrl01 @virtualdom Do you have the conversion webhooks disabled, and are nodes provisioned with Karpenter v1 getting the spec immutable error? It seems like this issue may not be related to the webhooks but to the storage version that may be set on the CRDs. I was only able to reproduce this issue on the upgrade path.

engedaam commented 5 days ago

/assign engedaam

woehrl01 commented 5 days ago

@engedaam I have the webhook installed, and it's happening on fresh nodes created by 1.0. I have also applied the latest CRDs (e.g. to remove the validation). Is there anything I should try out for further validation?

JoseAlvarezSonos commented 4 days ago

Hello, we are experiencing the same issue. I'm currently testing on a cluster that previously had Karpenter version 0.37.0, with the deployment handled by ArgoCD. We followed the recommendations: we updated the IAM role, deployed the CRDs with the webhook disabled using the karpenter-crds chart, then updated the Karpenter chart (to v1.0.2). We had some issues with the validating webhooks that we fixed by deleting those resources. Then we upgraded our EC2NodeClasses and NodePools from v1beta1 to v1. Finally we enabled the webhooks to make sure that everything was working fine, and we still sporadically see this issue:

removing termination finalizer, NodeClaim.karpenter.sh \"blabla-jmjws\" is invalid: spec: Invalid value: \"object\": spec is immutable

We can't really do a fresh install so we would appreciate a way to fix this or any guidance is welcomed.

I wrote a small script that removes the finalizers for all nodeClaims, and it works: everything goes back to normal for some time (15 or 20 minutes last time I checked) and then we start seeing the issue again.
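
For anyone needing the same stopgap, a sketch of such a cleanup (assuming a finalizer-only patch is admitted, since the error suggests the CEL rule only guards spec; use with care, as it bypasses Karpenter's normal termination flow):

# strip finalizers from every NodeClaim already marked for deletion
for nc in $(kubectl get nodeclaim -o json | jq -r '.items[] | select(.metadata.deletionTimestamp != null) | .metadata.name'); do
  kubectl patch nodeclaim "$nc" --type=json -p '[{"op":"remove","path":"/metadata/finalizers"}]'
done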

JoseAlvarezSonos commented 1 day ago

Hello, a quick update on our side. I tested keeping the existing v1beta1 resources (NodeClasses and NodePools) while duplicating them as v1 with different names, then making sure the workloads moved to the v1 ones by deleting the old nodes, and later deleting the v1beta1 resources (by manually removing the finalizers from those nodeclaims, of course), and it's working. This is practically a "fresh install" of sorts. So I believe there's an issue with how Karpenter handles v1 and v1beta1 resources at the same time; it feels like it can only handle v1 and doesn't handle v1beta1 resources well. I would assume that if there's a webhook in place to change the version of resources, it should maybe also handle something like a rollout of nodes; I'm not sure. Maybe it already does and it's our ArgoCD setup that breaks it, I don't know. But if someone is stuck with this issue, this could be the "hard and long way" solution. I hope it also helps to solve it.