kubernetes-sigs / karpenter

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.
Apache License 2.0

karpenter fails to add a spot replacement to on-demand ones because of nodeclaim validation #1208

Open myaser opened 5 months ago

myaser commented 5 months ago

Description

Observed Behavior: For a NodePool with mixed capacity types (on-demand and spot), Karpenter tries to decommission an on-demand instance and replace it with a spot instance. It then fails to do so because the generated NodeClaim contains a requirement whose label key is in the restricted domain "karpenter.sh".

Check the controller logs:

karpenter-779ff45f5c-nmn5w controller {"level":"INFO","time":"2024-04-25T10:22:27.988Z","logger":"controller.disruption","message":"disrupting via consolidation replace, terminating 1 nodes (25 pods) ip-10-149-88-228.eu-central-1.compute.internal/m5.xlarge/on-demand and replacing with spot node from types m5.xlarge","commit":"6b868db-dirty","command-id":"e31d43f1-3b17-4be9-acdb-658ba38f5b95"}

karpenter-779ff45f5c-nmn5w controller {"level":"ERROR","time":"2024-04-25T10:22:28.058Z","logger":"controller.disruption","message":"disrupting via \"consolidation\", disrupting candidates, launching replacement nodeclaim (command-id: e31d43f1-3b17-4be9-acdb-658ba38f5b95), creating node claim, NodeClaim.karpenter.sh \"karpenter-default-wx8rz\" is invalid: spec.requirements[9].key: Invalid value: \"string\": label domain \"karpenter.sh\" is restricted","commit":"6b868db-dirty"}

Expected Behavior: Creating spot replacements for on-demand nodes should not be blocked.

Reproduction Steps (Please include YAML):

node pools

apiVersion: v1
items:
- apiVersion: karpenter.sh/v1beta1
  kind: NodePool
  metadata:
    annotations:
      karpenter.sh/nodepool-hash: "3243005398540344161"
      karpenter.sh/nodepool-hash-version: v2
      kubectl.kubernetes.io/last-applied-configuration: |
        {"apiVersion":"karpenter.sh/v1beta1","kind":"NodePool","metadata":{"annotations":{},"name":"karpenter-default"},"spec":{"disruption":{"consolidationPolicy":"WhenUnderutilized","expireAfter":"Never"},"template":{"metadata":{"labels":{"cluster-lifecycle-controller.zalan.do/replacement-strategy":"none","lifecycle-status":"ready","node.kubernetes.io/node-pool":"karpenter-default","node.kubernetes.io/profile":"worker-karpenter","node.kubernetes.io/role":"worker"}},"spec":{"kubelet":{"clusterDNS":["10.0.1.100"],"cpuCFSQuota":false,"kubeReserved":{"cpu":"100m","memory":"282Mi"},"maxPods":32,"systemReserved":{"cpu":"100m","memory":"164Mi"}},"nodeClassRef":{"name":"karpenter-default"},"requirements":[{"key":"node.kubernetes.io/instance-type","operator":"In","values":["m5.8xlarge","m5.xlarge"]},{"key":"karpenter.sh/capacity-type","operator":"In","values":["spot","on-demand"]},{"key":"kubernetes.io/arch","operator":"In","values":["arm64","amd64"]},{"key":"topology.kubernetes.io/zone","operator":"In","values":["eu-central-1a","eu-central-1b","eu-central-1c"]}],"startupTaints":[{"effect":"NoSchedule","key":"zalando.org/node-not-ready"}]}},"weight":1}}
    creationTimestamp: "2024-04-25T09:09:18Z"
    generation: 1
    name: karpenter-default
    resourceVersion: "1942211133"
    uid: 0d6de200-cac7-4ea3-a12d-a254b60b29f9
  spec:
    disruption:
      budgets:
      - nodes: 10%
      consolidationPolicy: WhenUnderutilized
      expireAfter: Never
    template:
      metadata:
        labels:
          cluster-lifecycle-controller.zalan.do/replacement-strategy: none
          lifecycle-status: ready
          node.kubernetes.io/node-pool: karpenter-default
          node.kubernetes.io/profile: worker-karpenter
          node.kubernetes.io/role: worker
      spec:
        kubelet:
          clusterDNS:
          - 10.0.1.100
          cpuCFSQuota: false
          kubeReserved:
            cpu: 100m
            memory: 282Mi
          maxPods: 32
          systemReserved:
            cpu: 100m
            memory: 164Mi
        nodeClassRef:
          name: karpenter-default
        requirements:
        - key: node.kubernetes.io/instance-type
          operator: In
          values:
          - m5.8xlarge
          - m5.xlarge
        - key: karpenter.sh/capacity-type
          operator: In
          values:
          - spot
          - on-demand
        - key: kubernetes.io/arch
          operator: In
          values:
          - arm64
          - amd64
        - key: topology.kubernetes.io/zone
          operator: In
          values:
          - eu-central-1a
          - eu-central-1b
          - eu-central-1c
        startupTaints:
        - effect: NoSchedule
          key: zalando.org/node-not-ready
    weight: 1
  status:
    resources:
      cpu: "8"
      ephemeral-storage: 202861920Ki
      memory: 32315584Ki
      pods: "220"
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

Versions:

tzneal commented 5 months ago

If the ip-10-149-88-228.eu-central-1.compute.internal node is around, can you supply the node object and node claim YAML?
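
For example, something like the following should capture both objects (the node name is taken from the log above; the NodeClaim name will be whatever Karpenter generated on your cluster):

```
kubectl get node ip-10-149-88-228.eu-central-1.compute.internal -o yaml
kubectl get nodeclaims.karpenter.sh -o yaml
```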

myaser commented 5 months ago

I found another instance (on a different cluster) where it failed to replace a spot node with another spot node. I captured the node object and NodeClaim YAMLs.

node object

apiVersion: v1
kind: Node
metadata:
  annotations:
    alpha.kubernetes.io/provided-node-ip: 172.31.5.136
    csi.volume.kubernetes.io/nodeid: '{"ebs.csi.aws.com":"i-059805a98b7e75171"}'
    flannel.alpha.coreos.com/backend-data: '{"VNI":1,"VtepMAC":"d6:b1:3a:ae:4a:bd"}'
    flannel.alpha.coreos.com/backend-type: vxlan
    flannel.alpha.coreos.com/kube-subnet-manager: "true"
    flannel.alpha.coreos.com/public-ip: 172.31.5.136
    karpenter.k8s.aws/ec2nodeclass-hash: "2026609550328776800"
    karpenter.k8s.aws/ec2nodeclass-hash-version: v2
    karpenter.sh/nodepool-hash: "4369624379001278596"
    karpenter.sh/nodepool-hash-version: v2
    kubectl.kubernetes.io/last-applied-configuration: {}
    node.alpha.kubernetes.io/ttl: "0"
    volumes.kubernetes.io/controller-managed-attach-detach: "true"
  creationTimestamp: "2024-05-02T12:39:31Z"
  finalizers:
  - karpenter.sh/termination
  labels:
    aws.amazon.com/spot: "true"
    beta.kubernetes.io/arch: amd64
    beta.kubernetes.io/instance-type: m5.large
    beta.kubernetes.io/os: linux
    cluster-lifecycle-controller.zalan.do/replacement-strategy: none
    failure-domain.beta.kubernetes.io/region: eu-central-1
    failure-domain.beta.kubernetes.io/zone: eu-central-1a
    karpenter.k8s.aws/instance-category: m
    karpenter.k8s.aws/instance-cpu: "2"
    karpenter.k8s.aws/instance-cpu-manufacturer: intel
    karpenter.k8s.aws/instance-encryption-in-transit-supported: "false"
    karpenter.k8s.aws/instance-family: m5
    karpenter.k8s.aws/instance-generation: "5"
    karpenter.k8s.aws/instance-hypervisor: nitro
    karpenter.k8s.aws/instance-memory: "8192"
    karpenter.k8s.aws/instance-network-bandwidth: "750"
    karpenter.k8s.aws/instance-size: large
    karpenter.sh/capacity-type: spot
    karpenter.sh/initialized: "true"
    karpenter.sh/nodepool: default-karpenter
    karpenter.sh/registered: "true"
    kubernetes.io/arch: amd64
    kubernetes.io/hostname: ip-172-31-5-136.eu-central-1.compute.internal
    kubernetes.io/os: linux
    kubernetes.io/role: worker
    lifecycle-status: ready
    node.kubernetes.io/distro: ubuntu
    node.kubernetes.io/instance-type: m5.large
    node.kubernetes.io/node-pool: default-karpenter
    node.kubernetes.io/profile: worker-karpenter
    node.kubernetes.io/role: worker
    topology.ebs.csi.aws.com/zone: eu-central-1a
    topology.kubernetes.io/region: eu-central-1
    topology.kubernetes.io/zone: eu-central-1a
  name: ip-172-31-5-136.eu-central-1.compute.internal
  ownerReferences:
  - apiVersion: karpenter.sh/v1beta1
    blockOwnerDeletion: true
    kind: NodeClaim
    name: default-karpenter-hcj5f
    uid: 1c95cfac-270d-4bbf-b1c6-b8d1af38ef6f
  resourceVersion: "2533516828"
  uid: e7845ae8-042f-4e16-b31e-55ecd40ee6ac
spec:
  podCIDR: 10.2.248.0/24
  podCIDRs:
  - 10.2.248.0/24
  providerID: aws:///eu-central-1a/i-059805a98b7e75171
status: {}

nodeClaim

apiVersion: karpenter.sh/v1beta1
kind: NodeClaim
metadata:
  annotations:
    karpenter.k8s.aws/ec2nodeclass-hash: "2026609550328776800"
    karpenter.k8s.aws/ec2nodeclass-hash-version: v2
    karpenter.k8s.aws/tagged: "true"
    karpenter.sh/nodepool-hash: "4369624379001278596"
    karpenter.sh/nodepool-hash-version: v2
    kubectl.kubernetes.io/last-applied-configuration: {}
  creationTimestamp: "2024-05-02T12:38:47Z"
  finalizers:
  - karpenter.sh/termination
  generateName: default-karpenter-
  generation: 1
  labels:
    cluster-lifecycle-controller.zalan.do/replacement-strategy: none
    karpenter.k8s.aws/instance-category: m
    karpenter.k8s.aws/instance-cpu: "2"
    karpenter.k8s.aws/instance-cpu-manufacturer: intel
    karpenter.k8s.aws/instance-encryption-in-transit-supported: "false"
    karpenter.k8s.aws/instance-family: m5
    karpenter.k8s.aws/instance-generation: "5"
    karpenter.k8s.aws/instance-hypervisor: nitro
    karpenter.k8s.aws/instance-memory: "8192"
    karpenter.k8s.aws/instance-network-bandwidth: "750"
    karpenter.k8s.aws/instance-size: large
    karpenter.sh/capacity-type: spot
    karpenter.sh/nodepool: default-karpenter
    kubernetes.io/arch: amd64
    kubernetes.io/os: linux
    lifecycle-status: ready
    node.kubernetes.io/instance-type: m5.large
    node.kubernetes.io/node-pool: default-karpenter
    node.kubernetes.io/profile: worker-karpenter
    node.kubernetes.io/role: worker
    topology.kubernetes.io/region: eu-central-1
    topology.kubernetes.io/zone: eu-central-1a
  name: default-karpenter-hcj5f
  ownerReferences:
  - apiVersion: karpenter.sh/v1beta1
    blockOwnerDeletion: true
    kind: NodePool
    name: default-karpenter
    uid: 2536b136-fc71-40a9-a233-f51b81120e97
  resourceVersion: "2533447771"
  uid: 1c95cfac-270d-4bbf-b1c6-b8d1af38ef6f
spec:
  kubelet:
    clusterDNS:
    - 10.0.1.100
    cpuCFSQuota: false
    kubeReserved:
      cpu: 100m
      memory: 282Mi
    maxPods: 32
    systemReserved:
      cpu: 100m
      memory: 164Mi
  nodeClassRef:
    name: default-karpenter
  requirements:
  - key: topology.kubernetes.io/region
    operator: In
    values:
    - eu-central-1
  - key: karpenter.k8s.aws/instance-size
    operator: NotIn
    values:
    - metal
  - key: kubernetes.io/arch
    operator: In
    values:
    - amd64
    - arm64
  - key: topology.kubernetes.io/zone
    operator: In
    values:
    - eu-central-1a
  - key: node.kubernetes.io/node-pool
    operator: In
    values:
    - default-karpenter
  - key: node.kubernetes.io/profile
    operator: In
    values:
    - worker-karpenter
  - key: node.kubernetes.io/role
    operator: In
    values:
    - worker
  - key: karpenter.sh/nodepool
    operator: In
    values:
    - default-karpenter
  - key: karpenter.sh/capacity-type
    operator: In
    values:
    - spot
  - key: node.kubernetes.io/instance-type
    operator: In
    values:
    - c5.xlarge
    - c5d.xlarge
    - c6i.xlarge
    - c6id.xlarge
    - c6in.xlarge
    - m5.large
    - m5.xlarge
    - m5d.large
    - m5d.xlarge
    - m5n.large
    - m5n.xlarge
    - m6i.large
    - m6i.xlarge
    - m6id.large
    - m6in.large
    - r5.large
    - r5d.large
    - r5n.large
    - r6i.large
    - r6i.xlarge
    - r6id.large
  - key: karpenter.k8s.aws/instance-family
    operator: In
    values:
    - c5
    - c5d
    - c5n
    - c6i
    - c6id
    - c6in
    - m5
    - m5d
    - m5n
    - m6i
    - m6id
    - m6in
    - r5
    - r5d
    - r5n
    - r6i
    - r6id
    - r6in
  - key: cluster-lifecycle-controller.zalan.do/replacement-strategy
    operator: In
    values:
    - none
  - key: lifecycle-status
    operator: In
    values:
    - ready
  resources:
    requests:
      cpu: 1517m
      ephemeral-storage: 2816Mi
      memory: 5060Mi
      pods: "14"
  startupTaints:
  - effect: NoSchedule
    key: zalando.org/node-not-ready
status:
  allocatable:
    cpu: 1800m
    ephemeral-storage: 89Gi
    memory: 7031Mi
    pods: "32"
    vpc.amazonaws.com/pod-eni: "9"
  capacity:
    cpu: "2"
    ephemeral-storage: 100Gi
    memory: 7577Mi
    pods: "32"
    vpc.amazonaws.com/pod-eni: "9"
  conditions:
  - lastTransitionTime: "2024-05-02T12:40:21Z"
    status: "True"
    type: Initialized
  - lastTransitionTime: "2024-05-02T12:38:49Z"
    status: "True"
    type: Launched
  - lastTransitionTime: "2024-05-02T12:40:21Z"
    status: "True"
    type: Ready
  - lastTransitionTime: "2024-05-02T12:39:31Z"
    status: "True"
    type: Registered
  imageID: ******
  nodeName: ip-172-31-5-136.eu-central-1.compute.internal
  providerID: aws:///eu-central-1a/i-059805a98b7e75171

nodepool

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  annotations:
    karpenter.sh/nodepool-hash: "4369624379001278596"
    karpenter.sh/nodepool-hash-version: v2
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"karpenter.sh/v1beta1","kind":"NodePool","metadata":{"annotations":{},"name":"default-karpenter"},"spec":{"disruption":{"consolidationPolicy":"WhenUnderutilized","expireAfter":"Never"},"template":{"metadata":{"labels":{"cluster-lifecycle-controller.zalan.do/replacement-strategy":"none","lifecycle-status":"ready","node.kubernetes.io/node-pool":"default-karpenter","node.kubernetes.io/profile":"worker-karpenter","node.kubernetes.io/role":"worker"}},"spec":{"kubelet":{"clusterDNS":["10.0.1.100"],"cpuCFSQuota":false,"kubeReserved":{"cpu":"100m","memory":"282Mi"},"maxPods":32,"systemReserved":{"cpu":"100m","memory":"164Mi"}},"nodeClassRef":{"name":"default-karpenter"},"requirements":[{"key":"karpenter.k8s.aws/instance-family","operator":"In","values":["c5","m5","r5","c5d","m5d","r5d","c5n","m5n","r5n","c6i","m6i","r6i","c6id","m6id","r6id","c6in","m6in","r6in"]},{"key":"karpenter.k8s.aws/instance-size","operator":"NotIn","values":["metal"]},{"key":"node.kubernetes.io/instance-type","operator":"NotIn","values":["c5d.large"]},{"key":"karpenter.sh/capacity-type","operator":"In","values":["spot","on-demand"]},{"key":"kubernetes.io/arch","operator":"In","values":["arm64","amd64"]},{"key":"topology.kubernetes.io/zone","operator":"In","values":["eu-central-1a","eu-central-1b","eu-central-1c"]}],"startupTaints":[{"effect":"NoSchedule","key":"zalando.org/node-not-ready"}]}}}}
  creationTimestamp: "2024-02-08T15:16:14Z"
  generation: 2
  name: default-karpenter
  resourceVersion: "2534926162"
  uid: 2536b136-fc71-40a9-a233-f51b81120e97
spec:
  disruption:
    budgets:
    - nodes: 10%
    consolidationPolicy: WhenUnderutilized
    expireAfter: Never
  template:
    metadata:
      labels:
        cluster-lifecycle-controller.zalan.do/replacement-strategy: none
        lifecycle-status: ready
        node.kubernetes.io/node-pool: default-karpenter
        node.kubernetes.io/profile: worker-karpenter
        node.kubernetes.io/role: worker
    spec:
      kubelet:
        clusterDNS:
        - 10.0.1.100
        cpuCFSQuota: false
        kubeReserved:
          cpu: 100m
          memory: 282Mi
        maxPods: 32
        systemReserved:
          cpu: 100m
          memory: 164Mi
      nodeClassRef:
        name: default-karpenter
      requirements:
      - key: karpenter.k8s.aws/instance-family
        operator: In
        values:
        - c5
        - m5
        - r5
        - c5d
        - m5d
        - r5d
        - c5n
        - m5n
        - r5n
        - c6i
        - m6i
        - r6i
        - c6id
        - m6id
        - r6id
        - c6in
        - m6in
        - r6in
      - key: karpenter.k8s.aws/instance-size
        operator: NotIn
        values:
        - metal
      - key: node.kubernetes.io/instance-type
        operator: NotIn
        values:
        - c5d.large
      - key: karpenter.sh/capacity-type
        operator: In
        values:
        - spot
        - on-demand
      - key: kubernetes.io/arch
        operator: In
        values:
        - arm64
        - amd64
      - key: topology.kubernetes.io/zone
        operator: In
        values:
        - eu-central-1a
        - eu-central-1b
        - eu-central-1c
      startupTaints:
      - effect: NoSchedule
        key: zalando.org/node-not-ready
status:
  resources:
    cpu: "102"
    ephemeral-storage: 2713582276Ki
    memory: 482793344Ki
    pods: "1980"

ec2NodeClass

apiVersion: v1
items:
- apiVersion: karpenter.k8s.aws/v1beta1
  kind: EC2NodeClass
  metadata:
    annotations:
      karpenter.k8s.aws/ec2nodeclass-hash: "2026609550328776800"
      karpenter.k8s.aws/ec2nodeclass-hash-version: v2
      kubectl.kubernetes.io/last-applied-configuration: {}
    creationTimestamp: "2024-02-08T15:16:14Z"
    finalizers:
    - karpenter.k8s.aws/termination
    generation: 7
    name: default-karpenter
    resourceVersion: "2516183961"
    uid: 2d7763a9-397e-4ffb-865f-92dfdaa1179e
  spec:
    amiFamily: Custom
    amiSelectorTerms:
    - id: ami-*****
    - id: ami-*****
    associatePublicIPAddress: true
    blockDeviceMappings:
    - deviceName: /dev/sda1
      ebs:
        deleteOnTermination: true
        volumeSize: 100Gi
        volumeType: gp3
    detailedMonitoring: false
    instanceProfile: .******
    metadataOptions:
      httpEndpoint: enabled
      httpProtocolIPv6: disabled
      httpPutResponseHopLimit: 2
      httpTokens: optional
    securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: WorkerNodeSecurityGroup
    subnetSelectorTerms:
    - tags:
        kubernetes.io/role/karpenter: enabled
    tags:
      InfrastructureComponent: "true"
      Name: default-karpenter
      application: kubernetes
      component: shared-resource
      environment: test
      node.kubernetes.io/node-pool: default-karpenter
      node.kubernetes.io/role: worker
      zalando.de/cluster-local-id/kube-1: owned
      zalando.org/pod-max-pids: "4096"
    userData: {.....}
  status: {}
kind: List
metadata:
  resourceVersion: ""
billrayburn commented 5 months ago

/assign @engedaam

jonathan-innis commented 5 months ago

@myaser Apologies for the late response on this one. Are you still seeing this issue?

myaser commented 5 months ago

@myaser Apologies for the late response on this one. Are you still seeing this issue?

Yes, it is still happening on some of our clusters.

engedaam commented 5 months ago

@myaser We're in the process of attempting to reproduce this issue; we will update once we have more to share.

myaser commented 4 months ago

I have a better understanding of this issue now, and here is how to reproduce it.

We found a pod that uses an invalid node affinity. The affinity was preferredDuringSchedulingIgnoredDuringExecution, so it was ignored/relaxed by Karpenter during initial scheduling. Later, when the node nominated for the pod got consolidated, Karpenter logged this error message. It eventually managed to replace the node, but it took much longer; it seems it did not relax/ignore the preferred affinity, and the error message was strange/misleading.

example pod:

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: testing-nginx
    owner: mgaballah
  name: testing-nginx
spec:
  replicas: 1
  selector:
    matchLabels:
      app: testing-nginx
  template:
    metadata:
      labels:
        app: testing-nginx
    spec:
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 50
              preference:
                matchExpressions:
                - key: karpenter.sh/provisioner-name
                  operator: DoesNotExist
      containers:
      - image: nginx
        name: nginx
        resources: 
          limits:
            cpu: 200m
            memory: 50Mi
          requests:
            cpu: 200m
            memory: 50Mi

After the pod gets scheduled, try to consolidate the node, for example by deleting the node object. We fixed the pod, and the issue disappeared for us. With this understanding, I think this issue is less of a bug than it first appeared, but I would still be interested to understand a few things (see the sketch after this list):

  1. Why did Karpenter not relax the nodeAffinity constraints?
  2. The error message was misleading.
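
For reference, a minimal sketch of the requirement I believe ends up on the replacement NodeClaim, assuming the pod's preferred affinity term is carried over during consolidation (the key name is taken from the example pod above):

```yaml
# Hypothetical excerpt of the replacement NodeClaim's spec.requirements.
# The key comes from the pod's preferred affinity and falls in the restricted
# "karpenter.sh" label domain, which the NodeClaim validation rejects.
requirements:
- key: karpenter.sh/provisioner-name
  operator: DoesNotExist
```
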
engedaam commented 4 months ago

/triage accepted

dimitri-fert commented 4 months ago

Just encountered a similar problem. We have an EKS cluster deployed by Terraform with a NodeGroup of 1 node, on which Karpenter v0.36 is installed and had been working properly.

We recently added a soft nodeAffinity to a few pods to create a preference for the node managed by Terraform. Since Karpenter nodes already contain a few labels, we used a DoesNotExist operator on the karpenter.sh/nodepool-hash key and got errors similar to what the OP had.

Initial affinity we used:

```yaml
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - preference:
          matchExpressions:
          - key: karpenter.sh/nodepool-hash
            operator: DoesNotExist
        weight: 50
```

Associated Karpenter controller logs:

```
{"level":"INFO","time":"2024-06-03T13:20:40.941Z","logger":"controller.disruption","message":"triggering termination for expired node after TTL","commit":"6b868db","ttl":"1h0m0s"}
{"level":"INFO","time":"2024-06-03T13:20:40.941Z","logger":"controller.disruption","message":"disrupting via expiration replace, terminating 1 nodes (2 pods) xxxxxxxxxxxxxxxxxxxx.compute.internal/t4g.small/on-demand and replacing with on-demand node from types t4g.small, t3a.small, t3.small, t4g.medium, t3a.medium and 34 other(s)","commit":"6b868db","command-id":"58d486d2-012b-4f3e-ad32-f46f6e82d449"}
{"level":"ERROR","time":"2024-06-03T13:20:41.170Z","logger":"controller.disruption","message":"disrupting via \"expiration\", disrupting candidates, launching replacement nodeclaim (command-id: 58d486d2-012b-4f3e-ad32-f46f6e82d449), creating node claim, NodeClaim.karpenter.sh \"default-5g9xb\" is invalid: spec.requirements[4].key: Invalid value: \"string\": label domain \"karpenter.sh\" is restricted","commit":"6b868db"}
{"level":"INFO","time":"2024-06-03T13:21:14.244Z","logger":"controller.disruption","message":"triggering termination for expired node after TTL","commit":"6b868db","ttl":"1h0m0s"}
{"level":"INFO","time":"2024-06-03T13:21:14.244Z","logger":"controller.disruption","message":"disrupting via expiration replace, terminating 1 nodes (2 pods) xxxxxxxxxxxxxxxx.compute.internal/t4g.small/on-demand and replacing with on-demand node from types t4g.small, t3a.small, t3.small, t4g.medium, t3a.medium and 34 other(s)","commit":"6b868db","command-id":"7827779a-3b0c-4c65-83b0-8d427de328be"}
{"level":"ERROR","time":"2024-06-03T13:21:14.462Z","logger":"controller.disruption","message":"disrupting via \"expiration\", disrupting candidates, launching replacement nodeclaim (command-id: 7827779a-3b0c-4c65-83b0-8d427de328be), creating node claim, NodeClaim.karpenter.sh \"default-785zb\" is invalid: spec.requirements[1].key: Invalid value: \"string\": label domain \"karpenter.sh\" is restricted","commit":"6b868db"}
```

N.B.: sensitive information removed

Later on I questioned the key we used, realising that karpenter.sh/nodepool-hash is an annotation key and not a label key. So I switched to karpenter.sh/nodepool, and that seemed to solve the problem.
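
For clarity, a minimal sketch of the patched preference, assuming only the key changed from the snippet above (karpenter.sh/nodepool is an actual node label, unlike the nodepool-hash annotation):

```yaml
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - preference:
          matchExpressions:
          - key: karpenter.sh/nodepool   # label key, not the nodepool-hash annotation key
            operator: DoesNotExist
        weight: 50
```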

Karpenter's controller last log before applying the patched nodeAffinity:

```
{"level":"ERROR","time":"2024-06-04T10:29:23.741Z","logger":"controller.disruption","message":"disrupting via \"expiration\", disrupting candidates, launching replacement nodeclaim (command-id: cf21d12e-5e6c-417f-92d2-482bf9c78042), creating node claim, NodeClaim.karpenter.sh \"default-xznbg\" is invalid: spec.requirements[2].key: Invalid value: \"string\": label domain \"karpenter.k8s.aws\" is restricted","commit":"6b868db"}
```

Post-apply logs:

```
{"level":"INFO","time":"2024-06-04T10:46:04.246Z","logger":"controller.disruption","message":"triggering termination for expired node after TTL","commit":"6b868db","ttl":"1h0m0s"}
{"level":"INFO","time":"2024-06-04T10:46:04.248Z","logger":"controller.disruption","message":"disrupting via expiration replace, terminating 1 nodes (2 pods) xxxxxxxxxxxxxxxxxxxxxxxx.compute.internal/t4g.small/on-demand and replacing with on-demand node from types t4g.small, t3a.small, t3.small, t4g.medium, t3a.medium and 34 other(s)","commit":"6b868db","command-id":"7c2ae915-8210-4df1-80a6-3462a95c16c8"}
{"level":"INFO","time":"2024-06-04T10:46:04.482Z","logger":"controller.disruption","message":"created nodeclaim","commit":"6b868db","nodepool":"default","nodeclaim":"default-pc2xq","requests":{"cpu":"1220m","memory":"690Mi","pods":"6"},"instance-types":"c5.large, c5.xlarge, c5a.large, c5a.xlarge, c5d.large and 34 other(s)"}
{"level":"INFO","time":"2024-06-04T10:46:07.114Z","logger":"controller.nodeclaim.lifecycle","message":"launched nodeclaim","commit":"6b868db","nodeclaim":"default-pc2xq","provider-id":"aws:///xxxxxxxxxx/i-0d9453af78ad7983e","instance-type":"t4g.small","zone":"xxxxxxxxxx","capacity-type":"on-demand","allocatable":{"cpu":"1930m","ephemeral-storage":"17Gi","memory":"1359Mi","pods":"32"}}
{"level":"INFO","time":"2024-06-04T10:46:15.644Z","logger":"controller.provisioner","message":"found provisionable pod(s)","commit":"6b868db","pods":"monitoring/prometheus-prometheus-kube-prometheus-prometheus-0, kube-system/coredns-dfd64456d-756fw","duration":"196.715971ms"}
{"level":"INFO","time":"2024-06-04T10:46:25.643Z","logger":"controller.provisioner","message":"found provisionable pod(s)","commit":"6b868db","pods":"monitoring/prometheus-prometheus-kube-prometheus-prometheus-0, kube-system/coredns-dfd64456d-756fw","duration":"194.918258ms"}
{"level":"INFO","time":"2024-06-04T10:46:29.844Z","logger":"controller.nodeclaim.lifecycle","message":"registered nodeclaim","commit":"6b868db","nodeclaim":"default-pc2xq","provider-id":"aws:///xxxxxxxxxx/i-0d9453af78ad7983e","node":"xxxxxxxxxxxxxxxxxxxxxxxx.compute.internal"}
{"level":"INFO","time":"2024-06-04T10:46:35.742Z","logger":"controller.provisioner","message":"found provisionable pod(s)","commit":"6b868db","pods":"monitoring/prometheus-prometheus-kube-prometheus-prometheus-0, kube-system/coredns-dfd64456d-756fw","duration":"293.341699ms"}
{"level":"INFO","time":"2024-06-04T10:46:39.474Z","logger":"controller.nodeclaim.lifecycle","message":"initialized nodeclaim","commit":"6b868db","nodeclaim":"default-pc2xq","provider-id":"aws:///xxxxxxxxxx/i-0d9453af78ad7983e","node":"xxxxxxxxxxxxxxxxxxxxxxxx.compute.internal","allocatable":{"cpu":"1930m","ephemeral-storage":"18233774458","hugepages-1Gi":"0","hugepages-2Mi":"0","hugepages-32Mi":"0","hugepages-64Ki":"0","memory":"1408504Ki","pods":"32"}}
{"level":"INFO","time":"2024-06-04T10:46:41.472Z","logger":"controller.disruption.queue","message":"command succeeded","commit":"6b868db","command-id":"7c2ae915-8210-4df1-80a6-3462a95c16c8"}
{"level":"INFO","time":"2024-06-04T10:46:41.567Z","logger":"controller.node.termination","message":"tainted node","commit":"6b868db","node":"xxxxxxxxxxxxxxxxxxxxxxxx.compute.internal"}
{"level":"INFO","time":"2024-06-04T10:46:43.242Z","logger":"controller.provisioner","message":"found provisionable pod(s)","commit":"6b868db","pods":"monitoring/prometheus-prometheus-kube-prometheus-prometheus-0, kube-system/coredns-dfd64456d-756fw","duration":"590.049476ms"}
{"level":"INFO","time":"2024-06-04T10:46:49.190Z","logger":"controller.node.termination","message":"deleted node","commit":"6b868db","node":"xxxxxxxxxxxxxxxxxxxxxxxx.compute.internal"}
{"level":"INFO","time":"2024-06-04T10:46:49.685Z","logger":"controller.nodeclaim.termination","message":"deleted nodeclaim","commit":"6b868db","nodeclaim":"default-7f7lf","node":"xxxxxxxxxxxxxxxxxxxxxxxx.compute.internal","provider-id":"aws:///xxxxxxxxxx/i-0a09fedd4e0233a92"}
```

N.B.: sensitive information removed

I would also agree that the error log can be misleading. Hope this helps someone in need :)