kubernetes-sigs / karpenter

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.
Apache License 2.0

Drifted NodeClaims with a TerminationGracePeriod are not considered for disruption #1702

Closed · wmgroot closed this issue 1 month ago

wmgroot commented 1 month ago

Description

Observed Behavior: NodeClaims enter a Drifted state but are never disrupted.

$ kubectl get nodeclaim worker-test-disruption-xz5jd -o yaml
apiVersion: karpenter.sh/v1
kind: NodeClaim
metadata:
  annotations:
    compatibility.karpenter.k8s.aws/cluster-name-tagged: "true"
    compatibility.karpenter.k8s.aws/kubelet-drift-hash: "460580454793782748"
    karpenter.k8s.aws/ec2nodeclass-hash: "4188936271944326608"
    karpenter.k8s.aws/ec2nodeclass-hash-version: v3
    karpenter.k8s.aws/tagged: "true"
    karpenter.sh/nodepool-hash: "12987612610332357418"
    karpenter.sh/nodepool-hash-version: v3
    test: "1"
  creationTimestamp: "2024-09-20T20:50:38Z"
  finalizers:
  - karpenter.sh/termination
  generateName: worker-test-disruption-
  generation: 1
  labels:
    apps.indeed.com/karpenter: "true"
    apps.indeed.com/karpenter-node-pool: worker-test-disruption
    apps.indeed.com/network-zone: qa
    foo: barrrrrr
    karpenter.k8s.aws/instance-category: c
    karpenter.k8s.aws/instance-cpu: "8"
    karpenter.k8s.aws/instance-cpu-manufacturer: aws
    karpenter.k8s.aws/instance-ebs-bandwidth: "4750"
    karpenter.k8s.aws/instance-encryption-in-transit-supported: "false"
    karpenter.k8s.aws/instance-family: c6g
    karpenter.k8s.aws/instance-generation: "6"
    karpenter.k8s.aws/instance-hypervisor: nitro
    karpenter.k8s.aws/instance-memory: "16384"
    karpenter.k8s.aws/instance-network-bandwidth: "2500"
    karpenter.k8s.aws/instance-size: 2xlarge
    karpenter.sh/capacity-type: on-demand
    karpenter.sh/nodepool: worker-test-disruption
    kubernetes.io/arch: arm64
    kubernetes.io/os: linux
    node.indeed.com/env: infra
    node.kubernetes.io/instance-type: c6g.2xlarge
    role.node.kubernetes.io/worker: "true"
    test-disruption: "true"
    topology.istio.io/subzone: awscmhdev3
    topology.k8s.aws/zone-id: use2-az2
    topology.kubernetes.io/region: us-east-2
    topology.kubernetes.io/zone: us-east-2b
  name: worker-test-disruption-xz5jd
  ownerReferences:
  - apiVersion: karpenter.sh/v1
    blockOwnerDeletion: true
    kind: NodePool
    name: worker-test-disruption
    uid: b401cf94-ad8d-4341-9eb2-5cf97cee2d46
  resourceVersion: "871675110"
  uid: b207ad83-829a-44b2-99c0-c636dac60ad0
spec:
  expireAfter: 720h
  nodeClassRef:
    group: karpenter.k8s.aws
    kind: EC2NodeClass
    name: worker-test-disruption
  requirements:
  - key: kubernetes.io/os
    operator: In
    values:
    - linux
  - key: foo
    operator: In
    values:
    - barrrrrr
  - key: apps.indeed.com/karpenter-node-pool
    operator: In
    values:
    - worker-test-disruption
  - key: karpenter.sh/capacity-type
    operator: In
    values:
    - on-demand
  - key: karpenter.k8s.aws/instance-category
    operator: In
    values:
    - c
    - m
    - r
  - key: role.node.kubernetes.io/worker
    operator: In
    values:
    - "true"
  - key: karpenter.k8s.aws/instance-cpu
    operator: Gt
    values:
    - "7"
  - key: karpenter.sh/nodepool
    operator: In
    values:
    - worker-test-disruption
  - key: karpenter.k8s.aws/instance-generation
    operator: Gt
    values:
    - "2"
  - key: apps.indeed.com/network-zone
    operator: In
    values:
    - qa
  - key: apps.indeed.com/karpenter
    operator: In
    values:
    - "true"
  - key: node.indeed.com/env
    operator: In
    values:
    - infra
  - key: test-disruption
    operator: In
    values:
    - "true"
  - key: kubernetes.io/arch
    operator: In
    values:
    - amd64
    - arm64
  - key: topology.istio.io/subzone
    operator: In
    values:
    - awscmhdev3
  - key: node.kubernetes.io/instance-type
    operator: In
    values:
    - c4.2xlarge
    - c5.2xlarge
    - c5a.2xlarge
    - c5a.4xlarge
    - c5ad.2xlarge
    - c5d.2xlarge
    - c5n.2xlarge
    - c6a.2xlarge
    - c6a.4xlarge
    - c6g.2xlarge
    - c6g.4xlarge
    - c6gd.2xlarge
    - c6gd.4xlarge
    - c6gn.2xlarge
    - c6i.2xlarge
    - c6id.2xlarge
    - c6in.2xlarge
    - c7a.2xlarge
    - c7g.2xlarge
    - c7g.4xlarge
    - c7gd.2xlarge
    - c7gn.2xlarge
    - c7i-flex.2xlarge
    - c7i.2xlarge
    - m4.2xlarge
    - m5.2xlarge
    - m5a.2xlarge
    - m5ad.2xlarge
    - m5d.2xlarge
    - m5dn.2xlarge
    - m5n.2xlarge
    - m6a.2xlarge
    - m6g.2xlarge
    - m6g.4xlarge
    - m6gd.2xlarge
    - m6i.2xlarge
    - m6id.2xlarge
    - m6in.2xlarge
    - m7a.2xlarge
    - m7g.2xlarge
    - m7gd.2xlarge
    - m7i-flex.2xlarge
    - m7i.2xlarge
    - r4.2xlarge
    - r5.2xlarge
    - r5a.2xlarge
    - r5ad.2xlarge
    - r5b.2xlarge
    - r5d.2xlarge
    - r5n.2xlarge
    - r6a.2xlarge
    - r6g.2xlarge
    - r6gd.2xlarge
    - r6i.2xlarge
    - r6id.2xlarge
    - r7a.2xlarge
    - r7g.2xlarge
    - r7gd.2xlarge
    - r7i.2xlarge
    - r8g.2xlarge
  resources:
    requests:
      cpu: 3157m
      ephemeral-storage: 1752Mi
      memory: "6612603620"
      pods: "15"
  startupTaints:
  - effect: NoSchedule
    key: node.cilium.io/agent-not-ready
  taints:
  - effect: NoSchedule
    key: test-disruption
    value: "true"
  terminationGracePeriod: 24h0m0s
status:
  allocatable:
    cpu: 7700m
    ephemeral-storage: 213504Mi
    memory: "14828122931"
    pods: "43"
    vpc.amazonaws.com/pod-eni: "38"
  capacity:
    cpu: "8"
    ephemeral-storage: 250Gi
    memory: 15096Mi
    pods: "43"
    vpc.amazonaws.com/pod-eni: "38"
  conditions:
  - lastTransitionTime: "2024-09-20T21:01:09Z"
    message: ""
    reason: ConsistentStateFound
    status: "True"
    type: ConsistentStateFound
  - lastTransitionTime: "2024-09-20T20:51:32Z"
    message: ""
    reason: Consolidatable
    status: "True"
    type: Consolidatable
  - lastTransitionTime: "2024-09-20T21:02:18Z"
    message: NodePoolDrifted
    reason: NodePoolDrifted
    status: "True"
    type: Drifted
  - lastTransitionTime: "2024-09-20T20:51:32Z"
    message: ""
    reason: Initialized
    status: "True"
    type: Initialized
  - lastTransitionTime: "2024-09-20T20:50:40Z"
    message: ""
    reason: Launched
    status: "True"
    type: Launched
  - lastTransitionTime: "2024-09-20T20:51:32Z"
    message: ""
    reason: Ready
    status: "True"
    type: Ready
  - lastTransitionTime: "2024-09-20T20:51:01Z"
    message: ""
    reason: Registered
    status: "True"
    type: Registered
  imageID: ami-06ab2496c2a927ca6
  nodeName: ip-10-115-210-142.us-east-2.compute.internal
  providerID: aws:///us-east-2b/i-0625839770e0cf9bf

No volumeattachments are involved with this issue.

$ kubectl get volumeattachments | grep ip-10-115-210-142.us-east-2.compute.internal
<nothing>

My node says that disruption is blocked due to a pending pod, but there are no pending pods in my cluster, and the node in question is tainted so that only a single do-not-disrupt pod can schedule on it as a test case.
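For reference, a plain kubectl field selector (nothing Karpenter-specific) can confirm there are no Pending pods cluster-wide; shown here as a sketch:

$ kubectl get pods -A --field-selector=status.phase=Pending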

$ kubectl describe node ip-10-115-210-142.us-east-2.compute.internal

Taints:             test-disruption=true:NoSchedule
Unschedulable:      false

Events:
  Type     Reason                   Age                   From                   Message
  ----     ------                   ----                  ----                   -------
  Normal   NodeHasSufficientMemory  47m (x2 over 47m)     kubelet                Node ip-10-115-210-142.us-east-2.compute.internal status is now: NodeHasSufficientMemory
  Normal   NodeAllocatableEnforced  47m                   kubelet                Updated Node Allocatable limit across pods
  Normal   NodeHasSufficientPID     47m (x2 over 47m)     kubelet                Node ip-10-115-210-142.us-east-2.compute.internal status is now: NodeHasSufficientPID
  Normal   NodeHasNoDiskPressure    47m (x2 over 47m)     kubelet                Node ip-10-115-210-142.us-east-2.compute.internal status is now: NodeHasNoDiskPressure
  Normal   Starting                 47m                   kubelet                Starting kubelet.
  Warning  InvalidDiskCapacity      47m                   kubelet                invalid capacity 0 on image filesystem
  Normal   RegisteredNode           47m                   node-controller        Node ip-10-115-210-142.us-east-2.compute.internal event: Registered Node ip-10-115-210-142.us-east-2.compute.internal in Controller
  Normal   Synced                   47m                   cloud-node-controller  Node synced successfully
  Normal   DisruptionBlocked        47m                   karpenter              Cannot disrupt Node: state node isn't initialized
  Normal   NodeReady                47m                   kubelet                Node ip-10-115-210-142.us-east-2.compute.internal status is now: NodeReady
  Normal   DisruptionBlocked        39m (x4 over 45m)     karpenter              Cannot disrupt Node: state node is nominated for a pending pod
  Normal   DisruptionBlocked        37m                   karpenter              Cannot disrupt Node: state node is nominated for a pending pod
  Normal   DisruptionBlocked        30m (x4 over 36m)     karpenter              Cannot disrupt Node: state node is nominated for a pending pod
  Normal   DisruptionBlocked        29m                   karpenter              Cannot disrupt Node: state node is nominated for a pending pod
  Normal   DisruptionBlocked        4m58s (x13 over 29m)  karpenter              Cannot disrupt Node: state node is nominated for a pending pod
  Normal   DisruptionBlocked        45s (x2 over 2m46s)   karpenter              Cannot disrupt Node: state node is nominated for a pending pod
$ kubectl get pod -n default -o wide
NAME                                 READY   STATUS    RESTARTS   AGE   IP              NODE                                           NOMINATED NODE   READINESS GATES
hello-world-nginx-5964768b4c-fnrxp   1/1     Running   0          77m   10.75.173.246   ip-10-115-217-230.us-east-2.compute.internal   <none>           <none>

$ kubectl get pod -n default hello-world-nginx-5964768b4c-fnrxp -o yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    karpenter.sh/do-not-disrupt: "true"
  creationTimestamp: "2024-09-20T20:49:39Z"
  generateName: hello-world-nginx-5964768b4c-
  labels:
    app: hello-world-nginx
    pod-template-hash: 5964768b4c
  name: hello-world-nginx-5964768b4c-fnrxp
  namespace: default
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: ReplicaSet
    name: hello-world-nginx-5964768b4c
    uid: 19b2047d-90ab-494d-ac65-94f2ad42971d
  resourceVersion: "871647910"
  uid: 4748ed5b-1cb5-4c00-a5f7-996e73c5ce97
spec:
  containers:
  - image: docker-registry.awscmhdev3.k8s.indeed.tech/dockerhub-proxy/nginxinc/nginx-unprivileged:1.23-alpine
    imagePullPolicy: IfNotPresent
    name: hello-world-nginx
    ports:
    - containerPort: 8080
      protocol: TCP
    resources:
      requests:
        cpu: "2"
        memory: 4Gi
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /etc/nginx
      name: nginx-conf
      readOnly: true
    - mountPath: /var/log/nginx
      name: log
    - mountPath: /var/cache/nginx
      name: cache
    - mountPath: /var/run
      name: run
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-5mthz
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  nodeName: ip-10-115-217-230.us-east-2.compute.internal
  nodeSelector:
    karpenter.sh/capacity-type: on-demand
    test-disruption: "true"
  preemptionPolicy: Never
  priority: 100
  priorityClassName: tier-3
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext:
    fsGroup: 1000
    runAsGroup: 1000
    runAsUser: 1000
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoSchedule
    key: test-disruption
    value: "true"
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - configMap:
      defaultMode: 420
      items:
      - key: nginx.conf
        path: nginx.conf
      name: hello-world-nginx
    name: nginx-conf
  - emptyDir: {}
    name: log
  - emptyDir: {}
    name: cache
  - emptyDir: {}
    name: run
  - name: kube-api-access-5mthz
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace

Expected Behavior: Nodes with a TerminationGracePeriod set that host do-not-disrupt or PDB-blocked pods can be disrupted when their NodeClaim drifts and are eventually drained. A new NodeClaim is created immediately once disruption of the old NodeClaim begins.

Reproduction Steps (Please include YAML):
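A minimal sketch of the setup, reconstructed from the NodeClaim and Pod shown above. Field values below are assumptions wherever they are not visible in those objects, and drift was triggered by editing the NodePool template (hence the NodePoolDrifted reason):

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: worker-test-disruption
spec:
  template:
    metadata:
      labels:
        test-disruption: "true"
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: worker-test-disruption
      requirements:
      - key: karpenter.sh/capacity-type
        operator: In
        values: ["on-demand"]
      taints:
      - key: test-disruption
        value: "true"
        effect: NoSchedule
      terminationGracePeriod: 24h
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hello-world-nginx
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: hello-world-nginx
  template:
    metadata:
      labels:
        app: hello-world-nginx
      annotations:
        karpenter.sh/do-not-disrupt: "true"
    spec:
      nodeSelector:
        karpenter.sh/capacity-type: on-demand
        test-disruption: "true"
      tolerations:
      - key: test-disruption
        value: "true"
        effect: NoSchedule
      containers:
      - name: hello-world-nginx
        image: nginxinc/nginx-unprivileged:1.23-alpine
        ports:
        - containerPort: 8080
        resources:
          requests:
            cpu: "2"
            memory: 4Gi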

Versions: 1.0.1

k8s-ci-robot commented 1 month ago

This issue is currently awaiting triage.

If Karpenter contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.
wmgroot commented 1 month ago

For anyone experiencing similar symptoms without a TerminationGracePeriod, check whether you might be encountering the volumeattachments problem described in this issue: https://github.com/kubernetes-sigs/karpenter/issues/1684

njtran commented 1 month ago

Adding correspondence from slack:

Two potential influencing factors here:

I'm guessing you're more impacted by the second factor, based on the events I see. This is likely a difference from v0.37: we didn't have TGP then and didn't enqueue nodes for deletion that had do-not-disrupt/PDB-blocking pods in the first place, so the likelihood of ending up with indefinitely draining nodes is higher now. Not to mention that we also now block eviction on do-not-disrupt, so the average drain time may be higher than in v0.37.

If I had to guess, the best way to fix this would be for us to solve our preferences story (https://github.com/kubernetes-sigs/karpenter/issues/666) and add a PreferNoSchedule taint to drifted nodes. We discussed adding that taint, but we didn't have a consistent story for how consolidation, disruption, and provisioning could all align so that we don't get any flapping issues.
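For illustration only, the shape of the taint being discussed; the key name below is hypothetical, and Karpenter does not apply such a taint today:

taints:
- key: karpenter.sh/drifted    # hypothetical key, shown only to illustrate the proposal
  effect: PreferNoSchedule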

njtran commented 1 month ago

More correspondence from Slack (some of which is already included here): Yeah, we should definitely dive into it; once again, these were just theories. Can you open the issue with the logs/events and check which pods are being considered here? There should be pod nomination events and node nomination events. It'd be interesting to see the following:
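A sketch of how those events could be pulled with kubectl; the DisruptionBlocked reason comes from the node events above, while the nomination reason string is an assumption and may differ by Karpenter version:

$ kubectl get events -A --field-selector reason=DisruptionBlocked
$ kubectl get events -A --field-selector reason=Nominated    # pod/node nomination events, if emitted under this reason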

njtran commented 1 month ago

@wmgroot any thoughts here? Did you get a chance to validate what I was saying?

wmgroot commented 1 month ago

@cnmcavoy identified a bug in Karpenter's logic that tracks nodes marked for deletion. There's an error case that can fail to unmark a marked node, resulting in disruption budgets being reached while no progress can be made. We have a patch that we've been testing for the last week and plan to open a PR for it soon.

We think that TGP is not directly related to this problem, but was exacerbating the issue since nodes in a terminating state take up space in the disruption budget while they're pending termination.
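For context, the budget being consumed is the NodePool's disruption budget; a hypothetical example (the value is made up) where a few nodes stuck in a terminating state can use up the entire allowance:

spec:
  disruption:
    budgets:
    - nodes: "10%"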

wmgroot commented 1 month ago

Ultimately the bug was introduced in some of our patched code that addresses issues with single-node and multi-node consolidation. We plan to work further with the maintainers on improvements to consolidation to avoid the need to run a forked version of Karpenter.

After addressing the bug in our patch, we have seen our disruption frequencies and cluster scale return to pre-v1 levels. We'll re-open or create a new issue if we notice anything else amiss with TGP and drift disruption.