kubernetes-sigs / karpenter

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.

PDBs still block "forceful" node termination #1776

Open dpiddock opened 4 weeks ago

dpiddock commented 4 weeks ago

Description

Observed Behavior: After upgrading to Karpenter 1.0, we tried to enact a policy that terminates nodes after 7d, with a 4h terminationGracePeriod. However, Karpenter still refuses to evict a pod at the deadline if its PDB does not allow disruption. This leaves us with large instances running just a single workload pod, since Karpenter has already evicted the other workloads and tainted the node karpenter.sh/disrupted:NoSchedule 💸.

Repeated events are generated against the node:

  Normal   DisruptionBlocked  14m (x1329 over 47h)    karpenter  Cannot disrupt Node: state node is marked for deletion
  Warning  FailedDraining     3m48s (x1407 over 47h)  karpenter  Failed to drain node, 12 pods are waiting to be evicted

Those are 11 DaemonSet pods and 1 pod from a Deployment. The Deployment's PDB is configured to disallow normal eviction of the pod.
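
For illustration, a PDB along these lines reproduces the block (the name default/test matches the error below; maxUnavailable: 0 and the selector are assumptions, any PDB that currently permits zero disruptions for the pod behaves the same):

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: test
  namespace: default
spec:
  # Assumption: zero allowed disruptions; a minAvailable equal to the
  # replica count has the same effect
  maxUnavailable: 0
  selector:
    matchLabels:
      app: test  # placeholder; must match the Deployment's pod labels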

Karpenter itself is logging:

{
  "level": "ERROR",
  "time": "2024-10-25T12:46:56.452Z",
  "logger": "controller",
  "message": "consistency error",
  "commit": "6174c75",
  "controller": "nodeclaim.consistency",
  "controllerGroup": "karpenter.sh",
  "controllerKind": "NodeClaim",
  "NodeClaim": {
    "name": "test-vxtgb"
  },
  "namespace": "",
  "name": "test-vxtgb",
  "reconcileID": "2a7b8ffd-80cf-4fbf-b612-870a33adec27",
  "error": "can't drain node, PDB \"default/test\" is blocking evictions"
}

Expected Behavior: When a node owned by Karpenter reaches expireAfter + terminationGracePeriod, all pods are removed and the node is terminated.

I'm not sure whether this is actually a documentation bug, but the documentation for terminationGracePeriod certainly implies, to my reading, that PDBs get overridden once the grace period expires:

  Pods blocking eviction like PDBs and do-not-disrupt will block full draining until the terminationGracePeriod is reached.
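
To make that expectation concrete, here is the relevant slice of the NodePool shown further down, annotated with our reading (the 172h figure is our arithmetic, not observed behaviour):

apiVersion: karpenter.sh/v1
kind: NodePool
spec:
  template:
    spec:
      # Node becomes eligible for expiry-based disruption after 7 days...
      expireAfter: 168h
      # ...and once Karpenter begins deleting the node, pods still blocked by
      # PDBs or do-not-disrupt should, per our reading of the docs, be
      # force-evicted after this deadline, i.e. at roughly 168h + 4h = 172h
      terminationGracePeriod: 4h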

Reproduction Steps (Please include YAML):

Versions:

k8s-ci-robot commented 4 weeks ago

This issue is currently awaiting triage.

If Karpenter contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and providing further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.
engedaam commented 2 weeks ago

Can you share your Karpenter configuration?

dpiddock commented 2 weeks ago

We install Karpenter with Helm:

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: node.kubernetes.io/karpenter-workload
          operator: Exists
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - topologyKey: kubernetes.io/hostname
controller:
  resources:
    limits:
      memory: 1Gi
    requests:
      cpu: 0.25
      memory: 1Gi
dnsPolicy: Default
logLevel: info
podAnnotations:
  prometheus.io/port: "8080"
  prometheus.io/scrape: "true"
podDisruptionBudget:
  maxUnavailable: 1
  name: karpenter
priorityClassName: system-cluster-critical
serviceAccount:
  create: false
  name: karpenter-controller
settings:
  clusterEndpoint: https://[...].eks.amazonaws.com
  clusterName: application-cluster
  interruptionQueue: application-cluster-karpenter-interruption-handler
strategy:
  rollingUpdate:
    maxUnavailable: 1
tolerations:
- effect: NoSchedule
  key: node.kubernetes.io/workload
  operator: Equal
  value: karpenter
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: topology.kubernetes.io/zone
  whenUnsatisfiable: DoNotSchedule
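
These values are applied with something like the following (the OCI chart location is from the Karpenter install docs; the namespace and version here are illustrative):

helm upgrade --install karpenter oci://public.ecr.aws/karpenter/karpenter \
  --namespace kube-system \
  --version 1.0.0 \
  --values values.yaml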

A sample EC2NodeClass:

apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: mixed
spec:
  amiFamily: AL2
  amiSelectorTerms:
  - id: ami-1 # amazon-eks-node-1.30-*
  - id: ami-2 # amazon-eks-arm64-node-1.30-*
  blockDeviceMappings:
  - deviceName: /dev/xvda
    ebs:
      deleteOnTermination: true
      encrypted: true
      volumeSize: 128Gi
      volumeType: gp3
  detailedMonitoring: true
  metadataOptions:
    httpEndpoint: enabled
    httpProtocolIPv6: disabled
    httpPutResponseHopLimit: 2
    httpTokens: required
  role: application-cluster-node
  securityGroupSelectorTerms:
  - id: sg-1
  - id: sg-2
  subnetSelectorTerms:
  - id: subnet-a
  - id: subnet-b
  - id: subnet-c
  tags:
    Edition: mixed
    karpenter.sh/discovery: application-cluster
  userData: |
    #!/bin/bash
    KUBELET_CONFIG=/etc/kubernetes/kubelet/kubelet-config.json
    grep -v search /etc/resolv.conf > /etc/kubernetes/kubelet/resolv.conf
    echo "$(jq '.resolvConf="/etc/kubernetes/kubelet/resolv.conf"' $KUBELET_CONFIG)" > $KUBELET_CONFIG
    echo "$(jq '.registryPullQPS=10' $KUBELET_CONFIG)" >  $KUBELET_CONFIG
    echo "$(jq '.registryBurst=25' $KUBELET_CONFIG)" >  $KUBELET_CONFIG

And the matching NodePool:

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: mixed
spec:
  disruption:
    budgets:
    - nodes: 10%
    - nodes: "0"
      reasons:
      - Drifted
    consolidateAfter: 0s
    consolidationPolicy: WhenEmptyOrUnderutilized
  limits:
    cpu: "1500"
  template:
    spec:
      expireAfter: 168h # 1 week
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: mixed
      requirements:
      - key: karpenter.k8s.aws/instance-category
        operator: In
        values:
        - c
        - m
        - r
      - key: karpenter.k8s.aws/instance-cpu
        operator: In
        values:
        - "8"
        - "16"
        - "32"
      - key: karpenter.k8s.aws/instance-generation
        operator: Gt
        values:
        - "4"
      - key: karpenter.k8s.aws/instance-hypervisor
        operator: In
        values:
        - nitro
      - key: topology.kubernetes.io/zone
        operator: In
        values:
        - us-east-1a
        - us-east-1b
        - us-east-1c
      - key: kubernetes.io/arch
        operator: In
        values:
        - amd64
      - key: karpenter.sh/capacity-type
        operator: In
        values:
        - on-demand
        - spot
      startupTaints:
      - effect: NoExecute
        key: ebs.csi.aws.com/agent-not-ready
      - effect: NoExecute
        key: efs.csi.aws.com/agent-not-ready
      terminationGracePeriod: 4h
  weight: 50
PavelGloba commented 2 days ago

If this feature works properly, we are going to migrate to Karpenter.