kubernetes-sigs / karpenter

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.
Apache License 2.0

SpotToSpotConsolidation requires 15 cheaper instance type options than the current candidate to consolidate, got 6 #1653

Open dmitry-mightydevops opened 1 month ago

dmitry-mightydevops commented 1 month ago

Description

Observed Behavior:

Karpenter scaled nodes up in response to KEDA/HPA activity. Eventually this left a single beefy node with no load, and Karpenter keeps emitting the Unconsolidatable reason below.

This is my node: karpenter.k8s.aws/instance-size=2xlarge

Name:               ip-10-120-207-50.ec2.internal
Roles:              <none>
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=t3.2xlarge
                    beta.kubernetes.io/os=linux
                    failure-domain.beta.kubernetes.io/region=us-east-1
                    failure-domain.beta.kubernetes.io/zone=us-east-1c
                    k8s.io/cloud-provider-aws=6858d20c77336487e788df818ac91521
                    karpenter.k8s.aws/instance-category=t
                    karpenter.k8s.aws/instance-cpu=8
                    karpenter.k8s.aws/instance-cpu-manufacturer=intel
                    karpenter.k8s.aws/instance-ebs-bandwidth=2780
                    karpenter.k8s.aws/instance-encryption-in-transit-supported=false
                    karpenter.k8s.aws/instance-family=t3
                    karpenter.k8s.aws/instance-generation=3
                    karpenter.k8s.aws/instance-hypervisor=nitro
                    karpenter.k8s.aws/instance-memory=32768
                    karpenter.k8s.aws/instance-network-bandwidth=2048
                    karpenter.k8s.aws/instance-size=2xlarge
                    karpenter.sh/capacity-type=spot
                    karpenter.sh/initialized=true
                    karpenter.sh/nodepool=celery-worker-import-export
                    karpenter.sh/registered=true
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=ip-10-120-207-50.ec2.internal
                    kubernetes.io/os=linux
                    node.kubernetes.io/instance-type=t3.2xlarge
                    topology.ebs.csi.aws.com/zone=us-east-1c
                    topology.k8s.aws/zone-id=use1-az2
                    topology.kubernetes.io/region=us-east-1
                    topology.kubernetes.io/zone=us-east-1c
                    workload=celery-worker-import-export
Annotations:        alpha.kubernetes.io/provided-node-ip: 10.120.207.50
                    compatibility.karpenter.k8s.aws/kubelet-drift-hash: 15379597991425564585
                    csi.volume.kubernetes.io/nodeid: {"ebs.csi.aws.com":"i-0b94841274d002f7d"}
                    karpenter.k8s.aws/ec2nodeclass-hash: 970621003941211566
                    karpenter.k8s.aws/ec2nodeclass-hash-version: v3
                    karpenter.sh/nodepool-hash: 11848879526719149474
                    karpenter.sh/nodepool-hash-version: v3
                    node.alpha.kubernetes.io/ttl: 0
                    volumes.kubernetes.io/controller-managed-atta

CreationTimestamp:  Mon, 09 Sep 2024 19:47:49 -0500
Taints:             workload=celery-worker-import-export:NoSchedule
Unschedulable:      false
Lease:
  HolderIdentity:  ip-10-120-207-50.ec2.internal
  AcquireTime:     <unset>
  RenewTime:       Tue, 10 Sep 2024 16:14:30 -0500
Conditions:
  Type                 Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----                 ------  -----------------                 ------------------                ------                       -------
  NetworkUnavailable   False   Mon, 09 Sep 2024 19:48:26 -0500   Mon, 09 Sep 2024 19:48:26 -0500   CiliumIsUp                   Cilium is running on this node
  MemoryPressure       False   Tue, 10 Sep 2024 16:11:41 -0500   Mon, 09 Sep 2024 19:47:44 -0500   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure         False   Tue, 10 Sep 2024 16:11:41 -0500   Mon, 09 Sep 2024 19:47:44 -0500   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure          False   Tue, 10 Sep 2024 16:11:41 -0500   Mon, 09 Sep 2024 19:47:44 -0500   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready                True    Tue, 10 Sep 2024 16:11:41 -0500   Mon, 09 Sep 2024 19:48:12 -0500   KubeletReady                 kubelet is posting ready status
Addresses:
  InternalIP:   10.120.207.50
  InternalDNS:  ip-10-120-207-50.ec2.internal
  Hostname:     ip-10-120-207-50.ec2.internal
Capacity:
  cpu:                8
  ephemeral-storage:  52350956Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             32475452Ki
  pods:               58
Allocatable:
  cpu:                7910m
  ephemeral-storage:  47172899146
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             31458620Ki
  pods:               58
System Info:
  Machine ID:                 ec266587ac6140721be9fcafb3b2cfb5
  System UUID:                ec266587-ac61-4072-1be9-fcafb3b2cfb5
  Boot ID:                    82d2dda9-6416-414d-b77d-860253d40ddf
  Kernel Version:             6.1.102-111.182.amzn2023.x86_64
  OS Image:                   Amazon Linux 2023.5.20240819
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  containerd://1.7.11
  Kubelet Version:            v1.30.2-eks-1552ad0
  Kube-Proxy Version:         v1.30.2-eks-1552ad0
ProviderID:                   aws:///us-east-1c/i-0b94841274d002f7d
Non-terminated Pods:          (8 in total)
  Namespace                   Name                                           CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----                                           ------------  ----------  ---------------  -------------  ---
  cilium                      cilium-ggnv4                                   100m (1%)     0 (0%)      100Mi (0%)       0 (0%)         20h
  kube-system                 aws-node-lc6nq                                 50m (0%)      0 (0%)      0 (0%)           0 (0%)         20h
  kube-system                 ebs-csi-node-x6njm                             30m (0%)      0 (0%)      120Mi (0%)       768Mi (2%)     20h
  kube-system                 kube-proxy-kffc8                               100m (1%)     0 (0%)      0 (0%)           0 (0%)         20h
  loki                        promtail-8pb5t                                 0 (0%)        0 (0%)      0 (0%)           0 (0%)         20h
  prod                        celery-worker-import-export-b86c65867-hlcwp    900m (11%)    2 (25%)     1Gi (3%)         4Gi (13%)      16h
  prod                        celery-worker-import-export-b86c65867-txzmq    900m (11%)    2 (25%)     1Gi (3%)         4Gi (13%)      16h
  prometheus                  node-exporter-bltb2                            0 (0%)        0 (0%)      0 (0%)           0 (0%)         20h
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests     Limits
  --------           --------     ------
  cpu                2080m (26%)  4 (50%)
  memory             2268Mi (7%)  8960Mi (29%)
  ephemeral-storage  0 (0%)       0 (0%)
  hugepages-1Gi      0 (0%)       0 (0%)
  hugepages-2Mi      0 (0%)       0 (0%)
Events:
  Type    Reason            Age                   From       Message
  ----    ------            ----                  ----       -------
  Normal  Unconsolidatable  8m23s (x33 over 19h)  karpenter  SpotToSpotConsolidation requires 15 cheaper instance type options than the current candidate to consolidate, got 6

Expected Behavior:

The node should be replaced with a cheaper one.

Reproduction Steps (Please include YAML):

EC2NodeClass:

➜ kg ec2nodeclass.karpenter.k8s.aws/celery-worker-import-export -o yaml | k neat

apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  annotations:
    karpenter.k8s.aws/ec2nodeclass-hash: "970621003941211566"
    karpenter.k8s.aws/ec2nodeclass-hash-version: v3
  labels:
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/part-of: celery-worker-import-export
    argocd.argoproj.io/instance: backend
    helm.sh/chart: backend-django-extra-0.1.0
  name: celery-worker-import-export
spec:
  amiSelectorTerms:
  - alias: al2023@latest
  blockDeviceMappings:
  - deviceName: /dev/xvda
    ebs:
      deleteOnTermination: true
      encrypted: true
      volumeSize: 50Gi
      volumeType: gp3
  metadataOptions:
    httpEndpoint: enabled
    httpProtocolIPv6: disabled
    httpPutResponseHopLimit: 2
    httpTokens: required
  role: project-prod-eks-karpenter-node-role
  securityGroupSelectorTerms:
  - tags:
      karpenter.sh/discovery: project-prod-eks
  subnetSelectorTerms:
  - tags:
      karpenter.sh/discovery: project-prod-eks
  tags:
    Name: prod-apps-celery-worker-import-export-karpenter
    environment: prod
    explanation: celery worker import export dedicated nodes provisioned by karpenter
      for the backend extra component
    karpenter: "true"
    karpenter.sh/cluster/project-prod-eks: owned
    karpenter.sh/discovery: project-prod-eks
    managed_by: karpenter
    ops_team: devops
    terraform: "false"
  userData: |
    #!/bin/bash
    # https://github.com/kubernetes-sigs/aws-ebs-csi-driver/blob/master/docs/faq.md#6-minute-delays-in-attaching-volumes
    # https://github.com/kubernetes-sigs/aws-ebs-csi-driver/issues/1955
    echo -e "InhibitDelayMaxSec=45\n" >> /etc/systemd/logind.conf
    systemctl restart systemd-logind
    echo "$(jq ".shutdownGracePeriod=\"400s\"" /etc/kubernetes/kubelet/config.json)" > /etc/kubernetes/kubelet/config.json
    echo "$(jq ".shutdownGracePeriodCriticalPods=\"100s\"" /etc/kubernetes/kubelet/config.json)" > /etc/kubernetes/kubelet/config.json
    systemctl restart kubelet

NodePool:

➜ kg nodepool celery-worker-import-export -o yaml | k neat

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  annotations:
    karpenter.sh/nodepool-hash: "11848879526719149474"
    karpenter.sh/nodepool-hash-version: v3
  labels:
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/part-of: celery-worker-import-export
    argocd.argoproj.io/instance: backend
    helm.sh/chart: backend-django-extra-0.1.0
  name: celery-worker-import-export
spec:
  disruption:
    budgets:
    - nodes: 50%
      reasons:
      - Empty
      - Drifted
      - Underutilized
    - duration: 10h
      nodes: "0"
      reasons:
      - Underutilized
      schedule: 0 9 * * mon-fri
    consolidateAfter: 1m
    consolidationPolicy: WhenEmptyOrUnderutilized
  limits:
    cpu: 32
    memory: 128Gi
  template:
    metadata:
      labels:
        workload: celery-worker-import-export
    spec:
      expireAfter: 720h
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: celery-worker-import-export
      requirements:
      - key: kubernetes.io/arch
        operator: In
        values:
        - amd64
      - key: kubernetes.io/os
        operator: In
        values:
        - linux
      - key: karpenter.sh/capacity-type
        operator: In
        values:
        - spot
      - key: karpenter.k8s.aws/instance-category
        operator: In
        values:
        - c
        - m
        - t
      - key: karpenter.k8s.aws/instance-generation
        operator: Gt
        values:
        - "2"
      - key: karpenter.k8s.aws/instance-family
        operator: In
        values:
        - c5d
        - c7i
        - m5
        - m6i
        - m7i
        - r6
        - r7
        - t3
      - key: karpenter.k8s.aws/instance-cpu
        operator: In
        values:
        - "2"
        - "4"
        - "8"
      - key: topology.kubernetes.io/zone
        operator: In
        values:
        - us-east-1a
        - us-east-1b
        - us-east-1c
      startupTaints:
      - effect: NoExecute
        key: node.cilium.io/agent-not-ready
      taints:
      - effect: NoSchedule
        key: workload
        value: celery-worker-import-export
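
As the event above notes, spot-to-spot consolidation only replaces a node when it can offer at least 15 instance type options cheaper than the current candidate. The requirements here intersect to a fairly small launchable set (the c/m/t instance-category requirement already rules out the r6 and r7 entries in instance-family, and CPU is capped at 2/4/8), which is presumably why only 6 cheaper options were found for the t3.2xlarge. Below is a minimal sketch of a broader requirements block; the wider category list is purely an illustrative assumption, not a recommendation from this thread, and whether the extra families suit the celery workload would need to be validated:

      requirements:
      - key: kubernetes.io/arch
        operator: In
        values:
        - amd64
      - key: karpenter.sh/capacity-type
        operator: In
        values:
        - spot
      # Dropping the explicit instance-family list leaves category + generation
      # to bound the set, so many more cheaper-than-current options stay in play.
      - key: karpenter.k8s.aws/instance-category
        operator: In
        values:
        - c
        - m
        - r
        - t
      - key: karpenter.k8s.aws/instance-generation
        operator: Gt
        values:
        - "2"
      - key: karpenter.k8s.aws/instance-cpu
        operator: In
        values:
        - "2"
        - "4"
        - "8"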

When I deleted the node, I got the following in my Karpenter logs:

{"level":"INFO","time":"2024-09-10T21:29:14.212Z","logger":"controller","message":"created nodeclaim","commit":"62a726c","controller":"provisioner","namespace":"","name":"","reconcileID":"c52506f6-373d-4e3e-b8d9-8a0353bec7dd","NodePool":{"name":"celery-worker-import-export"},"NodeClaim":{"name":"celery-worker-import-export-lwkbn"},"requests":{"cpu":"2080m","memory":"2268Mi","pods":"8"},"instance-types":"c5d.2xlarge, c5d.xlarge, c7i.2xlarge, c7i.xlarge, m5.2xlarge and 7 other(s)"}

{"level":"INFO","time":"2024-09-10T21:29:16.310Z","logger":"controller","message":"launched nodeclaim","commit":"62a726c","controller":"nodeclaim.lifecycle","controllerGroup":"karpenter.sh","controllerKind":"NodeClaim","NodeClaim":{"name":"celery-worker-import-export-lwkbn"},"namespace":"","name":"celery-worker-import-export-lwkbn","reconcileID":"871984fc-8cf6-42fa-a517-14ce142519f0","provider-id":"aws:///us-east-1b/i-073e927ea2656b7b8","instance-type":"t3.xlarge","zone":"us-east-1b","capacity-type":"spot","allocatable":{"cpu":"3920m","ephemeral-storage":"44Gi","memory":"14162Mi","pods":"58"}}
{"level":"INFO","time":"2024-09-10T21:29:48.223Z","logger":"controller","message":"registered nodeclaim","commit":"62a726c","controller":"nodeclaim.lifecycle","controllerGroup":"karpenter.sh","controllerKind":"NodeClaim","NodeClaim":{"name":"celery-worker-import-export-lwkbn"},"namespace":"","name":"celery-worker-import-export-lwkbn","reconcileID":"a128cdc2-08d0-41ee-9f23-21cf16700c57","provider-id":"aws:///us-east-1b/i-073e927ea2656b7b8","Node":{"name":"ip-10-120-162-178.ec2.internal"}}

So Karpenter allocated a more appropriately sized node, a t3.xlarge, instead of the t3.2xlarge I had before.

Versions:

leoryu commented 1 month ago

Possibly a duplicate of https://github.com/kubernetes-sigs/karpenter/issues/1645

jonathan-innis commented 1 month ago

Responded here, but this is expected with spot-to-spot consolidation. The CloudProvider takes in all of the spot possibilities on the initial launch and gives you back the instance type that it thinks is optimal at that point in time. If you are using AWS, this means CreateFleet determines which instance type has the best cost/availability combination and launches that type for you. That may not be the cheapest instance type, which is what you are seeing here; in this case, though, the chosen type is close to the bottom of the price range.

Fleet most likely didn't choose those bottom 6 instance types because they had a much higher chance of being interrupted than the instance type it placed you on.
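
For reference, the behavior described here is part of Karpenter's opt-in spot-to-spot consolidation feature, which only runs when its feature gate is enabled (it clearly is in this cluster, since the event fires). A sketch of the Helm values that toggle it, assuming the standard karpenter chart's settings.featureGates layout:

# Helm values sketch for the karpenter chart (key layout assumed from the public chart);
# spot-to-spot consolidation only runs when this feature gate is on.
settings:
  featureGates:
    spotToSpotConsolidation: true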

jonathan-innis commented 1 month ago

/triage accepted