kubernetes-sigs / karpenter

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.
Apache License 2.0

Karpenter takes sub-optimal decisions from a pricing point of view #1149

Open shubhamrai1993 opened 6 months ago

shubhamrai1993 commented 6 months ago

Description

Observed Behavior: I scheduled a deployment with two pods, each requiring a single A10G GPU (g5 instance family), on AWS in a single zone. Karpenter created one g5.12xlarge with 4 A10G GPUs.

Expected Behavior: Karpenter should have created two separate machines with one A10G GPU each, i.e. two g5.2xlarge instances, given the cost difference, which comes to around $2,338.56 per month in this case.
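
(For a rough reconstruction of that figure, assuming illustrative on-demand rates of $5.672/hr for g5.12xlarge and $1.212/hr for g5.2xlarge and a 720-hour month: (5.672 - 2 × 1.212) × 720 = $2,338.56. Actual eu-west-1 rates will differ somewhat.)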

Reproduction Steps (Please include YAML): deployment.yaml -

apiVersion: apps/v1
kind: Deployment
metadata:
  name: test-g5
spec:
  replicas: 0
  selector:
    matchLabels:
      truefoundry.com/component: test-g5
  template:
    metadata:
      labels:
        truefoundry.com/component: test-g5
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: topology.kubernetes.io/zone
                    operator: In
                    values:
                      - eu-west-1a
                  - key: karpenter.k8s.aws/instance-family
                    operator: In
                    values:
                      - g5
                  - key: karpenter.sh/capacity-type
                    operator: In
                    values:
                      - on-demand
      containers:
        - image: testcontainers/helloworld
          imagePullPolicy: IfNotPresent
          name: test-g5
          resources:
            limits:
              cpu: 8
              ephemeral-storage: 100000M
              memory: 30000M
              nvidia.com/gpu: 1
            requests:
              cpu: 7
              ephemeral-storage: 20000M
              memory: 25000M
      tolerations:
        - effect: NoSchedule
          key: nvidia.com/gpu
          operator: Exists
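
Since the manifest above starts at zero replicas, reproducing this presumably means applying it and then scaling the deployment to two pods, e.g.:

kubectl apply -f deployment.yaml
kubectl scale deployment test-g5 --replicas=2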

Versions:

shubhamrai1993 commented 6 months ago

Triage: going through the code, specifically this part (link), the algorithm seems to be: look at existing nodes first, then at the nodes about to be created, and only create a new node if neither works.

shubhamrai1993 commented 6 months ago

I'm looking into the issue and will raise a proposed PR shortly.

njtran commented 6 months ago

Can you share your NodePools? Can you also share your logs?

shubhamrai1993 commented 6 months ago

@njtran This is the node pool -

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: gpu-nodepool
spec:
  disruption:
    budgets:
    - nodes: 10%
    consolidateAfter: 300s
    consolidationPolicy: WhenEmpty
    expireAfter: 168h
  template:
    metadata:
      labels:
        nvidia.com/gpu.deploy.operands: "true"
    spec:
      nodeClassRef:
        apiVersion: karpenter.k8s.aws/v1beta1
        kind: EC2NodeClass
        name: gpu-default
      requirements:
      - key: karpenter.sh/capacity-type
        operator: In
        values:
        - spot
        - on-demand
      - key: topology.kubernetes.io/zone
        operator: In
        values:
        - eu-west-1a
        - eu-west-1b
        - eu-west-1c
      - key: kubernetes.io/arch
        operator: In
        values:
        - amd64
      - key: karpenter.k8s.aws/instance-family
        operator: In
        values:
        - p2
        - p3
        - g4dn
        - g5
        - p4d
        - p4de
      - key: karpenter.k8s.aws/instance-size
        operator: NotIn
        values:
        - nano
        - micro
        - metal
      taints:
      - effect: NoSchedule
        key: nvidia.com/gpu
  weight: 10

These are the relevant logs -

{"level":"INFO","time":"2024-04-05T12:22:46.848Z","logger":"controller.provisioner","message":"found provisionable pod(s)","commit":"17dd42b","pods":"shub-ws/test-g5-66d4744f96-89sb8, shub-ws/test-g5-66d4744f96-bhpnh","duration":"1.548947822s"}
{"level":"INFO","time":"2024-04-05T12:22:51.794Z","logger":"controller.nodeclaim.lifecycle","message":"initialized nodeclaim","commit":"17dd42b","nodeclaim":"gpu-nodepool-tsll8","provider-id":"aws:///eu-west-1a/i-0202813a4076f1b31","node":"ip-10-2-20-175.eu-west-1.compute.internal","allocatable":{"cpu":"47810m","ephemeral-storage":"143870061124","hugepages-1Gi":"0","hugepages-2Mi":"0","memory":"187109032Ki","nvidia.com/gpu":"4","pods":"737"}}
{"level":"INFO","time":"2024-04-05T12:23:02.393Z","logger":"controller.provisioner","message":"found provisionable pod(s)","commit":"17dd42b","pods":"shub-ws/test-g5-66d4744f96-89sb8, shub-ws/test-g5-66d4744f96-bhpnh","duration":"7.041627143s"}
{"level":"INFO","time":"2024-04-05T12:23:02.393Z","logger":"controller.provisioner","message":"computed new nodeclaim(s) to fit pod(s)","commit":"17dd42b","nodeclaims":1,"pods":2}
{"level":"INFO","time":"2024-04-05T12:23:02.686Z","logger":"controller.provisioner","message":"created nodeclaim","commit":"17dd42b","nodepool":"gpu-nodepool","nodeclaim":"gpu-nodepool-6h4qd","requests":{"cpu":"14525m","ephemeral-storage":"39324644Ki","memory":"49456861Ki","nvidia.com/gpu":"2","pods":"10"},"instance-types":"g5.12xlarge, g5.24xlarge, g5.48xlarge"}
{"level":"INFO","time":"2024-04-05T12:23:07.594Z","logger":"controller.nodeclaim.lifecycle","message":"launched nodeclaim","commit":"17dd42b","nodeclaim":"gpu-nodepool-6h4qd","provider-id":"aws:///eu-west-1a/i-0015a5b2578cc50be","instance-type":"g5.12xlarge","zone":"eu-west-1a","capacity-type":"on-demand","allocatable":{"cpu":"47810m","ephemeral-storage":"134Gi","memory":"173400Mi","nvidia.com/gpu":"4","pods":"737","vpc.amazonaws.com/pod-eni":"107"}}
njtran commented 3 months ago

/kind feature

njtran commented 3 months ago

Hey @shubhamrai1993, sorry for the extensive delay on the response here. You're right that our current algorithm does:

  1. look at the existing buckets (nodes)
  2. look at the buckets we think we're about to create
  3. create a new bucket if steps 1 and 2 weren't possible

This is a core part of the First Fit Decreasing bin-packing algorithm, and what you're suggesting is definitely a possible extension/modification to it. I'd want to explore this in an RFC so we can understand how this cost check changes the assumptions of our current algorithm, which short-circuits as soon as we find a match. We try our best to be right, but bin-packing is an NP-hard problem, and we can't guarantee we're perfectly optimal, as you're seeing now. Each modification can have an impact on other cases.
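
To make the short-circuit concrete, here is a minimal, self-contained Go sketch of First Fit Decreasing over "buckets" of candidate instance types. This is not Karpenter's actual code; all type names and prices are hypothetical. The point it illustrates: the first bucket that can still absorb a pod wins, with no price comparison against opening a new bucket, which is exactly how two 1-GPU pods can land on one g5.12xlarge instead of two g5.2xlarge.

package main

import (
	"fmt"
	"sort"
)

// Hypothetical, simplified stand-ins for Karpenter's real types.
type Resources struct{ CPU, GPU int }

type Instance struct {
	Type  string
	Cap   Resources
	Price float64 // $/hr, illustrative
}

// A Bucket models an in-flight nodeclaim: the combined pod requests plus
// the instance types that can still satisfy them.
type Bucket struct {
	Demand Resources
	Types  []Instance
}

func fits(c, d Resources) bool { return c.CPU >= d.CPU && c.GPU >= d.GPU }

// feasible filters an instance list down to those that can hold a demand.
func feasible(types []Instance, d Resources) []Instance {
	var out []Instance
	for _, t := range types {
		if fits(t.Cap, d) {
			out = append(out, t)
		}
	}
	return out
}

// packFFD mirrors the three steps above: try the buckets we are about to
// create first, and only open a new bucket when nothing fits. Note the
// short-circuit: no price check is made against opening a fresh bucket.
func packFFD(podReqs []Resources, catalog []Instance) []*Bucket {
	sort.Slice(podReqs, func(i, j int) bool { return podReqs[i].GPU > podReqs[j].GPU })
	var buckets []*Bucket
pods:
	for _, req := range podReqs {
		for _, b := range buckets {
			grown := Resources{CPU: b.Demand.CPU + req.CPU, GPU: b.Demand.GPU + req.GPU}
			if types := feasible(b.Types, grown); len(types) > 0 {
				b.Demand, b.Types = grown, types // first fit wins
				continue pods
			}
		}
		buckets = append(buckets, &Bucket{Demand: req, Types: feasible(catalog, req)})
	}
	return buckets
}

// cheapest returns the lowest-priced instance type still feasible for a bucket.
func cheapest(b *Bucket) Instance {
	best := b.Types[0]
	for _, t := range b.Types[1:] {
		if t.Price < best.Price {
			best = t
		}
	}
	return best
}

func main() {
	catalog := []Instance{
		{"g5.2xlarge", Resources{CPU: 8, GPU: 1}, 1.212},
		{"g5.12xlarge", Resources{CPU: 48, GPU: 4}, 5.672},
	}
	// Two pods, each requesting 7 CPU and 1 GPU, as in the repro above.
	pods := []Resources{{CPU: 7, GPU: 1}, {CPU: 7, GPU: 1}}

	total := 0.0
	for _, b := range packFFD(pods, catalog) {
		inst := cheapest(b)
		fmt.Printf("bucket -> %s ($%.3f/hr)\n", inst.Type, inst.Price)
		total += inst.Price
	}
	// Prints a single g5.12xlarge at $5.672/hr; packing each pod into its
	// own bucket would have yielded two g5.2xlarge at $2.424/hr total.
	fmt.Printf("total: $%.3f/hr\n", total)
}

A cost-aware variant, along the lines of what this issue proposes, would compare the price of growing an existing bucket against the price of opening a new one before committing to the first fit.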

k8s-triage-robot commented 1 week ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  - After 90d of inactivity, lifecycle/stale is applied
  - After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  - After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  - Mark this issue as fresh with /remove-lifecycle stale
  - Close this issue with /close
  - Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale