kubernetes-sigs / karpenter

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.
Apache License 2.0

Karpenter takes sub-optimal decisions from a pricing point of view #1149

Open shubhamrai1993 opened 6 months ago

shubhamrai1993 commented 6 months ago

Description

Observed Behavior: I scheduled a deployment with two pods, each requiring a single A10G GPU (g5 instance family), on AWS in a single zone. Karpenter created one g5.12xlarge with 4 A10G GPUs.

Expected Behavior: Karpenter should have created two separate machines with one A10G GPU each, i.e. two g5.2xlarge instances, given the cost difference, which comes to around $2,338.56 per month in this case.
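
(For a rough reconstruction of that figure, assuming illustrative on-demand rates of $5.672/hr for g5.12xlarge and $1.212/hr for g5.2xlarge and a 720-hour month: (5.672 - 2 × 1.212) × 720 = $2,338.56. Actual eu-west-1 rates will differ somewhat.)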

Reproduction Steps (Please include YAML): deployment.yaml -

apiVersion: apps/v1
kind: Deployment
metadata:
  name: test-g5
spec:
  replicas: 0
  selector:
    matchLabels:
      truefoundry.com/component: test-g5
  template:
    metadata:
      labels:
        truefoundry.com/component: test-g5
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: topology.kubernetes.io/zone
                    operator: In
                    values:
                      - eu-west-1a
                  - key: karpenter.k8s.aws/instance-family
                    operator: In
                    values:
                      - g5
                  - key: karpenter.sh/capacity-type
                    operator: In
                    values:
                      - on-demand
      containers:
        - image: testcontainers/helloworld
          imagePullPolicy: IfNotPresent
          name: test-g5
          resources:
            limits:
              cpu: 8
              ephemeral-storage: 100000M
              memory: 30000M
              nvidia.com/gpu: 1
            requests:
              cpu: 7
              ephemeral-storage: 20000M
              memory: 25000M
      tolerations:
        - effect: NoSchedule
          key: nvidia.com/gpu
          operator: Exists
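
Since the manifest above starts at zero replicas, reproducing this presumably means applying it and then scaling the deployment to two pods, e.g.:

kubectl apply -f deployment.yaml
kubectl scale deployment test-g5 --replicas=2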

Versions:

shubhamrai1993 commented 6 months ago

Triage: going through the code, specifically this part (link), the algorithm seems to be: look at existing nodes first, then at the nodes about to be created, and only create a new node if neither works.

shubhamrai1993 commented 6 months ago

I'm looking into the issue and will raise a proposed PR shortly.

njtran commented 6 months ago

Can you share your NodePools? Can you also share your logs?

shubhamrai1993 commented 6 months ago

@njtran This is the node pool -

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: gpu-nodepool
spec:
  disruption:
    budgets:
    - nodes: 10%
    consolidateAfter: 300s
    consolidationPolicy: WhenEmpty
    expireAfter: 168h
  template:
    metadata:
      labels:
        nvidia.com/gpu.deploy.operands: "true"
    spec:
      nodeClassRef:
        apiVersion: karpenter.k8s.aws/v1beta1
        kind: EC2NodeClass
        name: gpu-default
      requirements:
      - key: karpenter.sh/capacity-type
        operator: In
        values:
        - spot
        - on-demand
      - key: topology.kubernetes.io/zone
        operator: In
        values:
        - eu-west-1a
        - eu-west-1b
        - eu-west-1c
      - key: kubernetes.io/arch
        operator: In
        values:
        - amd64
      - key: karpenter.k8s.aws/instance-family
        operator: In
        values:
        - p2
        - p3
        - g4dn
        - g5
        - p4d
        - p4de
      - key: karpenter.k8s.aws/instance-size
        operator: NotIn
        values:
        - nano
        - micro
        - metal
      taints:
      - effect: NoSchedule
        key: nvidia.com/gpu
  weight: 10

These are the relevant logs -

{"level":"INFO","time":"2024-04-05T12:22:46.848Z","logger":"controller.provisioner","message":"found provisionable pod(s)","commit":"17dd42b","pods":"shub-ws/test-g5-66d4744f96-89sb8, shub-ws/test-g5-66d4744f96-bhpnh","duration":"1.548947822s"}
{"level":"INFO","time":"2024-04-05T12:22:51.794Z","logger":"controller.nodeclaim.lifecycle","message":"initialized nodeclaim","commit":"17dd42b","nodeclaim":"gpu-nodepool-tsll8","provider-id":"aws:///eu-west-1a/i-0202813a4076f1b31","node":"ip-10-2-20-175.eu-west-1.compute.internal","allocatable":{"cpu":"47810m","ephemeral-storage":"143870061124","hugepages-1Gi":"0","hugepages-2Mi":"0","memory":"187109032Ki","nvidia.com/gpu":"4","pods":"737"}}
{"level":"INFO","time":"2024-04-05T12:23:02.393Z","logger":"controller.provisioner","message":"found provisionable pod(s)","commit":"17dd42b","pods":"shub-ws/test-g5-66d4744f96-89sb8, shub-ws/test-g5-66d4744f96-bhpnh","duration":"7.041627143s"}
{"level":"INFO","time":"2024-04-05T12:23:02.393Z","logger":"controller.provisioner","message":"computed new nodeclaim(s) to fit pod(s)","commit":"17dd42b","nodeclaims":1,"pods":2}
{"level":"INFO","time":"2024-04-05T12:23:02.686Z","logger":"controller.provisioner","message":"created nodeclaim","commit":"17dd42b","nodepool":"gpu-nodepool","nodeclaim":"gpu-nodepool-6h4qd","requests":{"cpu":"14525m","ephemeral-storage":"39324644Ki","memory":"49456861Ki","nvidia.com/gpu":"2","pods":"10"},"instance-types":"g5.12xlarge, g5.24xlarge, g5.48xlarge"}
{"level":"INFO","time":"2024-04-05T12:23:07.594Z","logger":"controller.nodeclaim.lifecycle","message":"launched nodeclaim","commit":"17dd42b","nodeclaim":"gpu-nodepool-6h4qd","provider-id":"aws:///eu-west-1a/i-0015a5b2578cc50be","instance-type":"g5.12xlarge","zone":"eu-west-1a","capacity-type":"on-demand","allocatable":{"cpu":"47810m","ephemeral-storage":"134Gi","memory":"173400Mi","nvidia.com/gpu":"4","pods":"737","vpc.amazonaws.com/pod-eni":"107"}}
njtran commented 3 months ago

/kind feature

njtran commented 3 months ago

Hey @shubhamrai1993, sorry for the extensive delay on the response here. You're right that our current algorithm does:

  1. look at the existing buckets (nodes)
  2. look at the buckets we think we're about to create
  3. create a new bucket if steps 1 and 2 weren't possible

This is a core part of the First Fit Decreasing bin-packing algorithm, and what you're suggesting is definitely a possible extension/modification to it. I'd want to explore this in an RFC so we can understand how this cost check changes the assumptions of our current algorithm, which short-circuits as soon as we find a match. We try our best to be right, but bin-packing is an NP-hard problem, and we can't guarantee we're perfectly optimal, as you're seeing now. Each modification can have an impact on other cases.
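
To make the short-circuit concrete, here is a minimal, self-contained Go sketch of First Fit Decreasing over "buckets" of candidate instance types. This is not Karpenter's actual code; all type names and prices are hypothetical. The point it illustrates: the first bucket that can still absorb a pod wins, with no price comparison against opening a new bucket, which is exactly how two 1-GPU pods can land on one g5.12xlarge instead of two g5.2xlarge.

package main

import (
	"fmt"
	"sort"
)

// Hypothetical, simplified stand-ins for Karpenter's real types.
type Resources struct{ CPU, GPU int }

type Instance struct {
	Type  string
	Cap   Resources
	Price float64 // $/hr, illustrative
}

// A Bucket models an in-flight nodeclaim: the combined pod requests plus
// the instance types that can still satisfy them.
type Bucket struct {
	Demand Resources
	Types  []Instance
}

func fits(c, d Resources) bool { return c.CPU >= d.CPU && c.GPU >= d.GPU }

// feasible filters an instance list down to those that can hold a demand.
func feasible(types []Instance, d Resources) []Instance {
	var out []Instance
	for _, t := range types {
		if fits(t.Cap, d) {
			out = append(out, t)
		}
	}
	return out
}

// packFFD mirrors the three steps above: try the buckets we are about to
// create first, and only open a new bucket when nothing fits. Note the
// short-circuit: no price check is made against opening a fresh bucket.
func packFFD(podReqs []Resources, catalog []Instance) []*Bucket {
	sort.Slice(podReqs, func(i, j int) bool { return podReqs[i].GPU > podReqs[j].GPU })
	var buckets []*Bucket
pods:
	for _, req := range podReqs {
		for _, b := range buckets {
			grown := Resources{CPU: b.Demand.CPU + req.CPU, GPU: b.Demand.GPU + req.GPU}
			if types := feasible(b.Types, grown); len(types) > 0 {
				b.Demand, b.Types = grown, types // first fit wins
				continue pods
			}
		}
		buckets = append(buckets, &Bucket{Demand: req, Types: feasible(catalog, req)})
	}
	return buckets
}

// cheapest returns the lowest-priced instance type still feasible for a bucket.
func cheapest(b *Bucket) Instance {
	best := b.Types[0]
	for _, t := range b.Types[1:] {
		if t.Price < best.Price {
			best = t
		}
	}
	return best
}

func main() {
	catalog := []Instance{
		{"g5.2xlarge", Resources{CPU: 8, GPU: 1}, 1.212},
		{"g5.12xlarge", Resources{CPU: 48, GPU: 4}, 5.672},
	}
	// Two pods, each requesting 7 CPU and 1 GPU, as in the repro above.
	pods := []Resources{{CPU: 7, GPU: 1}, {CPU: 7, GPU: 1}}

	total := 0.0
	for _, b := range packFFD(pods, catalog) {
		inst := cheapest(b)
		fmt.Printf("bucket -> %s ($%.3f/hr)\n", inst.Type, inst.Price)
		total += inst.Price
	}
	// Prints a single g5.12xlarge at $5.672/hr; packing each pod into its
	// own bucket would have yielded two g5.2xlarge at $2.424/hr total.
	fmt.Printf("total: $%.3f/hr\n", total)
}

A cost-aware variant, along the lines of what this issue proposes, would compare the price of growing an existing bucket against the price of opening a new one before committing to the first fit.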

k8s-triage-robot commented 1 week ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  - After 90d of inactivity, lifecycle/stale is applied
  - After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  - After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  - Mark this issue as fresh with /remove-lifecycle stale
  - Close this issue with /close
  - Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale