kubernetes-sigs / karpenter

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.
Apache License 2.0

karpenter doesn't scale out when using time-slicing for GPUs #729

Open vickywh opened 1 year ago

vickywh commented 1 year ago

Tell us about your request

Karpenter should respect time-slicing configuration when provisioning GPU instances.

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?

When you use time-slicing with the NVIDIA/k8s-device-plugin, Karpenter can fail to provision new instances.

As per the documentation here: https://github.com/NVIDIA/k8s-device-plugin#shared-access-to-gpus-with-cuda-time-slicing

To implement time-slicing, you would configure:

```yaml
version: v1
sharing:
  timeSlicing:
    renameByDefault: <bool>
    failRequestsGreaterThanOne: <bool>
    resources:
    - name: <resource-name>
      replicas: <num-replicas>
```

The outcome of this is that if you ran `kubectl describe node`, you'd see:

```
Capacity:
  nvidia.com/gpu: <num-replicas>
```

Karpenter respects that when scheduling pods; for example, 6 pods that each request 4 GPUs will be placed onto a node when num-replicas is 24.
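Concretely, with num-replicas set to 24 (illustrative excerpt of `kubectl describe node`):

```
Capacity:
  nvidia.com/gpu: 24
```

so six pods requesting 4 GPUs each (6 × 4 = 24) bin-pack onto that single node.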

However, when it comes to scaling, the following error occurs:

{"memory":"4Gi","nvidia.com/gpu":"4","pods":"1"},

because the deployment is requesting

```yaml
      resources:
        limits:
          nvidia.com/gpu: '4'
        requests:
          memory: 4Gi
          nvidia.com/gpu: '4'
```

but the instance types in the Provisioner only have 1 GPU:

```yaml
spec:
  requirements:
```

When evaluating which instance type to provision, Karpenter should evaluate the number of GPUs advertised via <num-replicas>, not the number of physical GPUs the instance type/size has, and therefore provision a new instance in this example.

Are you currently working around this issue?

No

Additional Context

No response

Attachments

No response


sftim commented 1 year ago

BTW, fractional resource allocation for device plugins is not part of Kubernetes' API.

Extended resources are only supported as integer resources and cannot be overcommitted.

https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/

The future way to provide that kind of behavior might be dynamic resource allocation.

I recommend being wary of supporting any vendor bodges that don't play well with Kubernetes' supported APIs. Instead, let's encourage the device vendors to support improvements either to device plugins or to dynamic allocation.

otterley commented 1 year ago

The NVIDIA device plugin already makes an integral value available for resource allocation called nvidia.com/gpu.shared. This value is based on the value of nvidia.com/gpu multiplied by the device plugin configuration's value for replicas. For example, if the value of nvidia.com/gpu is 8 and the value of replicas is 5, then the value of nvidia.com/gpu.shared will be equal to 40. See, e.g., https://developer.nvidia.com/blog/improving-gpu-utilization-in-kubernetes/
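For reference, a device plugin configuration along these lines (a sketch following the NVIDIA documentation linked above; the values are illustrative) is what produces that renamed, multiplied resource:

```yaml
version: v1
sharing:
  timeSlicing:
    renameByDefault: true              # advertise the sliced resource as nvidia.com/gpu.shared
    failRequestsGreaterThanOne: true   # reject pods that request more than one sliced GPU
    resources:
    - name: nvidia.com/gpu
      replicas: 5                      # an 8-GPU node then advertises 8 * 5 = 40 nvidia.com/gpu.shared
```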

The challenge here is that Karpenter only knows about the physical resources provided by a given instance type. It cannot currently determine a priori how many nvidia.com/gpu.shared resources a node will have, because the value depends on a runtime configuration made by the administrator. The value might not even be consistent among different nodes. And dynamic resource allocation won't solve the problem of how to choose what kind of instance type Karpenter should launch.

A few ideas I can think of to (crudely) solve this issue are:

  1. Configure Karpenter with a multiplier value that is the same value provided to the device plugin, so that whenever a pod requests nvidia.com/gpu.shared, each instance type's whole-GPU count is multiplied by the multiplier and used to estimate that instance's nvidia.com/gpu.shared capacity for launch purposes (a rough sketch follows this list).

  2. Configure Karpenter to read the NVIDIA device plugin configuration and determine the multiplier value itself.
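To make idea 1 concrete, here is a rough, purely hypothetical sketch of a Provisioner-level setting; the extendedResourceMultipliers field and its sub-fields are invented for illustration and do not exist in Karpenter's API:

```yaml
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: gpu-shared
spec:
  # Hypothetical field: Karpenter would estimate each instance type's
  # nvidia.com/gpu.shared capacity by multiplying its physical nvidia.com/gpu
  # count by `multiplier`. The operator would have to keep this in sync with
  # the `replicas` value in the device plugin configuration.
  extendedResourceMultipliers:
    - name: nvidia.com/gpu.shared
      base: nvidia.com/gpu
      multiplier: 5
  requirements:
    - key: karpenter.k8s.aws/instance-gpu-manufacturer
      operator: In
      values: ["nvidia"]
```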

Other ideas welcome!

jonathan-innis commented 1 year ago

Configure Karpenter to read the NVIDIA device plugin configuration and determine the multiplier value itself

This is something that we've thought about a bit. The problem seems tough for a couple of reasons:

  1. We want to surface the ability for Karpenter to understand extended resources generally, so that we support not just individual use cases like this one but the whole array of extended resource plugins that users rely on in the ecosystem.
  2. I've looked into GPU time-slicing briefly, and it seems like each GPU manufacturer and plugin has its own methodology for surfacing time-slicing. Some do what NVIDIA does (configure the slicing through a ConfigMap and use that to surface the "shared" resource through the plugin); others seem to configure it through the naming of the resource directly. We definitely need a fully fleshed-out one-pager that describes the most common ways users are doing time-slicing, so we can make sure whatever solution we come up with meets all of the major uses.

It would be really nice if there was a streamlined upstream opinionated stance on how to do GPU time-slicing so that all GPU plugins had a unified vision for how this should be achieved.

madisonb commented 1 year ago

What is the current behavior here when only asking for 1 GPU resource? @vickywh your example above asks for 4 GPU resources when your instance types are explicitly 1-GPU machines. I can see why that would cause the errors you've described above.

My use case is more basic time slicing where I will never have a workload with more than 1 GPU requested (sliced or not).

Without testing this myself, I think the behavior Karpenter would provide in an example where my cluster requests 8 GPUs is that it would first spin up an 8-GPU physical node. If that physical node is sliced to 10 per GPU (80 "total"), my workload would greatly underutilize the node, and there would probably be thrash to consolidate other resources onto it to better bin-pack it. Scaling out would overprovision considerably if I have a high slice number, but perhaps if my workload fluctuates over time the consolidation steps would help bring things back down.

I am curious if anyone has tried this specific use case and if the scaling up thrash is the only potential downside.

EDIT: I think this will still be cheaper in the long run, because the next batch of GPU requests will simply be scheduled on the overprovisioned GPU node. Right now, all my workloads are using 1 physical GPU, so once the node spins up, becomes sliced, and other GPU requests are allocated to it, I'm saving money.
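For context, the request shape this workaround relies on is each pod asking for a single (possibly sliced) GPU, so Karpenter's physical-GPU-based instance selection still finds a fitting instance type. An illustrative snippet (the memory value is arbitrary):

```yaml
resources:
  limits:
    nvidia.com/gpu: '1'    # never more than one GPU per pod, sliced or not
  requests:
    memory: 4Gi
    nvidia.com/gpu: '1'
```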

madisonb commented 1 year ago

My initial testing yields positive results, in line with what I outlined above.

Scenario:

Outcome:

Some caveats I have seen so far:

When you have everything tuned nicely, the only downside appears to be the small bit of thrash during provisioning; otherwise it works as advertised. While I understand this is not nearly a perfect solution, it shows that an implementation can work right now in the right scenarios.

k8s-triage-robot commented 5 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

jonathan-innis commented 5 months ago

/remove-lifecycle stale

k8s-triage-robot commented 2 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

Bryce-Soghigian commented 2 months ago

/remove-lifecycle stale