kubernetes-sigs / karpenter

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.
Apache License 2.0

karpenter doesn't scale out when using time-slicing for GPUs #729

Open vickywh opened 1 year ago

vickywh commented 1 year ago

Tell us about your request

Karpenter should respect time-slicing configuration when provisioning GPU instances.

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?

When you use time-slicing with the NVIDIA/k8s-device-plugin, Karpenter can fail to provision new instances.

As per the documentation here: https://github.com/NVIDIA/k8s-device-plugin#shared-access-to-gpus-with-cuda-time-slicing

To implement time-slicing, you would configure:

```yaml
version: v1
sharing:
  timeSlicing:
    renameByDefault: <bool>
    failRequestsGreaterThanOne: <bool>
    resources:
    - name: <resource-name>
      replicas: <num-replicas>
```

The outcome of this is that if you ran `kubectl describe node`, you'd see:

```
Capacity:
  nvidia.com/gpu: <num-replicas>
```

Karpenter respects that when scheduling pods; for example, 6 pods that each request 4 GPUs will be placed onto a node when num-replicas is 24.
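Concretely, with num-replicas set to 24 (illustrative excerpt of `kubectl describe node`):

```
Capacity:
  nvidia.com/gpu: 24
```

so six pods requesting 4 GPUs each (6 × 4 = 24) bin-pack onto that single node.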

However, when it comes to scaling, the following error occurs:

{"memory":"4Gi","nvidia.com/gpu":"4","pods":"1"},

because the deployment is requesting

```yaml
      resources:
        limits:
          nvidia.com/gpu: '4'
        requests:
          memory: 4Gi
          nvidia.com/gpu: '4'
```

but the instance types in the Provisioner only have 1 GPU:

```yaml
spec:
  requirements:
```

When evaluating which instance type to provision, Karpenter should evaluate the number of GPUs advertised via <num-replicas>, not the number of physical GPUs the instance type/size has, and therefore provision a new instance in this example.

Are you currently working around this issue?

No

Additional Context

No response

Attachments

No response


sftim commented 1 year ago

BTW, fractional resource allocation for device plugins is not part of Kubernetes' API.

Extended resources are only supported as integer resources and cannot be overcommitted.

https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/

The future way to provide that kind of behavior might be dynamic resource allocation.

I recommend being wary of supporting any vendor bodges that don't play well with Kubernetes' supported APIs. Instead, let's encourage the device vendors to support improvements either to device plugins or to dynamic allocation.

otterley commented 1 year ago

The NVIDIA device plugin already makes an integral value available for resource allocation called nvidia.com/gpu.shared. This value is based on the value of nvidia.com/gpu multiplied by the device plugin configuration's value for replicas. For example, if the value of nvidia.com/gpu is 8 and the value of replicas is 5, then the value of nvidia.com/gpu.shared will be equal to 40. See, e.g., https://developer.nvidia.com/blog/improving-gpu-utilization-in-kubernetes/
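For reference, a device plugin configuration along these lines (a sketch following the NVIDIA documentation linked above; the values are illustrative) is what produces that renamed, multiplied resource:

```yaml
version: v1
sharing:
  timeSlicing:
    renameByDefault: true              # advertise the sliced resource as nvidia.com/gpu.shared
    failRequestsGreaterThanOne: true   # reject pods that request more than one sliced GPU
    resources:
    - name: nvidia.com/gpu
      replicas: 5                      # an 8-GPU node then advertises 8 * 5 = 40 nvidia.com/gpu.shared
```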

The challenge here is that Karpenter only knows about the physical resources provided by a given instance type. It cannot currently determine a priori how many nvidia.com/gpu.shared resources a node will have, because the value depends on a runtime configuration made by the administrator. The value might not even be consistent among different nodes. And dynamic resource allocation won't solve the problem of how to choose what kind of instance type Karpenter should launch.

A few ideas I can think of to (crudely) solve this issue are:

  1. Configure Karpenter with a multiplier value that is the same value provided to the device plugin, so that whenever a pod requests nvidia.com/gpu.shared, each instance type's whole-GPU count is multiplied by the multiplier and used to estimate that instance's nvidia.com/gpu.shared capacity for launch purposes (a rough sketch follows this list).

  2. Configure Karpenter to read the NVIDIA device plugin configuration and determine the multiplier value itself.
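To make idea 1 concrete, here is a rough, purely hypothetical sketch of a Provisioner-level setting; the extendedResourceMultipliers field and its sub-fields are invented for illustration and do not exist in Karpenter's API:

```yaml
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: gpu-shared
spec:
  # Hypothetical field: Karpenter would estimate each instance type's
  # nvidia.com/gpu.shared capacity by multiplying its physical nvidia.com/gpu
  # count by `multiplier`. The operator would have to keep this in sync with
  # the `replicas` value in the device plugin configuration.
  extendedResourceMultipliers:
    - name: nvidia.com/gpu.shared
      base: nvidia.com/gpu
      multiplier: 5
  requirements:
    - key: karpenter.k8s.aws/instance-gpu-manufacturer
      operator: In
      values: ["nvidia"]
```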

Other ideas welcome!

jonathan-innis commented 1 year ago

Configure Karpenter to read the NVIDIA device plugin configuration and determine the multiplier value itself

This is something that we've thought about a bit. The problem seems tough for a couple of reasons:

  1. We want to surface the ability for Karpenter to understand extended resources generally, so that we support not just individual use cases like this one but the whole array of extended resource plugins that users rely on in the ecosystem.
  2. I've looked into GPU time-slicing briefly, and it seems like each GPU manufacturer and plugin has its own methodology for surfacing time-slicing. Some do what NVIDIA does (configure the slicing through a ConfigMap and use that to surface the "shared" resource through the plugin); others seem to configure it through the naming of the resource directly. We definitely need a fully fleshed-out one-pager that describes the most common ways users are doing time-slicing, so we can make sure whatever solution we come up with meets all of the major uses.

It would be really nice if there was a streamlined upstream opinionated stance on how to do GPU time-slicing so that all GPU plugins had a unified vision for how this should be achieved.

madisonb commented 1 year ago

What is the current behavior here when only asking for 1 GPU resource? @vickywh your example above asks for 4 GPU resources when your instance types are explicitly 1-GPU machines. I can see why that would cause the errors you've described above.

My use case is more basic time slicing where I will never have a workload with more than 1 GPU requested (sliced or not).

Without testing this myself, I think the behavior Karpenter would provide in an example where my cluster requests 8 GPUs is that it would first spin up an 8-GPU physical node. If that physical node is sliced to 10 per GPU (80 "total"), my workload would greatly underutilize the node, and there would probably be thrash to consolidate other resources onto it to better bin-pack it. Scaling out would overprovision considerably if I have a high slice number, but perhaps if my workload fluctuates over time the consolidation steps would help bring things back down.

I am curious if anyone has tried this specific use case and if the scaling up thrash is the only potential downside.

EDIT: I think this will still be cheaper in the long run, because the next batch of GPU requests will simply be scheduled on the overprovisioned GPU node. Right now, all my workloads are using 1 physical GPU, so once the node spins up, becomes sliced, and other GPU requests are allocated to it, I'm saving money.
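For context, the request shape this workaround relies on is each pod asking for a single (possibly sliced) GPU, so Karpenter's physical-GPU-based instance selection still finds a fitting instance type. An illustrative snippet (the memory value is arbitrary):

```yaml
resources:
  limits:
    nvidia.com/gpu: '1'    # never more than one GPU per pod, sliced or not
  requests:
    memory: 4Gi
    nvidia.com/gpu: '1'
```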

madisonb commented 1 year ago

My initial testing yields positive results, in line with what I outlined above.

Scenario:

Outcome:

Some caveats I have seen so far:

When you have everything tuned nicely, the only downside appears to be the small bit of thrash during provisioning; otherwise it works as advertised. While I understand this is not nearly a perfect solution, it shows that an implementation can work right now in the right scenarios.

k8s-triage-robot commented 5 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

jonathan-innis commented 5 months ago

/remove-lifecycle stale

k8s-triage-robot commented 2 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

Bryce-Soghigian commented 2 months ago

/remove-lifecycle stale