vickywh opened 1 year ago
BTW, fractional resource allocation for device plugins is not part of Kubernetes' API.
Extended resources are only supported as integer resources and cannot be overcommitted.
—https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/
The future way to provide that kind of behavior might be dynamic resource allocation.
I recommend being wary of supporting any vendor bodges that don't play well with Kubernetes' supported APIs. Instead, let's encourage the device vendors to support improvements either to device plugins or to dynamic allocation.
The NVIDIA device plugin already makes an integral value available for resource allocation called `nvidia.com/gpu.shared`. This value is the value of `nvidia.com/gpu` multiplied by the device plugin configuration's value for `replicas`. For example, if the value of `nvidia.com/gpu` is 8 and the value of `replicas` is 5, then the value of `nvidia.com/gpu.shared` will be 40. See, e.g., https://developer.nvidia.com/blog/improving-gpu-utilization-in-kubernetes/
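The arithmetic above can be sketched as a small helper (a hedged illustration only; the function name is mine, not part of any Karpenter or NVIDIA API):

```python
def shared_gpu_capacity(physical_gpus: int, replicas: int) -> int:
    """Estimate a node's nvidia.com/gpu.shared capacity.

    The device plugin advertises one shared resource per physical GPU
    per configured replica, so the capacity is simply the product.
    """
    return physical_gpus * replicas

# The example from the text: 8 physical GPUs, each sliced 5 ways.
print(shared_gpu_capacity(8, 5))  # → 40
```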
The challenge here is that Karpenter only knows about the physical resources provided by a given instance type. It cannot currently determine a priori how many `nvidia.com/gpu.shared` resources a node will have, because the value depends on a runtime configuration made by the administrator. The value might not even be consistent among different nodes. And dynamic resource allocation won't solve the problem of how to choose what kind of instance type Karpenter should launch.
A few ideas I can think of to (crudely) solve this issue are:

1. Configure Karpenter with a multiplier value that matches the one provided to the device plugin, so that whenever a pod requests `nvidia.com/gpu.shared`, each instance's whole-GPU count is multiplied by the multiplier to estimate the node's `nvidia.com/gpu.shared` capacity for instance launch purposes.
2. Configure Karpenter to read the NVIDIA device plugin configuration and determine the multiplier value itself.
Other ideas welcome!
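The second idea above could be sketched roughly as follows. This is a hedged, hypothetical sketch: the function name is mine, and the configuration is shown as an already-parsed dict (to avoid a YAML dependency), with field names following the plugin's documented time-slicing schema.

```python
# Hypothetical: a parsed NVIDIA device plugin config with time-slicing enabled.
plugin_config = {
    "version": "v1",
    "sharing": {
        "timeSlicing": {
            "resources": [
                {"name": "nvidia.com/gpu", "replicas": 5},
            ]
        }
    },
}

def time_slicing_multiplier(config: dict, resource: str = "nvidia.com/gpu") -> int:
    """Return the replicas value configured for a resource, or 1 if unsliced."""
    resources = (
        config.get("sharing", {}).get("timeSlicing", {}).get("resources", [])
    )
    for entry in resources:
        if entry.get("name") == resource:
            return entry.get("replicas", 1)
    return 1

print(time_slicing_multiplier(plugin_config))  # → 5
```

A provisioner could then multiply each instance type's physical GPU count by this value when estimating shared capacity.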
Configure Karpenter to read the NVIDIA device plugin configuration and determine the multiplier value itself
This is something that we've thought about a bit. The problem seems tough for a couple of reasons:
It would be really nice if there were a streamlined, opinionated upstream stance on how to do GPU time-slicing, so that all GPU plugins had a unified vision for how this should be achieved.
What is the current behavior here when only asking for 1 GPU resource? @vickywh, your example above asks for 4 GPU resources when your instance types are explicitly single-GPU machines. I can see why that would cause the errors you've described above.
My use case is more basic time slicing where I will never have a workload with more than 1 GPU requested (sliced or not).
Without testing this myself, I think the behavior Karpenter would provide in an example where my cluster requests 8 GPUs is that Karpenter would first spin up an 8-GPU physical node. If that physical node is sliced to 10 (80 "total"), my workload would greatly under-utilize the node, and there would probably be thrash to consolidate other resources onto the node to better bin-pack it. Scaling out would over-provision considerably if I have a high slice number, but perhaps if my workload fluctuates over time the consolidation steps would help bring things back down.
I am curious if anyone has tried this specific use case and if the scaling up thrash is the only potential downside.
EDIT: I think this will still be cheaper in the long run, because the next batch of GPU requests will simply be scheduled on the overprovisioned GPU node. Right now, all my workloads are using 1 physical GPU, so once the node spins up, becomes sliced, and other GPU requests are allocated to it, I'm saving money.
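To put rough numbers on the under-utilization scenario described above (a back-of-the-envelope sketch using the comment's example figures; the function name is illustrative):

```python
def node_utilization(requested_slices: int, physical_gpus: int, replicas: int) -> float:
    """Fraction of a node's sliced-GPU capacity a workload would occupy."""
    return requested_slices / (physical_gpus * replicas)

# Example from the comment: 8 slices requested on an 8-GPU node sliced to 10,
# i.e. 8 of 80 shared GPUs in use until later workloads fill the node.
print(node_utilization(8, 8, 10))  # → 0.1
```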
My initial testing yields positive results and is what I generally outlined above.
Scenario:
Outcome:
Some caveats I have seen so far:
When you have everything tuned nicely, the only downside appears to be the small bit of thrash during provisioning. Otherwise, it works as advertised. While I understand this is not a perfect solution, it shows that an implementation can work right now in the right scenarios.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale
/remove-lifecycle stale
Tell us about your request
Karpenter should respect time-slicing configuration when provisioning GPU instances.
Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?
When you use time slicing with the NVIDIA/k8s-device-plugin, Karpenter can fail to provision new instances.
As per the documentation here: https://github.com/NVIDIA/k8s-device-plugin#shared-access-to-gpus-with-cuda-time-slicing
To implement time-slicing you would configure:

```yaml
version: v1
sharing:
  timeSlicing:
    renameByDefault: <bool>
    failRequestsGreaterThanOne: <bool>
    resources:
    - name: <resource-name>
      replicas: <num-replicas>
```

the outcome of this being that if you ran `kubectl describe node` you'd see:

```
Capacity:
  nvidia.com/gpu: <num-replicas>
```
Karpenter respects that when scheduling pods; for example, 6 pods which each request 4 GPUs will be placed onto a node when `<num-replicas>` is 24.
However, when it comes to scaling, the following error occurs:

```
{"memory":"4Gi","nvidia.com/gpu":"4","pods":"1"}
```

because the deployment is requesting 4 GPUs but the instance types in the Provisioner only have 1 GPU:

```yaml
spec:
  requirements:
```

When evaluating which instance type to provision, Karpenter should evaluate the number of GPUs specified in `<num-replicas>`, not the number of physical GPUs that instance type/size has, and therefore provision a new instance (in this example).
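The proposed behavior could be sketched as follows. This is a hypothetical illustration only: the instance type names and GPU counts are invented for the example, and the selection logic simply estimates each type's sliced capacity as physical GPUs times `<num-replicas>`.

```python
def pick_instance_type(requested_slices, num_replicas, instance_types):
    """Return the instance type with the fewest physical GPUs whose
    estimated sliced capacity (gpus * num_replicas) covers the request,
    or None if nothing fits."""
    candidates = [
        (gpus, name)
        for name, gpus in instance_types.items()
        if gpus * num_replicas >= requested_slices
    ]
    return min(candidates)[1] if candidates else None

# Illustrative instance types mapped to physical GPU counts.
types = {"gpu-small": 1, "gpu-medium": 4, "gpu-large": 8}

# A pod asking for 4 sliced GPUs with 24 replicas per physical GPU
# fits on a single-GPU machine (1 * 24 = 24 >= 4).
print(pick_instance_type(4, 24, types))  # → gpu-small
```

With this estimate in place, the single-GPU instance types from the Provisioner above would satisfy the 4-slice request instead of failing.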
Are you currently working around this issue?
No
Additional Context
No response
Attachments
No response