kubernetes-sigs / karpenter

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.
Apache License 2.0

Dynamic Resource Allocation #1231

Open jonathan-innis opened 2 months ago

jonathan-innis commented 2 months ago

Description

What problem are you trying to solve?

If you haven't heard, there's a lot of buzz in the community about this thing called "Dynamic Resource Allocation" (DRA). Effectively, it's a change to the existing Kubernetes resource model: node hardware is surfaced through ResourceSlice objects associated with a node, and users select against that hardware by creating a ResourceClaim that performs attribute-based selection using Common Expression Language (CEL).
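As a rough illustration of that selection model, here is what a ResourceClaim with a CEL selector looks like under the v1alpha3 structured-parameters API. This is a sketch only: the device class name and attribute are made up, and the field names are from the alpha API, which is still in flux.

```yaml
# Illustrative only: gpu.example.com and the "model" attribute are
# hypothetical, and the resource.k8s.io/v1alpha3 schema may still change.
apiVersion: resource.k8s.io/v1alpha3
kind: ResourceClaim
metadata:
  name: single-gpu
spec:
  devices:
    requests:
    - name: gpu
      deviceClassName: gpu.example.com
      selectors:
      - cel:
          expression: device.attributes["gpu.example.com"].model == "a100"
```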

The proposal for this change is documented here where there is a ton of discussion for the use-cases and the implications throughout the Kubernetes project.

The change to the resource model is of particular importance to Karpenter, since we rely deeply on this resource model to know whether a pod is eligible to schedule against an instance type, which we can think of as a "theoretical" node. Effectively, Karpenter now needs to be aware of ResourceSlice and ResourceClaim to know which instance types have the hardware required to schedule a set of pods. As Karpenter performs scheduling against these ResourceSlices, it needs to simulate a pod taking up that hardware and rule out an instance type when the hardware can no longer fit the pods scheduling against it.
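The simulation described above can be sketched roughly as follows. This is not Karpenter's actual scheduler code; the types (`theoreticalNode`, `claim`) are hypothetical simplifications that model claims as simple device-class counts rather than CEL selectors, just to show the fit-and-rule-out loop.

```go
package main

import "fmt"

// theoreticalNode is a hypothetical model of an instance type before a node
// exists: the devices a cloud provider expects its ResourceSlices to expose.
type theoreticalNode struct {
	instanceType string
	devices      map[string]int // device class -> free device count
}

// claim is a simplified stand-in for a ResourceClaim: a request for count
// devices of one class (real claims select via CEL over device attributes).
type claim struct {
	deviceClass string
	count       int
}

// tryReserve simulates a pod's claims consuming devices on the theoretical
// node; returning false rules the instance type out for that pod.
func (n *theoreticalNode) tryReserve(claims []claim) bool {
	for _, c := range claims {
		if n.devices[c.deviceClass] < c.count {
			return false
		}
	}
	for _, c := range claims {
		n.devices[c.deviceClass] -= c.count
	}
	return true
}

func main() {
	n := &theoreticalNode{
		instanceType: "p4d.24xlarge",
		devices:      map[string]int{"gpu.nvidia.com": 8},
	}
	fmt.Println(n.tryReserve([]claim{{"gpu.nvidia.com", 4}})) // fits
	fmt.Println(n.tryReserve([]claim{{"gpu.nvidia.com", 8}})) // ruled out
}
```

The key point is the stateful second step: each simulated pod permanently consumes capacity on the theoretical node, so later pods in the same scheduling pass see the reduced device count.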

This has some relation to https://github.com/kubernetes-sigs/karpenter/issues/751 but I think we can decouple for now. DRA only requires that we know what the resource model would look like if the node were to launch; it doesn't necessitate that we allow users to specify arbitrary resources.

CloudProviders can first-class a set of resources they know will appear in the ResourceSlices when the node comes up and hand that back in the GetInstanceTypes call for the scheduler to reason about. Some solid use-cases for this are devices like NVIDIA GPUs, whose hardware is well-known before the instance launches, or AWS's Inferentia accelerators.
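A minimal sketch of that idea, assuming a trimmed-down, hypothetical view of what GetInstanceTypes could return (the real cloudprovider.InstanceType interface is richer, and these field names are invented for illustration):

```go
package main

import "fmt"

// instanceType is a hypothetical, simplified view of what a CloudProvider
// could return from GetInstanceTypes: alongside ordinary resources, the
// devices it knows will appear in the node's ResourceSlices after launch.
type instanceType struct {
	name         string
	knownDevices map[string]int // device class -> count
}

// getInstanceTypes stands in for the CloudProvider call. Device data for
// well-known hardware (NVIDIA GPUs, AWS Inferentia/Neuron) is static per
// instance type, so it can be surfaced before any node exists.
func getInstanceTypes() []instanceType {
	return []instanceType{
		{name: "p4d.24xlarge", knownDevices: map[string]int{"gpu.nvidia.com": 8}},
		{name: "inf2.48xlarge", knownDevices: map[string]int{"neuron.amazonaws.com": 12}},
	}
}

func main() {
	// The scheduler would consult knownDevices when simulating ResourceClaims
	// against each "theoretical" node.
	for _, it := range getInstanceTypes() {
		fmt.Println(it.name, it.knownDevices)
	}
}
```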

Tasks

I want to build out a set of tasks that can be taken up to get a PoC for this working. Ideally, someone could build this out with kwok, and then we could apply the same changes to the Azure and AWS providers.

Working Group

Separately, if you are interested in attending the Working Group and contributing to other use-cases around DRA, the log is here, and the official working group charter and meeting times are here.

The YouTube Playlist for previous meetings can also be found here.

jonathan-innis commented 2 months ago

/triage accepted

jonathan-innis commented 2 months ago

IMO, it makes a lot of sense to build out a staging/dra branch for the PoC work here. We can start building out the changes and collaborate on them without pulling them into the main branch. This is definitely going to be important since the DRA stuff is in beta and still in flux.

uniemimu commented 2 months ago

> This is definitely going to be important since the DRA stuff is in beta and still in flux.

DRA is alpha; the beta ETA is Kubernetes 1.32. Starting the work aligned with KEP 4381 makes sense.

jonathan-innis commented 3 weeks ago

Update: There is another KEP (that is probably the more up-to-date one) that proposes a bunch of changes in 1.31: https://github.com/kubernetes/enhancements/pull/4709. I'd encourage folks who are interested to take a look at it and see what we think about how it fits in with Karpenter's scheduling logic.

As @uniemimu called out, the current target is 1.32 for the API that is proposed in the KEP to go to beta.