pohly opened this issue 2 years ago
Before we automate something (i.e., add templating and automatic scheduling), we need the manual flow to work
This is a pretty big statement. I worry that the things we need for manual selection may be the things we DON'T WANT for automation. Giving the user too much control can be an attractive nuisance which makes the "real" solution harder. Pre-provisioned volumes are a lesson I took to heart.
- Users evaluate their workloads and decide on the set of resources needed on each node.
- Users manually select the set of nodes on which they expect those workloads to run, and label those sets of nodes.
- Users pre-provision the resources on those nodes using ResourceClaims that specify those nodes.
- Users create deployments or other workload controller resources to provision the pods, using a nodeSelector to map those to the set of nodes with the pre-provisioned resources (see the sketch after this list).
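For concreteness, here is a rough, hypothetical sketch of steps 3 and 4: the node-pool label (example.com/gpu-pool), the claim name (gpu-for-pool-a), and the image are made up, and the exact shape of the pod-level claim reference depends on which alpha version of the resource.k8s.io API is in use.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: training
spec:
  replicas: 1
  selector:
    matchLabels:
      app: training
  template:
    metadata:
      labels:
        app: training
    spec:
      # Step 4: pin the workload to the manually labelled node pool.
      nodeSelector:
        example.com/gpu-pool: pool-a
      containers:
      - name: main
        image: registry.example.com/training:latest
        resources:
          claims:
          - name: gpu
      # Step 3's pre-provisioned ResourceClaim, referenced by name.
      # (Pre-1.31 alpha versions nest this reference under `source:`.)
      resourceClaims:
      - name: gpu
        resourceClaimName: gpu-for-pool-a
```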
In this model, what is the value of ResourceClaims above simple Device Plugins? Or maybe I misunderstand the proposal? I read this as: label nodes with GPUs, use node selectors, use device plugins to get access to GPUs. Which I think is what people already do, right?
IOW it uses node-management as a substitute for (coarse) resource management.
I am certainly seeing a trend of people running (effectively) one workload pod per node, which fits this model. If we tidy that up and codify it (even just saying "here's a pattern you should not feel bad about"), does it relieve some of the pressure?
A manual (external) flow could work for things that you can attach to nodes, especially if they are then dedicated to the thing they attach to. Device plugins make the attached device usable by workloads, but they don't attach it for you.
Something like https://cloud.google.com/compute/docs/gpus/add-remove-gpus: something triggers making a ResourceClaim, and automation fulfils it by attaching a GPU to a VM. NFD runs and labels the node. There is probably a device plugin in this story; anyway, we end up with a node that has some GPU capacity available.
Next, someone runs a Job that selects for that kind of GPU, and it works. However, what's possibly missing at this stage is the ability to reserve that GPU resource from the claim and avoid it being used for other Pods.
If we want to run a second Pod, maybe we're able to attach a second GPU, avoiding the one-pod-per-node problem. We aren't doing DRA but we have helped some stories and narrowed what still needs delivering.
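For the "Job that selects for that kind of GPU" step, a minimal sketch of what exists today with NFD labels and a device plugin; the label and the extended resource name below are illustrative assumptions, not anything defined by this KEP.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: gpu-job
spec:
  template:
    spec:
      restartPolicy: Never
      # NFD-style label for nodes where a GPU was attached and detected.
      nodeSelector:
        feature.node.kubernetes.io/pci-0300_10de.present: "true"
      containers:
      - name: main
        image: registry.example.com/cuda-job:latest
        resources:
          limits:
            # Extended resource advertised by a (hypothetical) device plugin.
            example.com/gpu: 1
```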
Pre-provisioned volumes are a lesson I took to heart.
Those were also what came to my mind when I read @johnbelamaric's outline. Pre-provisioned volumes have been replaced by CSI volume provisioning, but now that also is stuck having to support the complicated "volume binding" between PVC and PV. The result is that there are still race conditions that can lead to leaked volumes. Let's not repeat this for DRA.
Here's why I think it could be different.
However:
Let's say you're attaching GPUs to a node, and you make a ResourceClaim to specify there should be 2 GPUs attached. The helper finds there's already one GPU manually / previously attached. How about we specify that this is not only undefined behaviour, but that we expect drivers to taint the node if they see it. No need to have complex logic around mixing manual and automatic provisioning; something is Wrong and the node might not be any good now.
If the helper is told to add 2 GPUs for a total of 2 attached GPUs and finds 0 attached GPUs: great! We get nodes with GPUs, no need to taint, and other consequences such as NFD and device plugin registration can all happen.
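A minimal sketch of what that driver-applied taint could look like; the key, value, and effect here are made up for illustration and are not something the KEP defines.

```yaml
apiVersion: v1
kind: Node
metadata:
  name: node-a
spec:
  taints:
  # Hypothetical taint the attach helper sets when it finds devices it
  # did not attach itself, keeping new pods off a possibly-bad node.
  - key: gpu.example.com/unexpected-device
    value: manual-attach-conflict
    effect: NoSchedule
```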
Does that work?
This is a pretty big statement. I worry that the things we need for manual selection may be the things we DON'T WANT for automation. Giving the user too much control can be an attractive nuisance which makes the "real" solution harder. Pre-provisioned volumes are a lesson I took to heart.
Fair point. But I think the trouble comes in more when you don't have a clear sense of ownership - that is, mixed manual and automated flows. If the automated flows have full ownership (overriding anything the user may have done), that problem mostly goes away.
In this model, what is the value of ResourceClaims above simple Device Plugins? Or maybe I misunderstand the proposal? I read this as: label nodes with GPUs, use node selectors, use device plugins to get access to GPUs. Which I think is what people already do, right?
I am not sure there is one, except that we are rebuilding a model that can be further extended to the full support. Another alternative may be to extend those existing mechanisms, rather than invent a new one.
My main goal with spitballing some alternatives is to see if we can deliver incremental scope in a useful but digestible way. My thinking is:
2 and 3 may have to go together, but I was hoping to find a solution to 1. With this approach, the functionality delivered in 1 can be done earlier because it is simpler, and it does not need to materially change as we implement 2 and 3, reducing risk.
Admittedly I may be cutting things up wrong...but I think there is merit to the approach.
- I can solve my use case with a combination of basic primitives and client-side tooling, but it requires me to know a lot of stuff about resources in my nodes.
This can be done with the current API minus PodSchedulingContext and without adding numeric parameters: write a DRA driver kubelet plugin and a corresponding controller, then use immediate allocation with some driver-specific claim parameters that select the node (by name or by labels). There's no need for an API extension specifically for this mode of operation.
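For illustration, a rough sketch of that mode using the classic DRA alpha API as I recall it (resource.k8s.io/v1alpha2) together with a hypothetical vendor CRD for the claim parameters; field and version names should be double-checked against the release actually in use.

```yaml
# Hypothetical, driver-specific parameters object that selects the node.
apiVersion: gpu.example.com/v1
kind: GpuClaimParameters
metadata:
  name: gpu-on-node-a
spec:
  nodeName: node-a
---
# Claim with immediate allocation: the driver's control-plane controller
# allocates it right away, without any PodSchedulingContext involvement.
apiVersion: resource.k8s.io/v1alpha2
kind: ResourceClaim
metadata:
  name: gpu-on-node-a
spec:
  resourceClassName: vendor-gpu
  allocationMode: Immediate
  parametersRef:
    apiGroup: gpu.example.com
    kind: GpuClaimParameters
    name: gpu-on-node-a
```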
Perhaps some hardware vendor will even write such a DRA driver for you, if some of their customers want to use it like this - I can't speak for either of them. This is why the KEP has "collect feedback" as one of the graduation criteria.
This might even give you option 2 and 3. I am not sure whether I grasp the difference between them :shrug:
This might even give you option 2 and 3. I am not sure whether I grasp the difference between them 🤷
I am not sure there is one!
Immediate provisioning doesn't ensure that the node on which the resource was provisioned can fit the pod's other dimensions, but maybe that's OK? The advantage of simple, stupid, counted resources is that the scheduler has all the information it needs about all requests.
What I am trying to get to, and I think John is aiming at the same goal (not that you're not, Patrick :), is to say what is the biggest problem people really experience today, and how can we make that better? A year ago I would have said it was GPU sharing. Now I understand (anecdotally) that sharing is far less important than simply getting out of the way for people who want to use whole GPUs.
Here's my uber-concern: k8s is the distillation of more than a decade of real-world experience running serving workloads (primarily) on mostly-fungible hardware. The game has changed, and we don't have a decade of experience. Anything we do right now has better than even odds of being wrong within a short period of time. The hardware is changing. Training vs. inference is changing. Capacity is crunched everywhere, and there's a sort of "gold rush" going on.
What can we do to make life better for people? How can we help them improve their efficiency and their time to market? Everything else seems secondary.
I ACK that this is GPU-centric and that DRA does more than just GPUs.
I'm okay with promoting "core DRA minus PodSchedulingContext" to beta in 1.30, if that helps someone, somewhere.
@klueska, @byako : I'll punt this to you.
Just beware that it would mean that we need to rush another KEP for "PodSchedulingContext" for 1.30 and add a feature gate for it - I'm a bit worried that we would be stretching ourselves too thin if we did that, and we would also skip all of the usual "gather feedback" steps for beta. I'd much rather focus on numeric parameters...
The advantage of simple, stupid, counted resources is that the scheduler has all the information it needs about all requests.
That sounds like "numeric parameters", which we cannot promote to beta in 1.30. When using "numeric parameters", PodSchedulingContext is indeed not needed.
Right. The numeric parameters approach keeps emerging as a potentially better path forward, but I really don't think we know enough to proclaim any of this as beta. If there are things we can do that are smaller and more focused, which would solve some of the problems, I am eager to explore that.
If we were starting from scratch RIGHT NOW, with no baggage, what would we be trying to achieve in 30?
Right. The numeric parameters approach keeps emerging as a potentially better path forward, but I really don't think we know enough to proclaim any of this as beta. If there are things we can do that are smaller and more focused, which would solve some of the problems, I am eager to explore that.
Yeah. I concede: nothing in beta in 1.30. That was my original position anyway, but I thought perhaps if we scoped something way down but kept continuity, we could do it. But it's clearly a no-go.
If we were starting from scratch RIGHT NOW, with no baggage, what would we be trying to achieve in 30?
Great question; that is what I am looking for, along the lines of an MVP. What we really need to do is go back to the use cases for that, which is pretty hard on this tight timeline. The better option may be to ask "what would we be trying to achieve", cutting out the "in 30", and defer that question to 31. In the meantime, maybe we make some incremental steps in 30 in the direction we need, based on what we know so far - something like @pohly is saying here plus the numerical models.
I'm okay with promoting "core DRA minus PodSchedulingContext" to beta in 1.30, if that helps someone, somewhere.
Can you expand on this? What becomes core DRA? Are we going to end up in the same inconsistent situation that the topology manager is in? The one where kube-scheduler sends a Pod to a Node and the kubelet just rejects it.
I'm okay with promoting "core DRA minus PodSchedulingContext" to beta in 1.30, if that helps someone, somewhere.
Can you expand on this? What becomes core DRA?
Are we going to end up in the same inconsistent situation that the topology manager is in? The one where kube-scheduler sends a Pod to a Node and the kubelet just rejects it.
I think it's clear at this point that nothing is going beta in 1.30. We will make sure to avoid the issue you are describing. I think we should err on the side of "failing to schedule" rather than "scheduling and failing".
/milestone clear
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
/label lead-opted-in
/milestone v1.31
this will be worked on, correct?
Yes, though what we will be doing is moving the "classic DRA" under a separate feature gate, and the "structured parameters DRA" will be under the primary feature gate.
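For anyone tracking the practical effect, my reading is that a 1.31 cluster would enable the two modes via separate feature gates, roughly as sketched below with kubeadm; the gate names (DynamicResourceAllocation for structured parameters, DRAControlPlaneController for classic DRA) and the API version are assumptions to verify against the 1.31 release notes.

```yaml
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
apiServer:
  extraArgs:
    # Serve the alpha resource.k8s.io API group (version assumed for 1.31).
    runtime-config: "resource.k8s.io/v1alpha3=true"
    feature-gates: "DynamicResourceAllocation=true,DRAControlPlaneController=true"
scheduler:
  extraArgs:
    feature-gates: "DynamicResourceAllocation=true,DRAControlPlaneController=true"
```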
@johnbelamaric Is this KEP going for beta, given that the base KEP is marked for beta in v1.31 in https://github.com/kubernetes/enhancements/issues/3063#issuecomment-1915852197?
Also, as I understand from https://github.com/kubernetes/enhancements/issues/3063#issuecomment-1915742353, this KEP, 3063, is the base KEP while the template KEP will be a new one distinct from 3063. Please correct me if I'm wrong.
Both this and #4381 stay in alpha in 1.31.
Hello @pohly 👋, Enhancements team here.
Just checking in as we approach enhancements freeze on 02:00 UTC Friday 14th June 2024 / 19:00 PDT Thursday 13th June 2024.
This enhancement is targeting stage alpha for v1.31 (correct me if otherwise).
Here's where this enhancement currently stands:
- KEP status is marked as implementable for latest-milestone: v1.31. KEPs targeting stable will need to be marked as implemented after code PRs are merged and the feature gates are removed.

With all the KEP requirements in place and merged into k/enhancements, this enhancement is all good for the upcoming enhancements freeze. 🚀
The status of this enhancement is marked as tracked for enhancement freeze. Please keep the issue description up-to-date with appropriate stages as well. Thank you!
Hello @pohly :wave:, 1.31 Docs Shadow here.
Does this enhancement work planned for 1.31 require any new docs or modification to existing docs?
If so, please follow the steps here to open a PR against the dev-1.31 branch in the k/website repo. This PR can be just a placeholder at this time and must be created before Thursday June 27, 2024 18:00 PDT.
Also, take a look at Documenting for a release to get yourself familiarised with the docs requirement for the release.
Thank you!
@prianna: I was a bit late with updating the issue description. What you looked at was the KEP update for 1.30, not the one for 1.31. However, the one for 1.31 also got reviewed, approved and merged before the deadline, so the end result is the same. Sorry for the confusion!
@hacktivist123: I created a placeholder PR.
Thanks @pohly
Hi @pohly,
:wave: from the v1.31 Communications Team! We'd love for you to opt in to write a feature blog about your enhancement! Some reasons why you might want to write a blog for this feature include (but are not limited to) if this introduces breaking changes, is important to our users, or has been in progress for a long time and is graduating.
To opt in, let us know and open a Feature Blog placeholder PR against the website repository by 3rd July, 2024. For more information about writing a blog see the blog contribution guidelines.
Note: In your placeholder PR, use XX characters for the blog date in the front matter and file name. We will work with you on updating the PR with the publication date once we have a final number of feature blogs for this release.
Hey @pohly, friendly reminder about the upcoming blog opt-in and placeholder deadline on July 3rd. Please open a blog placeholder PR if you are interested in contributing a blog.
Hey again @pohly 👋 Enhancements team here,
Just checking in as we approach code freeze at 02:00 UTC Wednesday 24th July 2024 / 19:00 PDT Tuesday 23rd July 2024.
Here's where this enhancement currently stands:
- All PRs are fully merged (with approved and lgtm labels applied) by the code freeze deadline. This includes tests.

For this enhancement, it looks like the following PRs are open and need to be merged before code freeze:
I see that you're earnestly working to converge on a suitable type definition. However, if you anticipate missing code freeze, you can file an exception request in advance.
Also, please let me know if there are other PRs in k/k we should be tracking for this KEP (and thanks for updating the description). As always, we are here to help if any questions come up. Thanks!
@hacktivist123 Hey there, why was this moved to "tracked for code freeze"? As far as I can see, the PR has not been merged yet. I'm moving it back, please let me know if there was some reason this was transitioned.
Hello @prianna, that must have been a mix-up from cleaning up my board.
Thanks for the catch!
https://github.com/kubernetes/kubernetes/pull/125488 got merged, so we are good for 1.31.
Thanks! Marking this KEP as tracked for code freeze 🎉
@pohly What is your intention for this feature going forward? DRA has split into sub-issues, but I guess you are keeping this open as the main feature to update?
I don't have any particular plan for this aspect of DRA. Given that others want it removed, some potential user would need to speak up soon or it will get removed, probably in 1.32.
So then technically we may want to do some work on this even if it's headed for deprecation?
I'm trying to figure out for sig-node what to consider this feature: "Proposed For Release" or "Not for Release"?
I suggest we raise it in SIG Node and WG Dev Mgmt with the following plan:
Hi, enhancements lead here - I inadvertently added this to the 1.32 tracking board 😀. Please re-add it if you wish to progress this enhancement in 1.32.
/remove-label lead-opted-in
/milestone v1.32
/label lead-opted-in
Request for leaving this here a little longer, @klueska. We would like some time to go evaluate what is best from the scheduling side. If we can have some time to try to resolve the complexity of the structured parameters and maybe simplify the classic, having this to play with would be really helpful. Removing it in 1.33 should be okay, because by then we'll have a plan. I spoke with @johnbelamaric already and he suggested I leave this comment and request. We are also looking at handling CPU, as was referenced in the original DRA doc here https://docs.google.com/document/d/1XNkTobkyz-MyXhidhTp5RfbMsM-uRCWDoflUMqNcYTk/, but I'm aware that that may make this scope too complex.
@pohly can you lay out here the implications of #4381 going beta without first removing #3063? It's important that we make beta for #4381 in 1.32.
The two are independent since Kubernetes 1.31, with separate feature gates. Keeping #3063 as alpha does not block #4381 as beta. It also does not cause extra work (that was all already done for 1.31).
The two are independent since Kubernetes 1.31, with separate feature gates. Keeping #3063 as alpha does not block #4381 as beta. It also does not cause extra work (that was all already done for 1.31).
@klueska (or was it @SergeyKanzhelev?) mentioned that there are round-tripping implications; is that accurate?
Probably Jordan.
If we don't remove it now, the following fields remain reserved forever:
They don't get set, but the names are "burned" and cannot be used for something else in the future. I think that's okay and won't block future extensions.
We're still using classic DRA at the moment, and removing it would be a big breaking change for us, so before we move to structured parameters, we want classic DRA to stay around for a while. Thanks!
We're still using classic DRA at the moment, and removing it would be a big breaking change for us, so before we move to structured parameters, we want classic DRA to stay around for a while. Thanks!
Alpha features don't have backwards compatibility guarantees. I suggest you start the migration process to structured DRA or elaborate on why it's not possible for you to migrate.
So we can wait another cycle, but perhaps expect a rename?
We can rename the existing fields now and then keep them forever. The earlier such a rename happens, the fewer people who need to update their code / integrations.
I don't think we need to find a technical solution to perpetuate an alpha feature, especially when we are developing the alternative that solves the problems with the original one. Also, if there is a bug in classic DRA, a fix will not be backported and most probably the bug will not be fixed at all.
We're still using classic DRA at the moment, and removing it would be a big breaking change for us, so before we move to structured parameters, we want classic DRA to stay around for a while. Thanks!
Request for leaving this here a little longer, @klueska. We would like some time to go evaluate what is best from the scheduling side. [...]
@catblade @cyclinder it seems you are working with old versions of Kubernetes. Can you explain which versions you are using now and how far you are from current development? We need more information than "please don't remove it" to objectively evaluate the cost of maintaining code that should not be used. It is also important to describe the exact problems and why you cannot use the new one: is it a custom DRA driver you already have, or one you are using from a third party?
I had a call with @catblade. She cannot share in public yet what she is working on, but I found it interesting and worth supporting by keeping classic DRA as alpha for another release. It's important to note that it's not about supporting some existing solution. Instead, she is currently exploring both classic DRA and structured parameters and wants to have all options available until she reaches a conclusion of that exploration.
@cyclinder already said on Slack that they will use classic DRA only with older Kubernetes and want to migrate to structured parameters for Kubernetes >= 1.31.
Enhancement Description
- One-line enhancement description: dynamic resource allocation
- Kubernetes Enhancement Proposal: https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/3063-dynamic-resource-allocation
- Discussion Link: CNCF TAG Runtime Container Device Interface (CDI) Working Group meeting(s)
- Primary contact (assignee): @pohly
- Responsible SIGs: SIG Node
- Enhancement target (which target equals to which milestone):
- [x] Alpha (1.26)
  - KEP (k/enhancements) update PR(s):
  - Code (k/k) update PR(s): https://github.com/kubernetes/kubernetes/pull/111023
  - Docs (k/website) update PR(s):
- [x] Alpha (1.27)
  - KEP (k/enhancements) update PR(s):
  - Code (k/k) update PR(s):
  - Docs (k/website) update PR(s):
- [x] Alpha (1.28)
  - KEP (k/enhancements) update PR(s):
  - Code (k/k) update PR(s):
  - Docs (k/website) update PR(s): https://github.com/kubernetes/website/pull/41856
- [x] Alpha (1.29)
  - KEP (k/enhancements) update PR(s):
  - Code (k/k) update PR(s):
  - Docs (k/website) update PR(s):
- [x] Alpha (1.30)
  - KEP (k/enhancements) update PR(s):
  - Code (k/k) update PR(s):
  - Docs (k/website) update PR(s):
- [x] Alpha (1.31)
  - KEP (k/enhancements) update PR(s): https://github.com/kubernetes/enhancements/pull/4709
  - Code (k/k) update PR(s): https://github.com/kubernetes/kubernetes/pull/125488
  - Docs (k/website) update PR(s): https://github.com/kubernetes/website/pull/46816
- [ ] Withdrawn (1.32)
  - KEP (k/enhancements) update PR(s): https://github.com/kubernetes/enhancements/pull/4904
  - Code (k/k) update PR(s): https://github.com/kubernetes/kubernetes/pull/128003
  - Docs (k/website) update PR(s): https://github.com/kubernetes/website/pull/48289