kubernetes / enhancements

Enhancements tracking repo for Kubernetes

DRA: control plane controller ("classic DRA") #3063

Open pohly opened 2 years ago

pohly commented 2 years ago

Enhancement Description

SergeyKanzhelev commented 1 year ago

what's the verdict? should we apply the milestone v1.27?

pohly commented 1 year ago

Let's apply the milestone. I think it reflects more accurately that we intend to work on this in this cycle. We might even do a resource.k8s.io/v1alpha2, so it may become user-visible.

dchen1107 commented 1 year ago

/label leads-opted-in

k8s-ci-robot commented 1 year ago

@dchen1107: The label(s) /label leads-opted-in cannot be applied. These labels are supported: api-review, tide/merge-method-merge, tide/merge-method-rebase, tide/merge-method-squash, team/katacoda, refactor, lead-opted-in, tracked/no, tracked/out-of-tree, tracked/yes. Is this label configured under labels -> additional_labels or labels -> restricted_labels in plugin.yaml?

In response to [this](https://github.com/kubernetes/enhancements/issues/3063#issuecomment-1421159975):

> /label leads-opted-in

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

SergeyKanzhelev commented 1 year ago

/label lead-opted-in

shatoboar commented 1 year ago

Hello @pohly 👋, Enhancements team here.

Just checking in as we approach enhancements freeze on 18:00 PDT Thursday 9th February 2023.

This enhancement is targeting stage alpha for v1.27 (correct me if otherwise).

Here's where this enhancement currently stands:

For this KEP, we would just need to update the following:

The status of this enhancement is marked as at risk. Please keep the issue description up-to-date with appropriate stages as well. Thank you!

marosset commented 1 year ago

@pohly @SergeyKanzhelev @ruiwen-zhao @dchen1107 - can one of you please open up a PR to update the milestones in https://github.com/kubernetes/enhancements/blob/810f5c77c91afe81f89ba32d97cea16698fcc953/keps/sig-node/3063-dynamic-resource-allocation/kep.yaml#L27 for v1.27? Thanks!
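For anyone following along, the requested change is just the usual milestone bump in the KEP metadata. A minimal sketch of what that looks like, assuming the standard kep.yaml milestone fields (the exact values must match the KEP's actual history):

```yaml
# keps/sig-node/3063-dynamic-resource-allocation/kep.yaml (sketch; values illustrative)
latest-milestone: "v1.27"   # bump to the release currently being targeted
milestone:
  alpha: "v1.26"            # release in which the KEP first shipped as alpha
```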

marosset commented 1 year ago

Thanks @SergeyKanzhelev

This enhancement meets all the requirements to be included in v1.27.

shatoboar commented 1 year ago

Hi @pohly 👋, Checking in as we approach 1.27 code freeze at 17:00 PDT on Tuesday 14th March 2023. Please ensure the following items are completed:

Please let me know what other PRs in k/k I should be tracking for this KEP. As always, we are here to help should questions come up. Thanks!

pohly commented 1 year ago

The plan is to continue working on this feature in 1.28, while keeping it in alpha. Several code PRs are already pending.

mrunalp commented 1 year ago

/label lead-opted-in

Atharva-Shinde commented 1 year ago

Hello @pohly 👋, Enhancements Lead here.

Just checking in as we approach enhancements freeze on Thursday, 16th June 2023 which is just a few hours away.

Looks like this enhancement is staying in stage alpha for v1.28

Here's where this enhancement currently stands:

For this KEP, we would simply need to update the following:

The status of this enhancement is marked as at risk. Please keep the issue description up-to-date with appropriate stages as well. Thank you :)

pohly commented 1 year ago

@Atharva-Shinde: the milestone was bumped via https://github.com/kubernetes/enhancements/pull/4096. I think we are ready to continue working on this feature in 1.28, right?

Atharva-Shinde commented 1 year ago

Hey @pohly

With reference to https://github.com/kubernetes/enhancements/pull/4096, all the KEP requirements are now in place and merged into k/enhancements, therefore this enhancement is all good for the upcoming enhancements freeze 🚀

The status of this enhancement is marked as tracked. Please keep the issue description up-to-date with appropriate stages as well. Thank you :)

Rishit-dagli commented 1 year ago

Hello @pohly :wave:, 1.28 Docs Lead here.

Does this enhancement work planned for 1.28 require any new docs or modification to existing docs?

If so, please follow the steps here to open a PR against the dev-1.28 branch in the k/website repo. This PR can be just a placeholder at this time and must be created before Thursday 20th July 2023.

Also, take a look at Documenting for a release to familiarize yourself with the docs requirements for the release.

Thank you!

Atharva-Shinde commented 1 year ago

Hey again @pohly :wave:

Just checking in as we approach code freeze at 01:00 UTC, Friday, 19th July 2023.

Here’s the enhancement’s state for the upcoming code freeze:

I don't see any code (k/k) update PR(s) in the issue description, so if there are any k/k-related PR(s) that we should be tracking for this KEP, please link them in the issue description above.

As always, we are here to help if any questions come up. Thanks!

Atharva-Shinde commented 1 year ago

Hey @pohly, I don't see any implementation (i.e., code-related) PRs, merged or open, associated with this KEP for the current v1.28 milestone on this issue. If there are code PRs that are merged or in a merge-ready state, please link all the related PRs ASAP. Until then, unfortunately, this enhancement is being removed from the v1.28 milestone.

If you still wish to progress this enhancement in v1.28, please file an exception request. Thanks!

/milestone clear

alculquicondor commented 1 year ago

@pohly can you link the PRs about scheduling hints?

klueska commented 1 year ago

@Atharva-Shinde

These are all of the PRs related to DRA that landed this cycle:

This one was planned but didn't make it:

Atharva-Shinde commented 1 year ago

Thanks @klueska for linking all the code PRs associated with this KEP. As all these PRs associated with the v1.28 milestone were merged or in a merge-ready state by code freeze, I am adding this KEP back to the v1.28 milestone and marking it as tracked for the code freeze :)

/milestone v1.28

alculquicondor commented 1 year ago

Please also add kubernetes/kubernetes#118551 and kubernetes/kubernetes#118438

Rishit-dagli commented 1 year ago

Hello @pohly :wave:, please take a look at Documenting for a release - PR Ready for Review to get your docs PR ready for review before Tuesday 25th July 2023. Thank you!

Ref: https://github.com/kubernetes/website/pull/41856

pohly commented 1 year ago

@klueska: thanks for collecting the PR list while I was on vacation.

Out of curiosity, how did that get into the issue description? Does the release team have permission to edit issues?

npolshakova commented 1 year ago

Hello 👋, 1.29 Enhancements Lead here.

If you wish to progress this enhancement in v1.29, please have the SIG lead opt in your enhancement by adding the lead-opted-in label and the v1.29 milestone before the Production Readiness Review Freeze.

/remove-label lead-opted-in

pacoxu commented 1 year ago

This one was planned but didn't make it:

This PR is finally merged in v1.29. I commented on the issue description directly. Do we still target beta for v1.29?

So far, these are the merged PRs in the v1.29 release cycle:

Some ongoing PRs:

pohly commented 1 year ago

We still target beta for 1.29. I'm currently doing a full pass over the KEP to ensure that everything is covered and documented properly.

alculquicondor commented 1 year ago

Was there any progress in the autoscaling story?

pohly commented 1 year ago

I discussed it at KubeCon EU this year with Maciek. We tossed around a few ideas on how the autoscaler could be extended by vendors to cover their business logic, but in the end we'll need a plugin mechanism for it, similar to what is in the KEP. I still need to come up with specific code for it. That'll be the biggest work item for 1.29.

I got sidetracked by the scheduler issues, otherwise I would have started already on it :cold_sweat:

Here's the PR with the beta KEP update: https://github.com/kubernetes/enhancements/pull/4181

SergeyKanzhelev commented 11 months ago

/milestone v1.29
/stage beta
/label lead-opted-in

npolshakova commented 11 months ago

Hello @pohly 👋, 1.29 Enhancements team here!

Just checking in as we approach enhancements freeze on 01:00 UTC, Friday, 6th October, 2023.

This enhancement is targeting stage beta for 1.29 (correct me if otherwise).

Here's where this enhancement currently stands:

It looks like https://github.com/kubernetes/enhancements/pull/4181/files will address most of these issues! Make sure to also update the stage in the kep.yaml.

The status of this enhancement is marked as at risk for enhancement freeze. Please keep the issue description up-to-date with appropriate stages as well. Thank you!

npolshakova commented 11 months ago

Hi @pohly, just checking in once more as we approach the 1.29 enhancement freeze deadline this week on 01:00 UTC, Friday, 6th October, 2023. The status of this enhancement is marked as at risk for enhancement freeze.

It looks like when https://github.com/kubernetes/enhancements/pull/4181 merges in it will address most of the requirements. Let me know if I missed anything. Thanks!

npolshakova commented 11 months ago

Hello @pohly 👋, 1.29 Enhancements Lead here. Unfortunately, this enhancement did not meet requirements for v1.29 enhancements freeze. Feel free to file an exception to add this back to the release tracking process.

It looks like there is still ongoing discussion for this KEP. Let me know if you have any questions!

Thanks!

/milestone clear

sftim commented 10 months ago

Tracking something to highlight it for graduation reviews: I had a concern that we might not be planning to document the resource driver API well enough for people to learn about and use it.

salehsedghpour commented 8 months ago

/remove-label lead-opted-in

SergeyKanzhelev commented 7 months ago

/stage alpha
/milestone v1.30

salehsedghpour commented 7 months ago

Hello @pohly , 1.30 Enhancements team here! Is this enhancement targeting 1.30? If it is, can you follow the instructions here to opt in the enhancement and make sure the lead-opted-in label is set so it can get added to the tracking board? Thanks!

johnbelamaric commented 7 months ago

Hi all DRA-interested folks,

Patrick and I (and others) have been working to get some resolution to the DRA functionality for 1.30 and beyond. In #4384 we have been discussing how to break down the DRA KEPs.

I thought it would be good to bring the discussion back here for visibility. It is very much an active discussion right now; we are not all in alignment. A few related discussions:

Here is my take (note: Patrick and I discussed this earlier today, but he is still considering it and not fully aligned with this; this is just my opinion at this time), which is an expansion on the last comment. As I see it, there are two primary reasons we can't go to beta with this right now:

  1. As written it doesn't support cluster autoscaling.
  2. It is very large, with a lot of functionality; it covers a lot of use cases and has touch points in some critical parts of the code base, which creates an unacceptable level of risk in taking it all in as one KEP.

We are working on a solution to the first one and I think we are making progress. The second one is primarily about the structure of the KEPs, their feature gates, and how they work together.

My suggestion, roughly described in the aforementioned comment, tries to reduce the risk while bringing useful bits to beta/GA piecemeal.

Base KEP

Let's scope down this base KEP to a bare minimum viable product, which would target beta in 1.30 (this may be controversial):

The ResourceClaimTemplate and (if needed) PodSchedulingContext will be in a KEP that builds on this one; they will not be in the base KEP.

The other consideration for scope for the base KEP is the idea of immediate and delayed allocation. I think with the simplification of no templates, we can digest both immediate and delayed allocation in the same KEP.

In this MVP, all the scheduler has to do is make sure to pick the node that the resourceClaims are on. This eliminates 90% of the complexity, but still has some utility.

Yeah, it's not automatic in any way (thus the "minimal" in MVP). The user has to handcraft the choice of nodes, which sucks. It's a little painful for Deployments, because you have to manually pre-create the resource claims associated with specific nodes, label those nodes, and then use that in the nodeSelector field of your deployment Pod template. HPA should work with that. But cluster autoscaling may still be an issue (resolved in the template KEP), if CA does not allow us to create an ancillary resource next to the node.

Nonetheless, this does allow actual use cases to be met, and it allows us to get something to beta ASAP that enables use cases (even if requiring a lot of client-side work). Getting to beta means we can get more soak time on the feature and reduce risk.
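To make that manual flow concrete, here is a rough sketch of what it could look like. The spec.nodeName field on ResourceClaim is part of the proposal in this comment, not an existing API field, and the resource class and label names are made up:

```yaml
# Hypothetical MVP flow: a hand-crafted claim pinned to a pre-provisioned node.
# spec.nodeName is proposed above; it is NOT part of the current v1alpha2 API.
apiVersion: resource.k8s.io/v1alpha2
kind: ResourceClaim
metadata:
  name: gpu-node-a-claim
spec:
  resourceClassName: example.com-gpu   # made-up class served by a vendor driver
  nodeName: node-a                     # proposed: the user picks the node up front
---
# The Deployment is steered to the labeled node(s) that hold the pre-created claims.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-workload
spec:
  replicas: 1
  selector:
    matchLabels: {app: gpu-workload}
  template:
    metadata:
      labels: {app: gpu-workload}
    spec:
      nodeSelector:
        example.com/gpu-pool: pool-a   # label applied manually to the nodes
      resourceClaims:
      - name: gpu
        source:
          resourceClaimName: gpu-node-a-claim
      containers:
      - name: main
        image: registry.k8s.io/pause:3.9
        resources:
          claims:
          - name: gpu
```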

I would tentatively suggest this dramatically more limited functionality could go to beta in 1.30, given it is building on what is here (it's modifying existing code). Additionally, the primary goal in alpha is to make sure we have the API right. I don't think the API described above is controversial.

Template KEP

We split out ResourceClaimTemplate into a new KEP, which would target alpha in 1.30. This would include the initial version of numerical models #4384, but NOT include the escape hatch of custom models. Without the escape hatch, we do not need PodSchedulingContext, since there is no need for the drivers and the scheduler to negotiate a node. Instead, the scheduler can use its normal flow plus the numerical model, and pick the node. Instead of just setting nodeName on the pod, it also sets it on the resource claims generated from the template.

At that point, it goes through the normal allocation process, just as if those claims had been handcrafted by a user.
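For reference, the template-driven flow looks roughly like this under the current v1alpha2 alpha API (names are illustrative). Each pod stamped out by a workload controller gets its own generated ResourceClaim instead of sharing a hand-crafted one:

```yaml
# Sketch: the control plane generates one ResourceClaim per pod from this template.
apiVersion: resource.k8s.io/v1alpha2
kind: ResourceClaimTemplate
metadata:
  name: gpu-claim-template
spec:
  spec:
    resourceClassName: example.com-gpu   # illustrative class name
---
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  resourceClaims:
  - name: gpu
    source:
      resourceClaimTemplateName: gpu-claim-template  # a fresh claim is generated for this pod
  containers:
  - name: main
    image: registry.k8s.io/pause:3.9
    resources:
      claims:
      - name: gpu
```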

Escape Hatch KEP

If we decide we need the escape hatch "custom" numerical model (driver controller and scheduler negotiate the node), we can consider adding that and the related PodSchedulingContext. I think we can actually get the models to be expressive enough that this won't be needed, but the possibility is always there. Remember, the models don't have to be perfect; they can err on the side of caution initially, and we can refine them over time to improve resource utilization.
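For context, the negotiation object in question looks roughly like this in the current v1alpha2 alpha API (a sketch; values illustrative). The scheduler proposes candidate nodes, and the driver controller reports which of them are unsuitable:

```yaml
# Sketch of the scheduler/driver negotiation that the escape hatch would retain.
apiVersion: resource.k8s.io/v1alpha2
kind: PodSchedulingContext
metadata:
  name: gpu-pod              # shares name and namespace with the pod it belongs to
spec:
  selectedNode: node-a       # the scheduler's current favorite
  potentialNodes:            # candidates the driver should evaluate
  - node-a
  - node-b
status:
  resourceClaims:
  - name: gpu
    unsuitableNodes:         # filled in by the DRA driver controller
    - node-b
```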

sftim commented 7 months ago

My thinking: if .spec.resourceClaims goes to beta for v1.30, the associated feature gate should be off by default. However, we can also ask cluster lifecycle tools to consider ways to help make it easy for people to experiment with DRA.
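As one sketch of what "make it easy to experiment" could look like for a kubeadm-based test cluster (not a recommendation; the exact feature gate, API group, and flag spellings should be checked against the release in use):

```yaml
# Sketch: opting a test cluster into the DRA alpha via kubeadm.
# The kubelet also needs the feature gate enabled via its own configuration.
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
apiServer:
  extraArgs:
    feature-gates: "DynamicResourceAllocation=true"
    runtime-config: "resource.k8s.io/v1alpha2=true"
controllerManager:
  extraArgs:
    feature-gates: "DynamicResourceAllocation=true"
scheduler:
  extraArgs:
    feature-gates: "DynamicResourceAllocation=true"
```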

If we make something be enabled by default, and we one day find we regret doing that, then there's a lot more trouble for the cluster administrators who have to cope with reversion.

Even better, IMO: leave it all alpha until we know what the APIs all look like and we are confident they could and would work without changes.

johnbelamaric commented 7 months ago

Yeah, I hear you @sftim. It's probably too much change in the code to go straight to beta. In that case, I think the target path would be:

1.30

1.31

1.32

1.33

thockin commented 7 months ago

I appreciate the breakdown. That said -- beta doesn't really exist. There's alpha (off by default), GA with low-confidence, and GA with high(er) confidence. I'm very reluctant to "beta" (GA with low confidence) this if we don't have a plan for how it will evolve to support autoscaling.

johnbelamaric commented 7 months ago

Template KEP is that plan

thockin commented 7 months ago

I will keep reading

pohly commented 7 months ago

Let's scope down this base KEP to a bare minimum viable product, which would target beta in 1.30 (this may be controversial):

  • ResourceClaim, with the addition of an optional nodeName field so the node can be manually determined when necessary
  • PodSpec.ResourceClaims field ... It's a little painful for Deployments, because you have to manually pre-create the resource claims associated with specific nodes, label those nodes, and then use that in the nodeSelector field of your deployment Pod template.

I don't find that a "minimum viable product". No one is going to use this, so we are not going to get more feedback even if we do promote this subset to beta. It also sounds like we need to implement new functionality that was never available as alpha, so how can we go to beta with it straight away?

The other downside is that we have to start adding more feature gate checks for specific fields, with all the associated logic (drop alpha fields, but only if not already set). This is adding work and complexity, and thus a risk to introduce new bugs.

If we have to reduce the scope for beta, then I would slice up the KEP differently, if (and only if) needed. But I am not going to dive into the how, because of this:

I asked in https://github.com/kubernetes/enhancements/pull/4384#issuecomment-1913110485 how many different feature gates we need in 1.30 when everything is still alpha. Let me repeat the key point: perhaps we don't need to decide now?

We could continue to use the existing DynamicResourceAllocation feature gate for everything. Then before promotion to beta, we add additional feature gates for things that remain in alpha. It would change how things get enabled compared to 1.30, but IMHO that is okay because it is an alpha feature, which can change from one release to the next.

The practical advantage is that for 1.30 we can skip the entire discussion around how to promote this and instead have that discussion later, for example in a working session at the KubeCon EU 2024 contributor summit (I have submitted a session proposal). It also makes the 1.30 implementation simpler (no additional feature gate checks).

pohly commented 7 months ago

We split out ResourceClaimTemplate into a new KEP, which would target alpha in 1.30. [...] Template KEP is that plan [for autoscaling]

The ResourceClaimTemplate is not what enables autoscaling. It solves the problem of per-pod resource claims when pods get generated by an app controller. This part also doesn't seem to be controversial, at least not anymore after I changed to dynamically generated names :wink:.

My plan for supporting autoscaling is numeric parameters.

johnbelamaric commented 7 months ago

We split out ResourceClaimTemplate into a new KEP, which would target alpha in 1.30. [...] Template KEP is that plan [for autoscaling]

The ResourceClaimTemplate is not what enables autoscaling. It solves the problem of per-pod resource claims when pods get generated by an app controller. This part also doesn't seem to be controversial, at least not anymore after I changed to dynamically generated names 😉.

My plan for supporting autoscaling is numeric parameters.

Yes - in the breakdown, the template and numerical parameter functionality is combined into one KEP. That's what I meant when I said that KEP is the plan. What's "controversial" isn't the template API per se, but the way it introduces complexity with scheduling. The numerical parameters will reduce that considerably.

I agree it was too aggressive to suggest even the scoped down thing in 1.30 for beta. You may be right that we can postpone the debate since we are staying all in alpha. But if we want a chance of delivering the solution in smaller, digestible chunks, I think we have to work out the right API now, which I don't think is quite there yet even for the basic ResourceClaim.

My concern is that the user-owned resource claim API is under-specified as written, because instead of the user specifying the node, one is picked at random during scheduling. So it's sort of unusable in the manual flow except for network-attached resources. Before we automate something (i.e., add templating and automatic scheduling), we need the manual flow to work. And I do think that if you give people an API that solves their use case, even with a little more manual prep-work / client-side work, people will use it.

Along those lines, the change is small. You just need to require the user to pick a node during the creation of the ResourceClaim (for non-network-attached resources); then users can pre-provision nodes with pools of associated resources and label those sets of nodes. This makes it an actual usable API, and makes the functionality composable: the automation (templates) builds directly on top of the manual process.

In fact, I think we can even push delayed allocation out-of-scope for the MVP, and still have something very useful. Typical UX would be:

This is a reasonable UX which will certainly be used. The scope of this is much, much simpler and smaller than the current base DRA KEP.

sftim commented 7 months ago

We can build on https://github.com/kubernetes/enhancements/issues/3063#issuecomment-1916899355 with a focused follow-up change to PodSchedulingContext: one that allows kubelets to decline to accept the Pod for arbitrary reasons.

In other words, a kubelet could look at the existing attached resources, and the node as it's running right now, and inform the control plane that there's no such GPU, or that a different Pod is already using that NUMA partition, or that the phase of the moon is wrong…

At that stage, this doesn't need to mean clever scheduling and doesn't actually count as dynamically allocating any resources. Maybe all the candidate nodes decline and the scheduler eventually gives up trying. Cluster autoscalers wouldn't be trying to make new nodes because the nodeSelector serves as proof that it doesn't help. In this story, a Pod that doesn't find a home on any node doesn't find a home, and doesn't run.

It's basic. However, just as @johnbelamaric explained, it's useful to some folk. The ability for a kubelet to demur through an update to PodSchedulingContext would support a bunch of related user stories, even if there are many others that still need work.


If we go this route, where's a good place to take that discussion?

pohly commented 7 months ago

What's "controversial" isn't the template API per se, but the way it introduces complexity with scheduling.

I don't get how templates add complexity for scheduling. The scheduler needs to wait for the created ResourceClaim, but that's all. That's the same as "wait for the user to create a ResourceClaim"; it doesn't make the scheduling more complex. Templates are not related to which node is picked.

My suggestion is that the user-owned resource claim API is under-specified as written, because instead of the user specifying the node, it randomly picks one during scheduling.

The "I want this claim for node xyz" doesn't need to be in the resource.k8s.io API/ResourceClaim API. It can go into the claim parameters for the driver. After all, it is the driver which needs to evaluate that information, right? If users must manually create ResourceClaims, then they can also create claim parameters for each of those.

Users create deployments or other workload controller resources to provision the pods, using a nodeSelector to map those to the set of nodes with the pre-provisioned resources.

So when a deployment is used, all pods reference the same ResourceClaim? Then all pods run on the same node, using the same hardware resource. I don't see how you intend to handle this. This will require some new kind of API, one which will become obsolete once we have what people really want (automatic scheduling). If you think that this is doable, then this deserves a separate KEP which explains all the details and what that API would look like. It's not just some reduced DRA KEP.

it's useful to some folk

Who are those folks? This seems very speculative to me.

The ability for a kubelet to demur through an update to PodSchedulingContext would support a bunch of related user stories

PodSchedulingContext is what people are trying to avoid...

pohly commented 7 months ago

If we go this route, where's a good place to take that discussion?

Write a provisional KEP, submit it. We can then meet at KubeCon EU to discuss face-to-face or set up online meetings.

johnbelamaric commented 7 months ago

So when a deployment is used, all pods reference the same ResourceClaim? Then all

Yeah, I think you're right, this doesn't quite work and templates are probably the fix.

The goal, as you said, is to avoid PodSchedulingContext, not templates really.

I still think it's possible to create a scoped down but still useful API that accomplishes that.

johnbelamaric commented 7 months ago

Who are those folks? This seems very speculative to me.

Today people solve this by grabbing the whole node and/or running privileged pods. This API avoids that, allowing an administrator to pre-allocate resources via the node-side (privileged) drivers without requiring the user pod to have those privileges. Those would be the users of this initial API.