DRA: Support scoring for devices and nodes in scheduling

johnbelamaric commented 1 day ago

Enhancement Description

DRA supports the concept of "under specifying" a request. This gives the scheduler more flexibility to satisfy a request, increasing the likelihood of success in environments with scarce resources. For example, rather than asking for a specific model of a device, the user can ask for any one of a set of models, as long as it has some minimum specified amount of memory.

Currently, DRA uses a "first fit" algorithm during scheduling. This can lead to inefficient choices. Building on the example above, if the user asks for a device with at least 4GB of memory, if the first device found has 80GB of memory, it will be chosen, even if there is another option with exactly 4GB. If scoring were available, the scheduler could evaluate the "waste" associated with each possible option, and make a more efficient choice.

Scoring is also critical in other situations where there is optionality in how to satisfy a request. For instance, in #4816 the user is allowed to provide a list of preferences. While that works to choose the "best" option on a given node, in reality most nodes have homogeneous selections of devices. So, in a cluster with nodes that meet the first option in the list, and nodes that meet the second option in the list, either one could be chosen. If DRA could score the nodes based on whether they satisfy the first or second option, then preference could be given to the first option across nodes, not just within a node.

The last important place for scoring is to help with fragmentation and bin packing. This is especially relevant with the implementation of #4815 pending. The ability to dynamical choose partitions of a device can lead to fragmentation; scoring can alleviate that to some extent. It is not a complete solution to that problem, but can help.

One-line enhancement description (can be used as a release note): Scoring support for improved Node and device selection during Pod scheduling
Kubernetes Enhancement Proposal: TBD
Discussion Link: TBD (live discussions at KubeCon)
Primary contact (assignee): @johnbelamaric
Responsible SIGs: sig-scheduling
Enhancement target (which target equals to which milestone):
- Alpha release target (x.y): 1.33
- Beta release target (x.y): 1.34
- Stable release target (x.y): 1.35
[ ] Alpha
- [ ] KEP (k/enhancements) update PR(s):
- [ ] Code (k/k) update PR(s):
- [ ] Docs (k/website) update PR(s):

/assign @johnbelamaric /cc @pohly @klueska @mortent @alculquicondor @wojtek-t /sig scheduling

johnbelamaric commented 1 day ago

/wg device-management

alculquicondor commented 1 day ago

cc @dom4ha

johnbelamaric commented 1 day ago

cc @asm582 @kannon92

Abhishek, I don't seem to have Olivier's github handle, please cc him as well.

johnbelamaric commented 1 day ago

cc @catblade

kannon92 commented 1 day ago

cc @tardieu

wojtek-t commented 1 day ago

Thanks for filing that John. Let me just share my high-level thoughts about it.

I obviously sympathize with the goal and usecase, but I would like us to spend time thinking about implementation. I think the primary option for many k8s folks would be to somehow incorporate it into existing scheduler model, namely into the concept of priorities in the scheduler. Saying that I'm not a fan of this mode would be an euphenism - the scoring model of scheduler has in my opinion two primary problems (and a number of smaller ones):

1) it's completely unintuitive for the and user and it's pretty much impossible to reason about Given that the score is an affinity combination of individual scoring functions, as a user (despite the fact that I worked on that codebase in the past), having a set of feasible nodes, I can't predict which node would finally be chosen, because different functions are pulling in different directions.

2) it was historically causing (and still somewhat is) a bunch of pain from performance/scalability perspective With introduction of even more complex rules, we would only add to this problem.

I didn't yet put enough thoughts into it, but I would like to explore a very different model of a cleaner decision tree. In that model, we stack-rank the preferences and on a given level we immediately reject all options with a score that is different than the highest for this particular scoring function. This immediately solve the problem of intuitiveness and reasoning about the choice, but would also help with performance as in majority of cases we don't need to compute all scores for all nodes.

Obviously we have a compatibility problem, but given we're talking about new feature maybe it's the right moment to seriously consider it now. And once we prove it, maybe we will be able to somehow get rid the old model in the future...

johnbelamaric commented 14 hours ago

Thanks @wojtek-t, good points. A few thoughts.

I would characterize your first point as "predictability" - as a user I have an idea of what's going to happen. We can try to make that a goal. One factor to consider is who needs to influence the decision, and how that affects predictability. I can think of (at least) four different roles that probably want some say in the scoring:

The end user - they may want to prioritize for performance or cost, for example.
The cluster admin - they will want to control how their infrastructure is used. Scoring can be one mechanism to influence that (for example, should we have a way for them to prioritize one device type over another, for cost/contractual reasons?).
The cloud provider / node architect - they may have some reasons to affect placement to ensure the best bin packing on their infrastructure, or to ensure the best performance
The device vendor - at the very least, they would be the source of performance data (device X is faster than device Y), if that becomes a scoring dimension

During the KEP process we will have to figure which of these we want to address, what the scope of control they should have, how to manage weights given to each, what APIs they need to influence decisions, and how to make all that comprehensible and predictable. It should be fun :)

On your second point, were those issues there for solely local decisions (scoring nodes for a single Pod), or did they arise out of things like pod affinity? One thing I would put explicitly out-of-scope for this KEP is any kind of cross-pod affinity or gang scheduling. I think we need a different solution than our existing kube-scheduler for that. As a corollary, I also don't want to address optimizing for future workloads that may be coming along soon (a la Kueue). For example, if I know that the next workload will need a full GPU, and I can fit the current workload on a MIG on an in-use GPU, but with slightly less performance, ideally I would probably choose the MIG. But in this KEP, I don't want to try to do that. I think it's already hard enough when we only make decisions for the single Pod we're looking at. That said, I think that whatever we do should be useful to solutions like Kueue or a multi-Pod scheduler to leverage as needed.

johnbelamaric commented 13 hours ago

cc @jingxu97

dom4ha commented 12 hours ago

If we want to support both the end user preference and cluster admin policy (to maintain high utilization), we probably cannot just stack rank, as we may end up either listening to the end user preference unconditionally, or follow admin policy ignoring the user preference (at least in the simplest implementation).

Tuning weights might be indeed challenging, but on the other hand reaching higher predictability (sacrificing flexibility) could be achieved by picking more radical weights without changing the whole model. For instance all kinds of DRA factors could have much stronger weights than all other non-DRA factors, without stack ranking all existing plugin types. I bet that scheduler uses flat weights because it's hard to definitely point out which factors are more and which are less important, so unpredictability at some level is probably inevitable.

Another small comment is that Autoscaler skips scoring, so it may give suboptimal predictions compared to the real scheduler future placements, but I'm not sure how much Autoscaler accuracy is a concern at this point.

wojtek-t commented 4 hours ago

Thanks for your thoughts, so let me share my further thoughts (some of which I already had before, some are triggered by above comments).

I would characterize your first point as "predictability"

Yes - it's exactly that. And I fully agree that there are multiple personas that would potentially like to affect that. In my mental model I had only 3 (workload owner, cluster admin and provider), but indeed we may need to think if provider shouldn't be split. My thinking about how those 3 affect the decisions are:

1) provider - exposes what "scoring function" are actually available for cluster admins and workload users 2) cluster-admin - defines the default order on those functions and potentially some additional constraints (e.g. function X can't be used at all, function A has to be more important than B, ...) 3) workload owner - can somehow affect the default order of scoring functions within the constraints provided by the admin

I'm happy to hear other thoughts on it, but this is how I was thinking about it.

On your second point, were those issues there for solely local decisions (scoring nodes for a single Pod), or did they arise out of things like pod affinity?

It's a bit of both:

for non-local decisions (like pod affinity) these are obviously much harder - and I'm fine with making these our of scope for now
for local decisions, if we have only a single node that is strictly better than any other wrt the most important scoring function, then we also potentially lose a bunch of resources computing all O(10) scores for every node;

As a corollary, I also don't want to address optimizing for future workloads that may be coming along soon (a la Kueue)

I have a bunch of thoughts about that too, but I'm fine with making at out-of-scope. If we build a reasonable building block here, it should be enough. And given we want to focus purely on "intra-node" aspect, I think we're fine here.

Tuning weights might be indeed challenging, but on the other hand reaching higher predictability (sacrificing flexibility) could be achieved by picking more radical weights without changing the whole model.

Technically that's true, but still see my performance point above - we may unnecessary use order of magnitude more resources in that model. If we want predictability, let's not hack around to achieve it, but rather let's do that properly.

Another small comment is that Autoscaler skips scoring, so it may give suboptimal predictions compared to the real scheduler future placements, but I'm not sure how much Autoscaler accuracy is a concern at this point.

CA is a bit different here, because it doesn't look only on a single node. It has an internal concept of scoring, but it's a bit different. I would like to achieve better unification here, but for the sake of making progress, I would also put it out of scope of now. @x13n - for your thoughts if you disagree

kubernetes / enhancements

DRA: Support scoring for devices and nodes in scheduling #4970

Enhancement Description