Gang Scheduling Support in Kubernetes

kerthcet commented 6 months ago

Enhancement Description

One-line enhancement description (can be used as a release note): Support gang scheduling primitive in Kubernetes
Kubernetes Enhancement Proposal:
Discussion Link: https://docs.google.com/document/d/1q4a8uB_he2gx_lB2YFxsGaSVqMtF6CGJBhFkvArwnd4/edit?usp=sharing
Primary contact (assignee): @kerthcet
Responsible SIGs: sig-scheduling
Enhancement target (which target equals to which milestone):
- Alpha release target (x.y): 1.32
- Beta release target (x.y):
- Stable release target (x.y):
- [ ] Alpha
  - [ ] KEP (k/enhancements) update PR(s):
  - [ ] Code (k/k) update PR(s):
  - [ ] Docs (k/website) update PR(s):

Please keep this description up to date. This will help the Enhancement Team to track the evolution of the enhancement efficiently.

kerthcet commented 6 months ago

/sig scheduling /assign

kerthcet commented 6 months ago

Some designs are based on https://github.com/kubernetes/enhancements/issues/3370.

alculquicondor commented 6 months ago

cc @cs20

utam0k commented 6 months ago

Recently, @sanposhiho and I open-sourced our gang scheduler plugin, which we use in our own cluster. I believe our approach differs from co-scheduling, so it is worth taking into consideration. This plugin doesn't require custom resources such as PodGroups. It may offer some ideas for improving this proposal. https://github.com/pfnet/scheduler-plugins/tree/master/plugins/gang

kerthcet commented 6 months ago

Thanks @utam0k I'll take a look but the link is invalid.

utam0k commented 6 months ago

Thanks @utam0k I'll take a look but the link is invalid.

Sorry, I have fixed it

sanposhiho commented 6 months ago

@kerthcet Our company used a custom plugin similar to co-scheduling (design-wise) and hit challenges. (I haven't actually followed this topic, though.) If you're planning to introduce the current co-scheduling plugin implementation almost as is, we'd definitely hit the same challenges.

These are major challenges from our experience -

Inefficient scheduling: waiting Pods reserve too much space in a cluster.
- Let's say a group has 5 Pods; 4 Pods go through the scheduling cycle and wait at waitOnPermit, while 1 Pod is rejected. The 4 Pods hold their reservations until the last Pod also goes through the scheduling cycle (or the timeout is reached).
- As the number of groups in the cluster grows, this can degrade into something like a deadlock; many Pods are unschedulable even though the cluster has plenty of space, because a few Pods in each group reserve space and wait for the rest of their group, while the rest cannot get through the scheduling cycle because so much space is reserved.
Difficulty in requeueing: for efficient requeueing, we should requeue a group of Pods only when all of its Pods are ready to be scheduled.
- Let's say a group has 5 Pods; one Pod is rejected by resource fit, another Pod is rejected by NodeAffinity, and three Pods are schedulable (they have to wait for the other 2 Pods). In this case, we have to requeue all 5 Pods once both the resource fit plugin's failure for Pod-1 and the NodeAffinity plugin's failure for Pod-2 are resolved.
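The first challenge above can be reproduced with a toy model. The sketch below is not scheduler code: the `Cluster` class, the slot-based capacity, and the round-robin ordering are all assumptions made for illustration. It only shows how Pods holding reservations at Permit can leave every gang incomplete even though the cluster could fit one whole gang.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Toy cluster (hypothetical): capacity is a count of identical Pod-sized slots."""
    capacity: int
    reserved: dict = field(default_factory=dict)  # group name -> slots held at Permit

    def free(self) -> int:
        return self.capacity - sum(self.reserved.values())

    def schedule_one(self, group: str) -> bool:
        """Run one Pod of `group` through a scheduling cycle.

        On success the Pod waits at Permit and keeps its slot reserved.
        """
        if self.free() == 0:
            return False  # rejected: no room left
        self.reserved[group] = self.reserved.get(group, 0) + 1
        return True


def schedule_interleaved(capacity: int, groups: dict) -> dict:
    """Schedule Pods round-robin across groups, one Pod per group per turn.

    `groups` maps group name -> gang size. Returns the slots held per group
    once no further Pod can make progress.
    """
    cluster = Cluster(capacity)
    pending = dict(groups)
    progress = True
    while progress:
        progress = False
        for name in groups:
            if pending[name] > 0 and cluster.schedule_one(name):
                pending[name] -= 1
                progress = True
    return cluster.reserved


# 8 slots are enough for one gang of 5, but the two gangs split them 4/4.
held = schedule_interleaved(capacity=8, groups={"a": 5, "b": 5})
complete = [g for g, n in held.items() if n == 5]
print(held, complete)
```

In this run both gangs hold four slots and neither reaches its size of five, so all eight slots stay reserved until a timeout, which is the deadlock-like situation described above.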

Our gang plugin overcomes those challenges, which is worth taking a look for you, hopefully :) On the other hand, I'm not saying we should therefore follow our plugin's design. When designing it, I had to solve these problems using only what is allowed within plugins; we didn't want to fork the scheduler. So it would surely be much easier and simpler if we could change the scheduling framework itself to properly support scheduling a group of Pods.

kerthcet commented 6 months ago

Thanks for the feedback @sanposhiho

Took a brief look at https://github.com/pfnet/scheduler-plugins/tree/master/plugins/gang. As the README.md highlights, the gang scheduler improves on:

Simple configuration - Simplicity is what we're after, but annotations are arbitrary, hard to validate, and not really official. I may not follow this, but I get your point.
Enhanced requeueing - This is something we should consider when we focus on solving the performance problem.

Let's say a group has 5 Pods; 4 Pods go through the scheduling cycle and wait at waitOnPermit, while 1 Pod is rejected. The 4 Pods hold their reservations until the last Pod also goes through the scheduling cycle (or the timeout is reached).

If 4 Pods passed the scheduling cycle, shouldn't we wait for the last Pod's scheduling result? Sorry, I didn't get your point here, but co-scheduling has similar logic at https://github.com/kubernetes-sigs/scheduler-plugins/blob/531a1831bdda0bdad6057f7d00f3cafacdb93d86/pkg/coscheduling/coscheduling.go#L152.

After all, I do agree we should reject the podGroup's scheduling ASAP if we're 100% sure that the podGroup will not succeed in the end.

As the number of groups in the cluster grows, this can degrade into something like a deadlock;

Yes, we should make sure the group Pods are queued up together. I'll take a deeper look at your implementation.

Difficulty in requeueing: for efficient requeueing, we should requeue a group of Pods only when all of its Pods are ready to be scheduled.

This is a more fine-grained approach focused on performance.

Our gang plugin overcomes those challenges, which is worth taking a look for you, hopefully :)

Definitely I will, thanks for sharing.

sftim commented 6 months ago

Kubernetes already supports pluggable schedulers. Should this feature be delivered in-tree?

kerthcet commented 6 months ago

If you mean moving co-scheduling in-tree, yes, it's something similar, but more than that. Co-scheduling currently has several problems, such as:

It is stateful, whereas a plugin should ideally be stateless (within each scheduling cycle) and lightweight
Queueing problems for PodGroups
It requires maintaining additional components, such as another backoffQ
... and possibly other problems, like the performance issues mentioned above

These are not design defects of co-scheduling; they arise because kube-scheduler is unaware of a group of Pods as a unit. So what we hope to do is make the scheduler aware of PodGroups, so that users can build more plugins on this concept.

sanposhiho commented 6 months ago

Simple configuration - Simplicity is what we're after, but annotations are arbitrary, hard to validate, and not really official. I may not follow this, but I get your point.

Yeah, regarding configuration via annotations, IMO we should avoid annotations if we intend to support it upstream. A native API would offer a much more robust solution.

If 4 Pods passed the scheduling cycle, shouldn't we wait for the last Pod's scheduling result?

We do need to wait, but the gang should give up on being scheduled once one Pod is rejected and preemption cannot resolve that Pod's failure. Basically, the time Pods are allowed to wait in waitOnPermit should be kept to a minimum; as I mentioned earlier, Pods in waitOnPermit effectively reserve nodes, which can easily lead to a situation where many nodes appear to have space (from a point of view outside the scheduler) yet are essentially reserved, making other pending Pods unschedulable.

For our gang scheduling plugin, consider a scenario with a gang size of five. If four Pods are waiting for the fifth Pod in waitOnPermit, but the fifth Pod becomes unschedulable, we should immediately move all Pods back to the unschedulable queue. Up to this point, coscheduling does the same as you pointed out.
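As a rough illustration of the "give up early" rule described above, here is a minimal Python sketch. It is not taken from either plugin: the function name `handle_rejection`, the dictionary of waiting Pods, and the boolean `preemption_can_help` are hypothetical stand-ins for state that a real scheduler plugin would track differently.

```python
def handle_rejection(gang, rejected_pod, waiting_on_permit, unschedulable, preemption_can_help):
    """Decide what happens to a gang when one of its Pods fails a scheduling cycle.

    waiting_on_permit: dict mapping gang name -> list of Pods holding reservations.
    unschedulable: list acting as the unschedulable queue.
    Returns the list of Pods whose reservations were released.
    """
    if preemption_can_help:
        # Preemption may still make room for the rejected Pod, so keep waiting.
        return []
    released = waiting_on_permit.pop(gang, [])
    # The whole gang goes back together, so no node stays reserved for it.
    unschedulable.extend(released + [rejected_pod])
    return released


waiting = {"gang-a": ["pod1", "pod2", "pod3", "pod4"]}
unsched = []
released = handle_rejection("gang-a", "pod5", waiting, unsched, preemption_can_help=False)
print(released, unsched, waiting)
```

The point of the sketch is only the ordering of decisions: reservations are dropped as soon as the gang cannot complete, rather than being held until a timeout.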

But coscheduling is too simple and leaves many scenarios out of consideration; let's think about it further. If we're using coscheduling, Pod1-Pod4 have FailedPlugin: coscheduling and Pod5 has FailedPlugin: hoge-plugin. Then Pod1-Pod4 would be requeued based on coscheduling's registered events, and Pod5 would be requeued based on hoge-plugin's registered events.

So each Pod would be requeued individually, which is problematic. Let's say only Pod1-Pod4 are requeued:

they again wouldn't get past waitOnPermit until Pod5 comes
they wouldn't get back to unschedQ until Pod5 is requeued (or they time out), because no one from the same gang goes through PostFilter.

Let's say only Pod5 is requeued:

when Pod5 is requeued and reaches waitOnPermit, Pod1-Pod4 could be schedulable.
but Pod1-Pod4 actually won't be requeued until coscheduling's events happen.

So -

Yes, we should make sure the group Pods are queued up together.

we concluded the same ;) We should not regard Pod1-Pod4 as rejected by coscheduling, but as rejected because of Pod5's failure. Our plugin requeues all of the gang's Pods at once when a cluster event happens that might change hoge-plugin's result for Pod5. (In the latest scheduler, that is when hoge-plugin's QHint returns Queue for Pod5.)
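The requeueing idea can be sketched as simple bookkeeping. The following Python model is an assumption-laden illustration, not the plugin's implementation: `GangRequeuer`, the `Hint` enum, and the event names `"NodeAdded"` and `"PodDeleted"` are hypothetical, and a real scheduler would also have to handle a blocker becoming unresolved again.

```python
from enum import Enum


class Hint(Enum):
    """Toy stand-in for a queueing hint result."""
    QUEUE = "Queue"
    QUEUE_SKIP = "QueueSkip"


class GangRequeuer:
    """Toy bookkeeping: requeue a gang only when every blocking failure may be resolved."""

    def __init__(self, members, blockers, hints):
        self.members = list(members)      # all Pods in the gang
        self.unresolved = dict(blockers)  # pod -> plugin that really rejected it
        self.hints = hints                # plugin -> callable(event, pod) -> Hint

    def on_event(self, event):
        """Return the Pods to requeue for this cluster event (possibly none)."""
        if not self.unresolved:
            return []  # nothing is blocking, or the gang was already requeued
        for pod, plugin in list(self.unresolved.items()):
            if self.hints[plugin](event, pod) is Hint.QUEUE:
                del self.unresolved[pod]
        # Pods that only waited for their gang mates never appear in `unresolved`;
        # they ride along once the real blockers are cleared.
        return list(self.members) if not self.unresolved else []


def hoge_hint(event, pod):
    # Hypothetical hint: only a new node could fix hoge-plugin's rejection.
    return Hint.QUEUE if event == "NodeAdded" else Hint.QUEUE_SKIP


gang = GangRequeuer(
    members=["pod1", "pod2", "pod3", "pod4", "pod5"],
    blockers={"pod5": "hoge-plugin"},
    hints={"hoge-plugin": hoge_hint},
)
first = gang.on_event("PodDeleted")  # hint says skip: nothing moves
second = gang.on_event("NodeAdded")  # hint says queue: the whole gang moves together
print(first, second)
```

Here Pod1-Pod4 are never treated as rejected in their own right; they are requeued only because the event might resolve Pod5's failure.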

sanposhiho commented 6 months ago

Kubernetes already supports pluggable schedulers. Should this feature be delivered in-tree?

I've actually got the same feeling. We've kept coscheduling in sigs/scheduler-plugins until now; what's the motivation for moving it to the in-tree plugins now?

(As I said, a plugin implementation would of course be simpler if it's supported as an in-tree plugin and we change the scheduler implementation for it.) Technically, our plugin shows that we can implement all the sophisticated tricks described in my comments (plus more) as a custom plugin, without requiring any implementation change on the scheduler side.

kerthcet commented 6 months ago

I mentioned several points above. Given that there is yet another gang scheduling implementation, we should support this upstream to avoid it being reimplemented again and again.

sanposhiho commented 6 months ago

I just commented on that point in your doc: what's the actual pain point for us in this situation? Why do we want everyone to use the same gang scheduling solution? Can't we just say that we officially maintain coscheduling, and we don't care about the others maintained by other communities?

kerthcet commented 6 months ago

Why do we want everyone to use the same gang scheduling solution?

That's a good question. I think what we want is not to force everyone to use the same gang scheduling solution, but to make it extensible, since people may want different queueing or preemption logic. At the same time, we should provide a standard gang scheduling primitive for users; that doesn't mean it's the best one.

OTOH, I think all solutions share the same goal of making podGroup scheduling efficient; that's something we can plumb into the schedulingQueue, as we do with activeQ/backoffQ/unschedulablePods.

I'll list co-scheduling with the native scheduler as an Alternative and add it to the proposal later.

I'd like to hear other advice on how to make podGroup extensible. Do you have any advice based on your gang scheduler plugin, @sanposhiho?

alculquicondor commented 6 months ago

Why do we want everyone to use the same gang scheduling solution?

We need to standardize the API, at the very least.

Furthermore, we should have an in-tree plugin that is compatible with other sig-scheduling projects, primarily Kueue.

tenzen-y commented 6 months ago

Why do we want everyone to use the same gang scheduling solution?

We need to standardize the API, at the very least.

Furthermore, we should have an in-tree plugin that is compatible with other sig-scheduling projects, primarily Kueue.

Also, the in-tree plugin would be worthwhile when we support gang scheduling in JobSet and LeaderWorkerSet (sig-apps sub-projects).

kerthcet commented 6 months ago

Compatibility with other subprojects should be another goal. Will add it to the proposal as well.

tenzen-y commented 6 months ago

Compatibility with other subprojects should be another goal. Will add it to the proposal as well.

I didn't mean that we should work on integrations with subprojects. I just raised use cases.

kerthcet commented 6 months ago

Integration is a non-goal; what I'm referring to is compatibility.

tenzen-y commented 6 months ago

Integration is a non-goal; what I'm referring to is compatibility.

Yes, that's what I meant.

I DIDN'T mean that we should work on integrations with subprojects. I just raised use cases.

sanposhiho commented 5 months ago

OK, so from my point of view, it looks like:

Standardizing the API and making the vanilla scheduler compatible with various subprojects is the first, core motivation.
- Without this, various subprojects would have to require a custom scheduler with the coscheduling plugin, which would be a troublesome hurdle.
Other technical reasons, such as a better gang plugin implementation enabled by changes in the scheduler core, come second.
- They couldn't be the first motivation since, as I said, we can technically implement sophisticated gang scheduling as out-of-tree plugins (with some tricks, though ;)).

alculquicondor commented 5 months ago

To me, those 2 motivations are equally important.

Hopefully, in the future, people don't need to implement custom schedulers to get all-or-nothing scheduling.

k8s-triage-robot commented 1 month ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle stale
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

googs1025 commented 1 month ago

/remove-lifecycle stale

kerthcet commented 1 month ago

I will revisit this later and target it for the next release.

alculquicondor commented 1 month ago

cc @mwielgus

alculquicondor commented 1 month ago

cc @ahg-g @thockin

wojtek-t commented 1 month ago

/cc

mrunalp commented 1 month ago

/cc

x13n commented 2 weeks ago

/cc

kubernetes / enhancements

Gang Scheduling Support in Kubernetes #4671
