kubernetes / enhancements

Enhancements tracking repo for Kubernetes
Apache License 2.0

Gang Scheduling Support in Kubernetes #4671

Open kerthcet opened 1 month ago

kerthcet commented 1 month ago

Enhancement Description

Please keep this description up to date. This will help the Enhancement Team to track the evolution of the enhancement efficiently.

kerthcet commented 1 month ago

/sig scheduling /assign

kerthcet commented 1 month ago

Some designs are based on https://github.com/kubernetes/enhancements/issues/3370.

alculquicondor commented 1 month ago

cc @cs20

utam0k commented 1 month ago

Recently, @sanposhiho and I have opened our gang scheduler plugin, which we actually use in our cluster. Since I believe our approach is different from co-scheduling, it is valuable to take it into consideration. This plugin doesn't require custom resources such as PodGroups. Perhaps it has tips to improve this proposal. https://github.com/pfnet/scheduler-plugins/tree/master/plugins/gang

kerthcet commented 1 month ago

Thanks @utam0k I'll take a look but the link is invalid.

utam0k commented 1 month ago

Thanks @utam0k I'll take a look but the link is invalid.

Sorry, I have fixed it

sanposhiho commented 1 month ago

@kerthcet Our company used a custom plugin similar to co-scheduling (design-wise) and hit challenges. I haven't actually followed this topic, but if you're planning to introduce the current co-scheduling plugin implementation almost as is, we'd definitely hit the same ones.

These are major challenges from our experience -


Our gang plugin overcomes those challenges, which is worth taking a look for you, hopefully :) On the other hand, I'm not saying we should therefore follow our plugin's design. In its design phase, I had to solve these problems using only what is allowed within plugins; we didn't want to fork the scheduler. So it would surely be much easier and simpler if we could change the scheduling framework itself to properly support scheduling a group of Pods.

kerthcet commented 1 month ago

Thanks for the feedback @sanposhiho

Took a brief look at https://github.com/pfnet/scheduler-plugins/tree/master/plugins/gang; as the README.md highlights, the gang scheduler improves on:

Let's say a group has 5 Pods; 4 Pods go through the scheduling cycle and get on waitOnPermit, while 1 Pod is rejected. The 4 Pods reserve their places until the last Pod goes through the scheduling cycle too (or the timeout is reached).

If 4 Pods passed the scheduling cycle, shouldn't we wait for the last Pod's scheduling result? Sorry, I didn't get your point here, but regarding co-scheduling, it has similar logic at https://github.com/kubernetes-sigs/scheduler-plugins/blob/531a1831bdda0bdad6057f7d00f3cafacdb93d86/pkg/coscheduling/coscheduling.go#L152.

After all, I do agree we should reject the podGroup scheduling ASAP if we're 100% sure that the podGroup will not succeed in the end.

As the number of groups in the cluster grows, the situation likely gets worse, up to something like a deadlock;

Yes, we should make sure the group Pods are queued up together. I'll take a deeper look at the approach in your implementation.

Difficulty in requeueing: For efficient requeueing, we should requeue a group of Pods when all Pods are ready to be scheduled.

This is a more fine-grained approach focused on performance.

Our gang plugin overcomes those challenges, which is worth taking a look for you, hopefully :)

Definitely I will, thanks for sharing.

sftim commented 1 month ago

Kubernetes already supports pluggable schedulers. Should this feature be delivered in-tree?

kerthcet commented 1 month ago

If you mean moving co-scheduling in-tree, yes, it's something similar, but more than that. Co-scheduling currently has several problems, such as:

This is not a design defect of co-scheduling, but a consequence of kube-scheduler being unaware of a group of Pods as a unit. So what we hope to do is make the scheduler aware of PodGroup, so that users can build more plugins on this concept.

sanposhiho commented 1 month ago

Simple configuration - Simplicity is what we were after, but annotations are arbitrary, not good for validation, and not that official. I may not follow this, but I got your points.

Yeah, regarding configuration via annotations, IMO we should avoid annotations if we intend to support it upstream. The native API would offer a much more robust solution.

If 4 Pods passed the scheduling cycle, shouldn't we wait for the last Pod's scheduling result?

We do need to wait, but the gang should give up being scheduled once one Pod is rejected and preemption cannot resolve that Pod's failure. Basically, the time Pods are allowed to wait in waitOnPermit should be kept very short; as I mentioned earlier, Pods in waitOnPermit effectively reserve nodes, which could easily lead to a situation where many nodes appear to have space (from a point of view outside the scheduler), yet are essentially reserved, making other pending Pods unschedulable.

For our gang scheduling plugin, consider a scenario with a gang size of five. If four Pods are waiting for the fifth Pod in waitOnPermit, but the fifth Pod becomes unschedulable, we should immediately move all Pods back to the unschedulable queue. Up to this point, coscheduling does the same as you pointed out.
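A minimal sketch of the behaviour described above, assuming a much simplified model: `Gang`, `Permit`, and `Reject` are hypothetical names for illustration and are not types or functions from kube-scheduler or from either plugin. Pods that pass their scheduling cycle hold a node while they wait, and a single unschedulable member releases every reservation at once.

```go
package main

import "sort"

// Gang is a hypothetical, simplified model of a Pod group; it is not a
// kube-scheduler type.
type Gang struct {
	Size     int               // number of Pods that must all be placed
	Reserved map[string]string // waiting Pod -> node it is holding
}

// NewGang returns an empty gang that needs size Pods.
func NewGang(size int) *Gang {
	return &Gang{Size: size, Reserved: make(map[string]string)}
}

// Permit records that pod passed its scheduling cycle and now holds node.
// It reports true once every member is waiting, i.e. the gang may bind.
func (g *Gang) Permit(pod, node string) bool {
	g.Reserved[pod] = node
	return len(g.Reserved) == g.Size
}

// Reject is called when one member turns out to be unschedulable. It
// releases every reservation at once and returns the Pods that must go
// back to the unschedulable queue, sorted for deterministic output.
func (g *Gang) Reject() []string {
	pods := make([]string, 0, len(g.Reserved))
	for pod := range g.Reserved {
		pods = append(pods, pod)
	}
	sort.Strings(pods)
	g.Reserved = make(map[string]string)
	return pods
}
```

With a gang of five, four calls to `Permit` all return false; calling `Reject` when the fifth Pod fails returns the four waiting Pods and leaves no node reserved.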

But coscheduling is too simple and leaves many scenarios out of consideration; let's think about it further. If we're using coscheduling, Pod1 - Pod4 have FailedPlugin: coscheduling and Pod5 has FailedPlugin: hoge-plugin. Then Pod1-Pod4 would be requeued based on coscheduling's registered events, while Pod5 would be requeued based on hoge-plugin's registered events.

So, each Pod would be requeued individually, which is problematic. Let's say only Pod1-Pod4 are requeued,

Let's say only Pod5 is requeued,

So -

Yes, we should make sure the group Pods are queued up together.

We concluded the same ;) We should not regard Pod1-Pod4 as rejected by coscheduling, but as rejected because of Pod5's failure. Our plugin requeues all of the Pods at once when a cluster event happens that might change hoge-plugin's result for Pod5 (in the latest scheduler, that is when hoge-plugin's QHint returns Queue for Pod5).
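The group requeueing idea can be sketched as follows. This is an illustration under stated assumptions, not the plugin's actual code: `Hint`, `HintFn`, `RejectedGang`, and the event names are hypothetical stand-ins, and `HintFn` only mimics the idea of a plugin's queueing hint. The whole gang is attributed to the one Pod that actually failed, and only an event that could change that Pod's result brings the gang back, all members together.

```go
package main

import "sort"

// Hint is a hypothetical stand-in for a queueing hint result.
type Hint int

const (
	Skip  Hint = iota // the event cannot make the Pod schedulable
	Queue             // the event may make the Pod schedulable
)

// HintFn is a hypothetical stand-in for the hint function of the plugin
// that rejected the blocking Pod (hoge-plugin in the discussion).
type HintFn func(event string) Hint

// RejectedGang remembers why the whole gang failed: one blocking Pod and
// the hint function of the plugin that rejected it.
type RejectedGang struct {
	Pods        []string // every member, including the blocking Pod
	BlockerPod  string   // the Pod whose failure rejected the gang
	BlockerHint HintFn   // hint of the plugin that rejected BlockerPod
}

// OnEvent returns the Pods to requeue for a cluster event: either the
// whole gang (sorted) or nothing. Members are never requeued one by one.
func (g RejectedGang) OnEvent(event string) []string {
	if g.BlockerHint(event) != Queue {
		return nil
	}
	pods := append([]string(nil), g.Pods...)
	sort.Strings(pods)
	return pods
}
```

The design choice here is that Pod1-Pod4 carry no requeue condition of their own; their fate depends entirely on the hint registered for Pod5.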

sanposhiho commented 1 month ago

Kubernetes already supports pluggable schedulers. Should this feature be delivered in-tree?

I've got the same feeling actually. We've kept coscheduling in sigs/scheduler-plugins until now; what's the motivation for moving it in-tree now?

(As I said, of course, a plugin implementation would be simpler if it were supported as an in-tree plugin and we changed the scheduler implementation for it.) Technically, our plugin shows that we can implement all the sophisticated tricks described in my comments (plus more) as a custom plugin, without requiring any implementation change on the scheduler side.

kerthcet commented 1 month ago

I mentioned several points above. Given that there is yet another gang scheduling implementation, we should support this upstream to avoid reimplementing it again and again.

sanposhiho commented 1 month ago

I just commented on that point in your doc: what's the actual pain point for us in this situation? Why do we want everyone to use the same gang scheduling solution? Can't we just say that we officially maintain coscheduling, and we don't care about the others maintained by other communities?

kerthcet commented 1 month ago

Why do we want everyone to use the same gang scheduling solution?

That's a good question. I think what we want is not to force everyone to use the same gang scheduling solution, but to make it extensible, since people may want different queueing or preemption logic. At the same time, we should provide a standard gang scheduling primitive for users; that doesn't mean it's the best one.

OTOH, I think all solutions share the same goal of making podGroup scheduling efficient; that's something we can plumb into the schedulingQueue, as we do with activeQ/backoffQ/unschedulablePods.

I'll add co-scheduling with the native scheduler as an Alternative and append it to the proposal later.

I'd like to hear other advice on how to make podGroup extensible. Do you have any advice based on your gang scheduler plugin, @sanposhiho?

alculquicondor commented 1 month ago

Why do we want everyone to use the same gang scheduling solution?

We need to standardize the API, at the very least.

Furthermore, we should have an in-tree plugin that is compatible with other sig-scheduling projects, primarily Kueue.

tenzen-y commented 1 month ago

Why do we want everyone to use the same gang scheduling solution?

We need to standardize the API, at the very least.

Furthermore, we should have an in-tree plugin that is compatible with other sig-scheduling projects, primarily Kueue.

Also, the in-tree plugin would be worthwhile when we support gang scheduling in JobSet and LeaderWorkerSet (sig-apps sub-projects).

kerthcet commented 1 month ago

Compatibility with other subprojects should be another goal. Will add to the proposal as well.

tenzen-y commented 1 month ago

Compatibility with other subprojects should be another goal. Will add to the proposal as well.

I didn't mean that we should work on integrations with subprojects. I just raised use cases.

kerthcet commented 1 month ago

Integration is a non-goal; what I'm referring to is compatibility.

tenzen-y commented 1 month ago

Integration is a non-goal; what I'm referring to is compatibility.

Yes, that's what I meant.

I DIDN'T mean that we should work on integrations with subprojects. I just raised use cases.

sanposhiho commented 1 month ago

OK, so from my eyes, it looks like

alculquicondor commented 1 month ago

To me, those 2 motivations are equally important.

Hopefully, in the future, people don't need to implement custom schedulers to get all-or-nothing scheduling.