kerthcet opened this issue 6 months ago
/sig scheduling
/assign
Some designs are based on https://github.com/kubernetes/enhancements/issues/3370.
cc @cs20
Recently, @sanposhiho and I open-sourced our gang scheduler plugin, which we actually use in our cluster. Since our approach is different from co-scheduling, I believe it is worth taking into consideration. This plugin doesn't require custom resources such as PodGroups. Perhaps it offers some tips to improve this proposal. https://github.com/pfnet/scheduler-plugins/tree/master/plugins/gang
Thanks @utam0k I'll take a look but the link is invalid.
Thanks @utam0k I'll take a look but the link is invalid.
Sorry, I have fixed it
@kerthcet Our company used a custom plugin similar to co-scheduling (design-wise), and hit challenges. (I actually haven't followed this topic closely though.) If you're planning to introduce the current co-scheduling plugin implementation almost as is, we'd definitely hit the same challenges too.
These are the major challenges from our experience:
- Reserving places at waitOnPermit: let's say a group has 5 Pods; 4 Pods go through the scheduling cycle and get on waitOnPermit, while 1 Pod is rejected. The 4 Pods reserve their places until the last Pod goes through the scheduling cycle too (or reaches the timeout). As the number of groups in the cluster grows, the situation likely gets worse, like a deadlock.
- Difficulty in requeueing: for efficient requeueing, we should requeue a group of Pods only when all Pods are ready to get scheduled.

Our gang plugin overcomes those challenges, which is hopefully worth a look for you :) But, on the other hand, I'm not saying we should then follow our plugin's design. At its design phase, I had to solve these problems only with what is allowed within plugins; we didn't want to fork the scheduler. So it should be much easier/simpler for sure if we could introduce changes in the scheduling framework itself to wisely support scheduling a group of Pods.
Thanks for the feedback @sanposhiho
I took a brief look at https://github.com/pfnet/scheduler-plugins/tree/master/plugins/gang; as the README.md highlights, the gang scheduler has several enhancements.
Let's say a group has 5 Pods; 4 Pods go through the scheduling cycle and get on waitOnPermit, while 1 Pod is rejected. The 4 Pods reserve their places until the last Pod goes through the scheduling cycle too (or reaches the timeout).
If 4 Pods passed the scheduling cycle, shouldn't we wait for the last Pod's scheduling result? Sorry, I didn't get your point here, but regarding co-scheduling, it has similar logic; see https://github.com/kubernetes-sigs/scheduler-plugins/blob/531a1831bdda0bdad6057f7d00f3cafacdb93d86/pkg/coscheduling/coscheduling.go#L152.
After all, I do agree we should reject the podGroup's scheduling ASAP if we're 100% sure that the podGroup will not succeed in the end.
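For context on the Permit-based flow referenced above, here is a minimal sketch of what such a plugin looks like against the scheduling framework's PermitPlugin interface. The plugin name and the groupOf helper are made up, and this is an approximation of the idea rather than the coscheduling plugin's actual code:

```go
package gangexample

import (
	"context"
	"time"

	v1 "k8s.io/api/core/v1"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

// GangPermit is a hypothetical Permit plugin: each Pod that passes the
// scheduling cycle waits at WaitOnPermit until every member of its group has
// also passed, or the permit timeout expires.
type GangPermit struct {
	handle framework.Handle
}

var _ framework.PermitPlugin = &GangPermit{}

func (g *GangPermit) Name() string { return "GangPermitExample" }

func (g *GangPermit) Permit(ctx context.Context, state *framework.CycleState, pod *v1.Pod, nodeName string) (*framework.Status, time.Duration) {
	group, total := groupOf(pod) // made-up helper: group key and expected size
	if group == "" {
		return framework.NewStatus(framework.Success, ""), 0
	}

	// Count members that already reached WaitOnPermit (plus this Pod).
	waiting := 1
	g.handle.IterateOverWaitingPods(func(wp framework.WaitingPod) {
		if gr, _ := groupOf(wp.GetPod()); gr == group {
			waiting++
		}
	})

	if waiting < total {
		// Keep the assumed node reserved and wait for the rest of the group.
		// This is exactly the "Pods reserve places at waitOnPermit" behavior
		// discussed above.
		return framework.NewStatus(framework.Wait, ""), 30 * time.Second
	}

	// All members arrived: release every waiting sibling and this Pod.
	g.handle.IterateOverWaitingPods(func(wp framework.WaitingPod) {
		if gr, _ := groupOf(wp.GetPod()); gr == group {
			wp.Allow(g.Name())
		}
	})
	return framework.NewStatus(framework.Success, ""), 0
}

// groupOf is a made-up helper; a real plugin would read the group name and
// expected size from labels/annotations or a PodGroup object.
func groupOf(pod *v1.Pod) (string, int) {
	return "", 0
}
```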
As the number of groups in the cluster grows, the situation likely gets worse, like a deadlock.
Yes, we should make sure the group Pods are queued up together. I'll take a deeper look at your approach and implementation.
Difficulty in requeueing: for efficient requeueing, we should requeue a group of Pods only when all Pods are ready to get scheduled.
This is a more fine-grained approach focused on performance.
Our gang plugin overcomes those challenges, which is hopefully worth a look for you :)
Definitely I will, thanks for sharing.
Kubernetes already supports pluggable schedulers. Should this feature be delivered in-tree?
If you mean moving co-scheduling in-tree, yes, it's something similar, but more than that. Co-scheduling currently has several problems, like the ones discussed above.
This is not a design defect of co-scheduling itself, but a consequence of kube-scheduler being unaware of a group of Pods as a unit. So what we hope to do is make the scheduler aware of PodGroups, so users can build more plugins on top of this concept.
Simple configuration - simplicity is what we're chasing, but annotations are arbitrary, not good for validation, and not that official. I may not follow this approach, but I got your point.
Yeah, regarding configuration via annotations, IMO we should avoid annotations if we intend to support this upstream. A native API would offer a much more robust solution.
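To make the annotation-vs-API point concrete, here is a rough, purely illustrative sketch of what a typed PodGroup object could look like. The group/version, kind, and field names are assumptions modeled on the existing sigs.k8s.io coscheduling CRD, not a proposal:

```go
package v1alpha1 // hypothetical API group/version for illustration only

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// PodGroup is an illustrative, hypothetical typed API for describing a gang.
// Compared to free-form annotations on each Pod, a typed object can be
// validated, defaulted, and versioned by the API server.
type PodGroup struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec PodGroupSpec `json:"spec"`
}

// PodGroupSpec mirrors the kind of fields the coscheduling PodGroup CRD
// exposes today.
type PodGroupSpec struct {
	// MinMember is the minimum number of member Pods that must be able to
	// run together before any of them is bound (all-or-nothing).
	MinMember int32 `json:"minMember"`

	// ScheduleTimeoutSeconds bounds how long members may wait for the rest
	// of the group (e.g. at WaitOnPermit) before the whole group is retried.
	ScheduleTimeoutSeconds *int32 `json:"scheduleTimeoutSeconds,omitempty"`
}
```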
If 4 Pods passed the scheduling cycle, shouldn't we wait for the last Pod's scheduling result?
We do need to wait, but the gang should give up being scheduled once one Pod is rejected and preemption cannot resolve that rejected Pod's failure.
Basically, the time Pods are allowed to wait in waitOnPermit should be very minimal; as I mentioned earlier, Pods sitting in waitOnPermit reserve nodes, which could easily lead to a situation where many nodes appear to have space (from outside the scheduler's point of view), yet are essentially reserved, making other pending Pods unschedulable.
For our gang scheduling plugin, consider a scenario with a gang size of five. If four Pods are waiting for the fifth Pod in waitOnPermit, but the fifth Pod becomes unschedulable, we should immediately move all Pods back to the unschedulable queue. Up to this point, coscheduling does the same, as you pointed out.
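Continuing the hedged GangPermit sketch from earlier, the "immediately move all Pods back" step can be expressed roughly like this: when one member turns out to be unschedulable, reject the siblings parked at WaitOnPermit so their reserved places are freed right away instead of being held until the permit timeout (rejectWaitingSiblings and groupOf are made-up names, not an existing plugin's API):

```go
// rejectWaitingSiblings sketches what a gang plugin can do once one member is
// found unschedulable: reject every sibling currently waiting at WaitOnPermit
// so their reserved nodes are released right away rather than being held
// until the permit timeout expires.
func (g *GangPermit) rejectWaitingSiblings(failed *v1.Pod) {
	group, _ := groupOf(failed)
	if group == "" {
		return
	}
	g.handle.IterateOverWaitingPods(func(wp framework.WaitingPod) {
		if gr, _ := groupOf(wp.GetPod()); gr == group {
			// A rejected waiting Pod goes back through the scheduling queue.
			wp.Reject(g.Name(), "another member of the gang is unschedulable")
		}
	})
}
```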
But coscheduling is too simple; it leaves many scenarios out of consideration. Let's think about it further.
If we're using coscheduling, Pod1-Pod4 have FailedPlugin: coscheduling and Pod5 has FailedPlugin: hoge-plugin. Then, Pod1-Pod4 would be requeued based on coscheduling's registered events, and Pod5 would be requeued based on hoge-plugin's registered events. So, each Pod would be requeued individually, which is problematic:
- Let's say only Pod1-Pod4 are requeued: they just end up waiting at waitOnPermit again until Pod5 comes.
- Let's say only Pod5 is requeued: it ends up waiting at waitOnPermit alone, even though Pod1-Pod4 could be schedulable.

So -
Yes, we should make sure the group Pods are queued up together.
we concluded the same ;) We should not regard Pod1-Pod4 as rejected by coscheduling, but regard them as rejected because of "Pod5's failure". Our plugin requeues all of the Pods at once when some cluster event happens that might change the result for Pod5 according to hoge-plugin (= with the latest scheduler, it's when hoge-plugin's QHint returns Queue for Pod5).
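For the requeueing idea (requeue the gang only when the event could also unblock the member that caused the rejection), a rough sketch in the shape of the scheduler's QueueingHintFn could look like the following. It extends the earlier GangPermit sketch, mayUnblockRejectedMember is a made-up stand-in for consulting the rejecting plugin's hint, and the exact framework signatures have shifted between releases, so treat this as an approximation:

```go
// Requires the imports from the earlier sketch, plus "k8s.io/klog/v2".

// queueingHintForGang sketches the requeueing policy discussed above, in the
// shape of the scheduling framework's QueueingHintFn: when a cluster event
// arrives, requeue a rejected gang member only if the event could also make
// the member that blocked the gang (Pod5 in the example) schedulable again.
func (g *GangPermit) queueingHintForGang(logger klog.Logger, pod *v1.Pod, oldObj, newObj interface{}) (framework.QueueingHint, error) {
	group, _ := groupOf(pod)
	if group == "" {
		// Not part of a gang: fall back to requeueing on any matching event.
		return framework.Queue, nil
	}

	if mayUnblockRejectedMember(group, oldObj, newObj) {
		// Requeue, together with the rest of the gang.
		return framework.Queue, nil
	}

	// Requeueing this member alone would only park it at waitOnPermit again,
	// reserving a node for nothing, so skip.
	return framework.QueueSkip, nil
}

// mayUnblockRejectedMember is a hypothetical helper; a real plugin would track
// which member was rejected, by which plugin, and consult that plugin's hints.
func mayUnblockRejectedMember(group string, oldObj, newObj interface{}) bool {
	return false
}
```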
Kubernetes already supports pluggable schedulers. Should this feature be delivered in-tree?
I've got the same feeling actually. We've kept coscheduling in sigs/scheduler-plugins until now; what's the motivation for moving it to an in-tree plugin now?
(As I said, of course, a plugin implementation would be simpler if it's supported as an in-tree plugin and we change the scheduler implementation for it.) Technically, our plugin shows that we can implement all the sophisticated tricks described in my comments (plus more) as a custom plugin, without requiring any implementation change on the scheduler side.
I mentioned several points above; given the fact that yet another gang scheduling implementation exists, we should support this upstream to avoid the same thing being reimplemented again and again.
I just commented on that point in your doc: what's the actual pain point for us in this situation? Why do we want everyone to use the same gang scheduling solution? Can't we just say that we officially maintain coscheduling, and we don't care about the others maintained by other communities?
Why do we want everyone to use the same gang scheduling solution?
That's a good question. I think what we want is not to force everyone to use the same gang scheduling solution, but to make it extensible, since people may want different queueing or preemption logic. At the same time, we should provide a standard gang scheduling primitive for users; that doesn't mean it's the best one.
OTOH, I think all solutions share the same goal of making podGroup scheduling efficient; that's what we can plumb into the schedulingQueue, as we do with activeQ/backoffQ/unschedulablePods.
I'll list co-scheduling with the native scheduler as an Alternative and append it to the proposal later.
I hope to hear more advice about how to make podGroup extensible. Do you have any advice based on your gang scheduler plugin, @sanposhiho?
Why do we want everyone to use the same gang scheduling solution?
We need to standardize the API, at the very least.
Furthermore, we should have an in-tree plugin that is compatible with other sig-scheduling projects, primarily Kueue.
Why do we want everyone to use the same gang scheduling solution?
We need to standardize the API, at the very least.
Furthermore, we should have an in-tree plugin that is compatible with other sig-scheduling projects, primarily Kueue.
Also, the in-tree plugin would be worth it when we support gang scheduling in JobSet and LeaderWorkerSet (sig-apps sub-projects).
Compatibility with other subprojects should be another goal. Will add it to the proposal as well.
Compatibility with other subprojects should be another goal. Will add it to the proposal as well.
I didn't mean that we should work on integrations with subprojects. I just raised use cases.
Integration is out-of-goal, what I refer to is compatibility.
Integration is out-of-goal, what I refer to is compatibility.
Yes, that's what I meant to say.
I DIDN'T mean that we should work on integrations with subprojects. I just raised use cases.
OK, so from my eyes, it looks like the motivations are standardizing the API and having an in-tree plugin that other sig-scheduling projects (primarily Kueue) can rely on.
To me, those 2 motivations are equally important.
Hopefully, in the future, people won't need to implement custom schedulers to get all-or-nothing scheduling.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
I will revisit this later and target it for the next release.
cc @mwielgus
cc @ahg-g @thockin
/cc
/cc
/cc
Enhancement Description
- KEP (k/enhancements) update PR(s):
- Code (k/k) update PR(s):
- Docs (k/website) update PR(s):

Please keep this description up to date. This will help the Enhancement Team to track the evolution of the enhancement efficiently.