volcano-sh / volcano

A Cloud Native Batch System (Project under CNCF)
https://volcano.sh
Apache License 2.0
4.24k stars 971 forks source link

Any way to know result of other plugins' predicateFn results within another plugin? #3736

Open qGentry opened 2 months ago

qGentry commented 2 months ago

Please describe your problem in detail

Hi guys, I've recently created this issue https://github.com/volcano-sh/volcano/issues/3712 about how to handle cases when I have multiple InfiniBand clusters within single K8s. Basically, I want for pods in podgroup to schedule only within single IB clusters.

I've implemented such plugin and it seem to work properly on cases I've described here https://github.com/volcano-sh/volcano/issues/3712#issuecomment-2337737446

I'm new to volcano codebase and have limited experience with golang, so it probably far from ideal, here it is: https://github.com/qGentry/volcano/blob/feature-ibclusters-plugin/pkg/scheduler/plugins/clusters/clusters.go#L53

During implementation I've had to do some weird hack which I don't really like.

Let me first explain how this plugin works, plugin just implements predicateFn as following:

  1. Extract from ssn minAvailable replicas for this task and all nodes in cluster.
  2. Then, based on nodes' labels, create a map {cluster_name: []node}.
  3. Filter clusters so only clusters who has len(nodes) >= minAvaiable are left
  4. Randomly select cluster
  5. For given node, check if it is in randomly selected cluster, if is, allowed scheduling, otherwise - don't allow.

Weird thing here is random cluster selection, it works for me as I have only 3 or 4, but this would really slow down scheduling if I would have 10+ clusters. If instead of randomly selecting cluster I would do it deterministically, for example, first one in sorted list, this wouldn't work as other plugin might have filtered nodes in first cluster and actually amount of available nodes is smaller than minAvailable.

Ideal Implementation would be if somehow my plugin would already know which nodes are not filtered by other plugins/taints/label/resources selector, and use this information to filter these nodes when creating map {cluster_name: []node}.

Do you guys know to can I improve this plugin and maybe even upstream it?

Any other relevant information

No response

JesseStutler commented 2 months ago

Maybe you can add your logic in https://github.com/volcano-sh/volcano/blob/0843c0d33fccdb85439dc7086dd7cea061070901/pkg/scheduler/framework/session_plugins.go#L630-L647 to add some field in session, and then get it from your plugin. Or you can modify other plugins' predicateFn to implement your feature.

qGentry commented 2 months ago

I believe that it would implicate specific ordering on calling plugins' predicateFn which might be unwanted behavior, because if we'll allow it for one plugin, then we have to allow it for all plugins, and if two plugins have ordering "very last", we'll get ambiguity.

Another way I was thinking of it is the following: In my plugin's predicateFn I can access ssn.predicateFns and iterate through them just like it happens in the snippet you've provided and check if nodes were already filtered by other plugins (basically run every predicateFn except itself). Problem is that if we'll allow to do so and another plugin will do the same, we'll get infinite recursion.

Monokaix commented 2 months ago

Hi @qGentry, thanks for your feedback. The best way is modifying the logic in allocate action and allocate can traverse all node groups and then select a best one that can fit all tasks because allocate knows all the final predict result, which we will implement that based on the https://github.com/volcano-sh/volcano/pull/3388 in next version(ideally). The hack way is that you can put your plugin at the end of scheduler configuration in configMap, if scheduler runs your plugin, it means other pluings returned success and have no disturbance to you. And let plugin be aware of other plugin's behavior is not a good way: )

Monokaix commented 2 months ago

cc @lowang-bh