kubernetes-sigs / jobset

JobSet: a k8s native API for distributed ML training and HPC workloads
https://jobset.sigs.k8s.io/
Apache License 2.0
131 stars 42 forks

Jobset minMember support #621

Open song-william opened 1 month ago

song-william commented 1 month ago

What would you like to be added: Does JobSet support a concept like minMember in PodGroups? We would like pods within a replicatedjob to be scheduled only if resources are available for all pods in the replicated jobs.

Why is this needed:

We have seen workloads that require >=2 pods (e.g., multi-node PyTorch) get only one pod scheduled at first. The scheduled pod then times out waiting for the other pods to be scheduled.

googs1025 commented 1 month ago

/cc

danielvegamyhre commented 1 month ago

We would like pods within a replicatedjob to be scheduled only if resources are available for all pods in the replicated jobs.

@song-william for capacity-aware group scheduling behavior like this, we recommend using Kueue. JobSets are a natively supported workload type in Kueue. Here is an example of how to run a JobSet scheduled/managed via Kueue.

Here is another, more involved example that shows step by step how to run large-scale TPU Multislice training workloads as JobSets managed by Kueue, including how to configure Kueue properly based on the actual accelerator resources available in the cluster.
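For readers without the linked examples at hand, here is a minimal sketch of what a Kueue-managed JobSet manifest might look like, built as a Python dict and printed as JSON (which `kubectl apply -f` also accepts). The function name `kueue_jobset`, the queue name `user-queue`, the image, and the CPU request are illustrative assumptions, not taken from the linked examples. The `kueue.x-k8s.io/queue-name` label is what hands the JobSet to a Kueue LocalQueue.

```python
import json


def kueue_jobset(name, queue, workers, image):
    """Build a JobSet manifest (as a dict) labelled for a Kueue LocalQueue.

    All argument values are placeholders chosen by the caller; nothing
    here talks to a cluster.
    """
    return {
        "apiVersion": "jobset.x-k8s.io/v1alpha2",
        "kind": "JobSet",
        "metadata": {
            "name": name,
            # This label asks Kueue to manage admission of the JobSet.
            "labels": {"kueue.x-k8s.io/queue-name": queue},
        },
        "spec": {
            "replicatedJobs": [
                {
                    "name": "workers",
                    "replicas": 1,
                    "template": {  # a regular batch/v1 Job template
                        "spec": {
                            "parallelism": workers,
                            "completions": workers,
                            "template": {
                                "spec": {
                                    "restartPolicy": "Never",
                                    "containers": [
                                        {
                                            "name": "trainer",
                                            "image": image,
                                            # Requests let Kueue account for quota.
                                            "resources": {"requests": {"cpu": "1"}},
                                        }
                                    ],
                                }
                            },
                        }
                    },
                }
            ]
        },
    }


if __name__ == "__main__":
    # Hypothetical names: "user-queue" must match a LocalQueue in your namespace.
    manifest = kueue_jobset("pytorch-demo", "user-queue", 2, "example.com/trainer:latest")
    print(json.dumps(manifest, indent=2))
```

Piping the output into `kubectl apply -f -` would submit it, assuming JobSet and Kueue are installed and a matching LocalQueue exists.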

danielvegamyhre commented 1 month ago

Closing for now since there seems to be no follow up question. Feel free to re-open if you want to discuss this further.

song-william commented 1 month ago

@danielvegamyhre We will be leveraging Kueue on our cluster soon. Thanks for the response!

With the Kueue installation, will Kueue guarantee that pods are properly gang-scheduled (e.g., minMember behavior)?

song-william commented 1 month ago

@danielvegamyhre it seems I don't have the permissions to reopen issues. https://stackoverflow.com/a/21333938

googs1025 commented 1 month ago

With the Kueue installation, will Kueue guarantee that pods are properly gang-scheduled (e.g., minMember behavior)?

Perhaps we should move this issue to the Kueue project.

song-william commented 1 month ago

FWIW, we have some simpler clusters where the cluster owners are only interested in proper gang-scheduling (e.g., minMember) without the need for full quota controls (e.g., Kueue). I would have expected JobSet to be able to handle this primitive without requiring a full queue/quota system to be installed.

danielvegamyhre commented 1 month ago

We would like pods within a replicatedjob to be scheduled only if resources are available for all pods in the replicated jobs.

@song-william This is only possible if you implement some form of capacity-aware, all-or-nothing scheduling. This is a fairly complicated endeavor, and it applies to more batch workload types than just JobSet. Therefore, our thinking was that it makes more sense for this feature to live in Kueue, which is agnostic to the workload type, so that one implementation of gang-scheduling can support any batch workload submitted via Kueue.

However, I do understand the hesitancy to add a new, complex dependency to your stack. Maybe we can think about whether it makes sense to support some simple form of gang-scheduling in JobSet for cases like this. cc @alculquicondor
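To make "capacity-aware, all-or-nothing" concrete, here is a toy sketch of the admission decision in Python. It is not how Kueue or any real scheduler is implemented. The `Node` type, the `place_gang` function, and the single CPU dimension are assumptions made purely for illustration. The point is that either every pod of the group gets a placement, or none does.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    free_cpu: int  # free capacity, in whole CPUs (toy model)


def place_gang(nodes, pod_cpu, min_member):
    """Return {pod_index: node_name} if all min_member pods fit, else None.

    Works on a copy of the free capacity, so a failed attempt leaves the
    nodes untouched and no pod is ever placed partially.
    """
    free = {n.name: n.free_cpu for n in nodes}
    placement = {}
    for pod in range(min_member):
        # First-fit: pick the first node with enough room for one more pod.
        target = next((name for name, cpu in free.items() if cpu >= pod_cpu), None)
        if target is None:
            return None  # one pod does not fit, so reject the whole group
        free[target] -= pod_cpu
        placement[pod] = target
    return placement


if __name__ == "__main__":
    cluster = [Node("a", 4), Node("b", 2)]
    print(place_gang(cluster, pod_cpu=2, min_member=3))
    print(place_gang(cluster, pod_cpu=2, min_member=4))
```

A real implementation also has to deal with multiple resource types, concurrent workloads, preemption, and pods that are admitted but never become ready, which is part of why this is a complicated endeavor.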

alculquicondor commented 4 weeks ago

Capacity awareness is not (and shouldn't be) a concern of the JobSet project. This should be achieved by Kueue or other schedulers.