kubernetes-sigs / jobset

JobSet: a k8s native API for distributed ML training and HPC workloads
https://jobset.sigs.k8s.io/
Apache License 2.0

Add concept doc on how exclusive placement works #458

Open danielvegamyhre opened 6 months ago

danielvegamyhre commented 6 months ago

What would you like to be added:

Add concept doc on how exclusive placement works

Why is this needed:

Users often ask how exclusive placement works, and a concept doc would help them debug issues on their own.

k8s-triage-robot commented 3 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 2 months ago

/lifecycle rotten

danielvegamyhre commented 2 months ago

/remove-lifecycle stale

danielvegamyhre commented 2 months ago

/remove-lifecycle rotten

googs1025 commented 2 months ago

/assign

I will try this!

googs1025 commented 2 months ago

Is this issue meant to demonstrate how to use and operate, or is it to be explained through code?

danielvegamyhre commented 2 months ago

> Is this issue meant to demonstrate how to use and operate, or is it to be explained through code?

It's meant to explain how exclusive placement is implemented. That is, a leader pod (index 0) is created for each Job, with pod affinity/anti-affinity rules that ensure only one leader pod lands in each topology domain (e.g., rack, node pool, etc.). All follower pods (non-zero indexes) are blocked from creation until their corresponding leader pod has been scheduled. Once a Job's leader pod is scheduled, the follower pods in that Job are created with injected nodeSelectors, which ensure they land in the same topology domain as the leader.
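
To make that flow concrete, here is a toy Python sketch of the leader/follower ordering. It is only an illustration under stated assumptions, not the JobSet controller code: the function names, the `example.com/rack` topology key, and the "first free domain" rule standing in for the scheduler's affinity/anti-affinity evaluation are all hypothetical.

```python
from typing import Dict, List, Optional


def schedule_leader(job: str, domains: List[str],
                    owners: Dict[str, str]) -> Optional[str]:
    """Toy stand-in for scheduling a leader pod (index 0).

    Anti-affinity is modeled as "a domain may hold at most one leader":
    the leader takes the first domain that no other Job's leader owns.
    """
    for domain in domains:
        if domain not in owners:
            owners[domain] = job
            return domain
    return None  # no free domain: the leader stays pending


def follower_node_selector(topology_key: str, domain: str) -> Dict[str, str]:
    """nodeSelector injected into follower pods (non-zero indexes) so they
    land in the same topology domain as their leader."""
    return {topology_key: domain}


def place_jobs(jobs: List[str], domains: List[str],
               topology_key: str) -> Dict[str, Optional[Dict[str, str]]]:
    """Return, per Job, the follower nodeSelector, or None if the leader is
    unscheduled (in which case follower creation stays blocked)."""
    owners: Dict[str, str] = {}  # topology domain -> Job whose leader holds it
    result: Dict[str, Optional[Dict[str, str]]] = {}
    for job in jobs:
        domain = schedule_leader(job, domains, owners)
        if domain is None:
            result[job] = None  # followers are not created yet
        else:
            result[job] = follower_node_selector(topology_key, domain)
    return result


if __name__ == "__main__":
    placement = place_jobs(
        ["job-a", "job-b", "job-c"],   # three Jobs
        ["rack-1", "rack-2"],          # only two topology domains
        "example.com/rack",            # hypothetical topology label key
    )
    for job, selector in placement.items():
        print(job, selector)
```

With three Jobs and only two domains, the third Job's leader cannot be placed in this model, so its followers are never created. That is the kind of situation the concept doc should help users recognize when debugging pending pods.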

googs1025 commented 2 months ago

OK, I understand. I will also demonstrate with examples.