kubernetes-sigs / kueue

Kubernetes-native Job Queueing
https://kueue.sigs.k8s.io
Apache License 2.0

TAS: API to support rank-based ordering for custom CRDs #3663

Open mimowo opened 2 days ago

mimowo commented 2 days ago

What would you like to be added:

An API that lets custom CRD jobs declare their own PodIndex labels, so that in-house Jobs have no incentive to reuse labels reserved for Kubernetes.

Why is this needed:

Completion requirements:

mimowo commented 2 days ago

The proposal is to extend the workload PodSetTopologyRequest API with the following fields:

// PodIndexLabel indicates the name of the label indexing the pods.
// For example, in the context of
// - Kubernetes Job this is: batch.kubernetes.io/job-completion-index
// - JobSet: batch.kubernetes.io/job-completion-index (inherited from Job)
// - Kubeflow: training.kubeflow.org/replica-index
PodIndexLabel *string

// SubGroupIndexLabel indicates the name of the label indexing the instances of replicated Jobs (groups)
// within a PodSet. For example, in the context of JobSet this is jobset.sigs.k8s.io/job-index.
SubGroupIndexLabel *string

// SubGroupCount indicates the count of replicated Jobs (groups) within a PodSet.
// For example, in the context of JobSet this value is read from jobset.sigs.k8s.io/replicatedjob-replicas.
SubGroupCount *int32

The values could then be set when implementing the PodSets() function of the GenericJob interface, via the PodSetTopologyRequest helper function, like here.
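As a rough illustration of how an integration might fill in the proposed fields, here is a minimal, self-contained sketch. The struct is a trimmed-down stand-in for the real Kueue type, and `newTopologyRequest`, `ptr`, and the `example.com/...` label names are hypothetical, introduced only for this example. It is not the existing helper's signature.

```go
package main

import "fmt"

// PodSetTopologyRequest is a local stand-in that holds only the three
// fields proposed in this issue (assumption: the real type has more).
type PodSetTopologyRequest struct {
	PodIndexLabel      *string
	SubGroupIndexLabel *string
	SubGroupCount      *int32
}

// ptr returns a pointer to v; used to fill the optional fields.
func ptr[T any](v T) *T { return &v }

// newTopologyRequest is a hypothetical helper. Empty label names and a
// non-positive count leave the corresponding optional fields unset.
func newTopologyRequest(podIndexLabel, subGroupIndexLabel string, subGroupCount int32) *PodSetTopologyRequest {
	req := &PodSetTopologyRequest{}
	if podIndexLabel != "" {
		req.PodIndexLabel = ptr(podIndexLabel)
	}
	// The sub-group fields only make sense together, so set both or neither.
	if subGroupIndexLabel != "" && subGroupCount > 0 {
		req.SubGroupIndexLabel = ptr(subGroupIndexLabel)
		req.SubGroupCount = ptr(subGroupCount)
	}
	return req
}

func main() {
	// A custom CRD with a flat list of pods and its own index label.
	flat := newTopologyRequest("example.com/pod-index", "", 0)
	fmt.Println(*flat.PodIndexLabel, flat.SubGroupIndexLabel == nil)

	// A custom CRD whose PodSet is split into 3 replicated groups.
	grouped := newTopologyRequest("example.com/pod-index", "example.com/group-index", 3)
	fmt.Println(*grouped.SubGroupIndexLabel, *grouped.SubGroupCount)
}
```

Keeping all three fields as pointers mirrors the proposal: a job type without indexed pods can leave everything unset.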

The TopologyUngater could then read these values from the API instead of performing the lookups itself.
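To make the ungater side concrete, the sketch below derives a pod's rank purely from the proposed fields and the pod's labels. It assumes that pods are split evenly across sub-groups and that the rank is `groupIndex * podsPerGroup + podIndex`. That formula, the `podRank` and `readIndex` functions, and the `example.com/...` labels are illustrative assumptions, not the actual TopologyUngater code.

```go
package main

import (
	"fmt"
	"strconv"
)

// PodSetTopologyRequest is a local stand-in that holds only the three
// fields proposed in this issue (assumption: the real type has more).
type PodSetTopologyRequest struct {
	PodIndexLabel      *string
	SubGroupIndexLabel *string
	SubGroupCount      *int32
}

// readIndex parses a non-negative integer index from the named label.
func readIndex(labels map[string]string, name string) (int, error) {
	raw, ok := labels[name]
	if !ok {
		return 0, fmt.Errorf("label %q not found", name)
	}
	idx, err := strconv.Atoi(raw)
	if err != nil {
		return 0, fmt.Errorf("label %q: %w", name, err)
	}
	if idx < 0 {
		return 0, fmt.Errorf("label %q: negative index %d", name, idx)
	}
	return idx, nil
}

// podRank is a hypothetical function computing a pod's rank within its PodSet.
// podSetCount is the total number of pods in the PodSet.
func podRank(req PodSetTopologyRequest, labels map[string]string, podSetCount int32) (int, error) {
	if req.PodIndexLabel == nil {
		return 0, fmt.Errorf("PodIndexLabel is not set")
	}
	podIdx, err := readIndex(labels, *req.PodIndexLabel)
	if err != nil {
		return 0, err
	}
	// Without sub-groups the pod index is already the rank.
	if req.SubGroupIndexLabel == nil || req.SubGroupCount == nil || *req.SubGroupCount <= 0 {
		return podIdx, nil
	}
	groupIdx, err := readIndex(labels, *req.SubGroupIndexLabel)
	if err != nil {
		return 0, err
	}
	// Assumption: every sub-group holds the same number of pods.
	podsPerGroup := int(podSetCount / *req.SubGroupCount)
	return groupIdx*podsPerGroup + podIdx, nil
}

func main() {
	podLabel, groupLabel, groups := "example.com/pod-index", "example.com/group-index", int32(3)
	req := PodSetTopologyRequest{
		PodIndexLabel:      &podLabel,
		SubGroupIndexLabel: &groupLabel,
		SubGroupCount:      &groups,
	}
	labels := map[string]string{podLabel: "1", groupLabel: "2"}
	rank, err := podRank(req, labels, 6)
	fmt.Println(rank, err)
}
```

With 6 pods in 3 groups, each group holds 2 pods, so the pod with group index 2 and pod index 1 gets rank 5.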

mimowo commented 2 days ago

cc @PBundyra @tenzen-y @mwielgus @mwysokin