Open shaoqingyang opened 3 months ago
Thanks for creating this @shaoqingyang! I think we already have these APIs to specify queues and priorities for integration with the Volcano scheduler: https://github.com/kubeflow/training-operator/blob/master/pkg/apis/kubeflow.org/v1/common_types.go#L231
Won't that work for you?
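For illustration, here is a minimal sketch of how those fields might be set on a TFJob. It assumes the `schedulingPolicy` block under `runPolicy` exposes `queue`, `priorityClass`, and `minAvailable`, and that the operator was started with a gang scheduler enabled; the job name, queue name, priority class, and image are hypothetical placeholders.

```yaml
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: tfjob-queue-example            # hypothetical job name
spec:
  runPolicy:
    schedulingPolicy:
      queue: training-queue            # hypothetical Volcano queue, must already exist
      priorityClass: high-priority     # hypothetical PriorityClass, must already exist
      minAvailable: 2                  # pods that must be schedulable together
  tfReplicaSpecs:
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: tensorflow
            image: example.com/my-training-image:latest   # placeholder image
```

These settings likely have no effect unless gang scheduling is enabled on the training-operator, so they are not a drop-in equivalent of a per-job label.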
cc @lowang-bh
I have a question about https://www.kubeflow.org/docs/components/training/user-guides/job-scheduling/
For Volcano and scheduler-plugins, we need to configure the training-operator with:
...
spec:
  containers:
  - command:
    - /manager
+   - --gang-scheduler-name=volcano
    image: kubeflow/training-operator
    name: training-operator
...
But when it comes to Kueue, we only need to specify a label in the job metadata, without modifying the training-operator configuration, which is simpler and more user-friendly:
metadata:
  labels:
    kueue.x-k8s.io/queue-name: user-queue
May I ask why we didn't implement a unified scheduling framework for these three schedulers? What prevents us from implementing such a unified scheduling framework?
Also, the current ManagedBy field in RunPolicy only supports kubeflow.org/training-operator and kueue.x-k8s.io/multikueue. This may puzzle users and make them think we only support Kueue (or is that just me?).
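As a rough sketch of what this looks like in a job spec, assuming the ManagedBy field is serialized as `managedBy` under `runPolicy` (the rest of the job is omitted):

```yaml
spec:
  runPolicy:
    # One of the two values mentioned above; assumed to hand the job
    # over to Kueue's MultiKueue controller.
    managedBy: kueue.x-k8s.io/multikueue
```

The other value mentioned above, `kubeflow.org/training-operator`, would presumably leave the job with the built-in controller.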
PTAL if you have time @kubeflow/wg-training-leads
What you would like to be added?
The current training job kinds, such as TFJob, cannot set a queue or a priority. This could be supported through annotations or some other mechanism.
Why is this needed?
I need this to train my data.
Love this feature?
Give it a 👍 We prioritize the features with the most 👍