kubeflow / training-operator

Distributed ML Training and Fine-Tuning on Kubernetes
https://www.kubeflow.org/docs/components/training
Apache License 2.0

Support richer volcano scheduling #2182

Open shaoqingyang opened 3 months ago

shaoqingyang commented 3 months ago

What would you like to be added?

The current training job CRDs, such as TFJob, cannot set Volcano queues and priorities; this could be supported through annotations or other mechanisms.
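
For illustration, a purely hypothetical annotation form of what is being requested (these annotation keys do not exist in the operator today; they are only a sketch):

metadata:
  annotations:
    kubeflow.org/scheduling-queue: team-a-queue            # hypothetical key
    kubeflow.org/scheduling-priority-class: high-priority  # hypothetical key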

Why is this needed?

I need this for my training jobs.

Love this feature?

Give it a πŸ‘ We prioritize the features with most πŸ‘

andreyvelich commented 3 months ago

Thanks for creating this @shaoqingyang! I think we already have APIs to specify the queue and priority for integration with the Volcano scheduler: https://github.com/kubeflow/training-operator/blob/master/pkg/apis/kubeflow.org/v1/common_types.go#L231
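
For example, a minimal sketch of how those runPolicy.schedulingPolicy fields can be used on a TFJob (the queue and PriorityClass names below are placeholders and must already exist in your cluster):

apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: tfjob-with-volcano           # placeholder name
spec:
  runPolicy:
    schedulingPolicy:
      queue: team-a-queue            # Volcano queue, assumed to exist
      priorityClass: high-priority   # PriorityClass, assumed to exist
      minAvailable: 3                # gang-schedule only when 3 pods can be placed
  tfReplicaSpecs:
    ...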

Won't that work for you?

cc @lowang-bh

Electronic-Waste commented 1 month ago

I have a question about https://www.kubeflow.org/docs/components/training/user-guides/job-scheduling/

For Volcano and the scheduler-plugins, we need to configure the training-operator with:

...
    spec:
      containers:
        - command:
            - /manager
+           - --gang-scheduler-name=volcano
          image: kubeflow/training-operator
          name: training-operator
...

But with Kueue, we only need to specify a label in the metadata, without modifying the training-operator configuration, which is simpler and more user-friendly:

metadata:
  labels:
    kueue.x-k8s.io/queue-name: user-queue

May I ask why we didn't implement a unified scheduling framework for these three schedulers? What prevents us from doing so?

Also, the current ManagedBy field in RunPolicy only supports kubeflow.org/training-operator and kueue.x-k8s.io/multikueue. Maybe this will puzzle users and make them think we only support Kueue (or is it just me)?
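
For reference, this is how the field appears in a job spec today; only these two values are accepted:

spec:
  runPolicy:
    managedBy: kueue.x-k8s.io/multikueue   # or kubeflow.org/training-operator (the default)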

PTAL if you have time πŸ‘€ @kubeflow/wg-training-leads