KunWuLuan opened 1 month ago
You should install the Volcano scheduler.
I just want to enable the batch scheduler, and I have not submitted any job using the Volcano scheduler. Do I still need to install Volcano even if I just want to use YuniKorn?
If you want to use YuniKorn, you should use --batch-scheduler=yunikorn and not --enable-batch-scheduler. We deprecated --enable-batch-scheduler=true in favor of --batch-scheduler=yunikorn|volcano when we added YuniKorn support: https://github.com/ray-project/kuberay/pull/2300
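For example, with the Helm chart this would look roughly like the following (a sketch assuming the batchScheduler.name value exposed by recent kuberay-operator chart versions; verify the exact key against your chart's values.yaml):

```yaml
# values.yaml fragment (sketch): selects YuniKorn as the batch scheduler
# instead of the deprecated batchScheduler.enabled flag.
batchScheduler:
  name: yunikorn
```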
Thanks for the reply @andrewsykim. If I enable --batch-scheduler=yunikorn when using ray-operator, I can still set ray.io/scheduler-name=volcano on a RayJob, right? What will happen in this scenario?
If I enable --batch-scheduler=yunikorn when using ray-operator, I can still set ray.io/scheduler-name=volcano on RayJob, right?
No I don't think so, KubeRay only supports one batch scheduler at a time. What is your use-case? Are you trying to use both Yunikorn and Volcano?
@andrewsykim If we set the --batch-scheduler parameter on KubeRay, there is no need for the user to set the RayCluster label ray.io/scheduler-name=***. Judging from the latest code, it's still the old logic. If we enable --batch-scheduler=yunikorn when using ray-operator, we can still set ray.io/scheduler-name=volcano.
We provide a managed ray-operator for our users. Submitting RayJobs is the job of our users; they may make mistakes, and no event is sent when the wrong schedulerName is set on a RayJob. Since we only support one batch scheduler at a time, why let users choose the scheduler on the RayJob via ray.io/scheduler-name? Maybe they just need a label like ray.io/enable-batch-scheduler on the RayJob.
What do you think?
cc @MortalHappiness can you take a look?
@kevin85421 I was wondering if we should take this opportunity to go ahead and remove this deprecated --enable-batch-scheduler=true flag?
If we don't remove it, we'll need to define the exact behavior required here. What if --enable-batch-scheduler=true was set and volcano was initially not installed but installed later?
By the way, RayJob currently doesn't support Volcano and YuniKorn. At the moment, Volcano and YuniKorn are only supported in RayCluster. If you want to use advanced scheduling for RayJob / RayCluster, you should use Kueue for now. cc @KunWuLuan @kadisi
We provide managed ray-operator for our users. Submitting RayJob is the job of our users, they may make mistakes, and no event is sent when the wrong schedulerName is set on the RayJobs. Once we only support one batch scheduler, why we let user choose scheduler on RayJob by ray.io/scheduler-name? Maybe they just need to use label like ray.io/enable-batch-scheduler on RayJob.
enable-batch-scheduler was deprecated in KubeRay v1.2. If it is specified, Volcano will always be used, regardless of the value of ray.io/scheduler-name or whether ray.io/scheduler-name is specified at all, based on: https://github.com/ray-project/kuberay/blob/8b61b73bc57d7776ccac8b88b215df22aefaabbd/ray-operator/controllers/ray/batchscheduler/schedulermanager.go#L73
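The selection behavior described above can be sketched as follows (a simplified illustration, not the actual KubeRay code; selectScheduler is a hypothetical helper standing in for the logic in schedulermanager.go):

```go
package main

import "fmt"

// selectScheduler sketches the flag precedence: the deprecated
// enable-batch-scheduler flag forces Volcano and ignores the
// ray.io/scheduler-name label, while the newer --batch-scheduler flag
// selects whichever scheduler was configured on the operator.
func selectScheduler(enableBatchScheduler bool, batchSchedulerName string) string {
	if enableBatchScheduler {
		// Legacy path: Volcano is always used; labels are ignored.
		return "volcano"
	}
	// New path: the scheduler configured via --batch-scheduler.
	return batchSchedulerName
}

func main() {
	fmt.Println(selectScheduler(true, ""))          // legacy flag set: volcano
	fmt.Println(selectScheduler(false, "yunikorn")) // new flag: yunikorn
}
```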
The issue is that the RBAC in the KubeRay v1.2.2 Helm chart doesn't update correctly. As a result, if enable-batch-scheduler is not set, the RBAC for PodGroup will not be installed.
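For reference, the missing RBAC is roughly a rule like the following (a hedged sketch: the ClusterRole name is illustrative, and the exact rules should be checked against the chart templates; the API group and resource come from the Volcano PodGroup CRD):

```yaml
# Sketch of the ClusterRole rule the chart needs so the operator can
# manage Volcano PodGroups.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kuberay-operator-volcano  # illustrative name
rules:
- apiGroups: ["scheduling.volcano.sh"]
  resources: ["podgroups"]
  verbs: ["get", "list", "watch", "create", "update", "delete"]
```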
Install KubeRay operator
helm install kuberay-operator kuberay/kuberay-operator --version 1.2.2 --set batchScheduler.enabled=true
Install RayCluster with ray.io/scheduler-name: abc
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: test-cluster-0
  labels:
    ray.io/scheduler-name: abc
spec:
  rayVersion: '2.9.0'
  headGroupSpec:
    rayStartParams: {}
    template:
      spec:
        containers:
        - name: ray-head
          image: rayproject/ray:2.9.0
          resources:
            limits:
              cpu: "1"
              memory: "2Gi"
            requests:
              cpu: "1"
              memory: "2Gi"
  workerGroupSpecs: []
Check whether PodGroup is created correctly or not.
Search before asking
KubeRay Component
ray-operator
What happened + What you expected to happen
When I start the controller with
in VS Code, the controller is blocked by cache syncing:
Finally, the controller failed:
There is no PodGroup CRD in my cluster.
After I remove the code at https://github.com/ray-project/kuberay/blob/bf21d2d01cf1c931136d869d2c8168aed07bc68c/ray-operator/controllers/ray/batchscheduler/volcano/volcano_scheduler.go#L184-L186, the controller starts successfully.
Reproduction script
Anything else
No response
Are you willing to submit a PR?