Search before asking
[X] I searched the issues and found no similar issues.
KubeRay Component
ray-operator
What happened + What you expected to happen
It is great to see that the Ray operator supports the enable-batch-scheduler option and works with Kubernetes batch schedulers. However, a boolean flag is not well suited to supporting more than one scheduler. While working on https://github.com/ray-project/kuberay/pull/2184, I found that whenever I set "enable-batch-scheduler: true", the controller fails to start because it cannot load the Volcano APIs. The framework shouldn't assume that Volcano is the only option.
The controller fails with the following error:
{"level":"info","ts":"2024-06-10T05:12:40.428Z","logger":"setup","msg":"Feature flag enable-batch-scheduler is enabled."}
...
{"level":"error","ts":"2024-06-10T05:12:56.218Z","logger":"controller-runtime.source.EventHandler","msg":"failed to get informer from cache","error":"failed to get API group resources: unable to retrieve the complete list of server APIs: scheduling.volcano.sh/v1beta1: the server could not find the requested resource","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind).Start.func1.1\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/source/kind.go:68\nk8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext.func1\n\t/go/pkg/mod/k8s.io/apimachinery@v0.28.4/pkg/util/wait/loop.go:49\nk8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext\n\t/go/pkg/mod/k8s.io/apimachinery@v0.28.4/pkg/util/wait/loop.go:50\nk8s.io/apimachinery/pkg/util/wait.PollUntilContextCancel\n\t/go/pkg/mod/k8s.io/apimachinery@v0.28.4/pkg/util/wait/poll.go:33\nsigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind).Start.func1\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/source/kind.go:56"}
...
{"level":"error","ts":"2024-06-10T05:14:56.329Z","logger":"setup","msg":"problem running manager","error":"failed to wait for raycluster caches to sync: timed out waiting for cache to be synced for Kind *v1beta1.PodGroup","stacktrace":"main.exitOnError\n\t/workspace/main.go:247\nmain.main\n\t/workspace/main.go:230\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:250"}
Reproduction script
Install KubeRay with "enable-batch-scheduler=true" on a cluster where Volcano is not installed.
Anything else
Ray should support other batch scheduler options; it shouldn't assume that the batch scheduler is always Volcano when the flag is enabled.
For example, the option could be changed to something like: batch-scheduler: Set a Kubernetes batch scheduler name to let Ray work with a non-default scheduler. Currently supported values: volcano, yunikorn (once integrated), and ANOTHER_SUPPORTED_SCHEDULER_NAME. This would make it easy to add more batch schedulers in the future. Also, there is no valid scenario in which people would use more than one batch scheduler at the same time, so the option should accept exactly one name.
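To make the proposal concrete, here is a minimal Go sketch of a string-valued batch-scheduler flag that accepts a single scheduler name. It is an illustration under assumptions, not the actual KubeRay implementation: the names supportedBatchSchedulers and validateBatchScheduler are hypothetical, and yunikorn is listed only because the proposal mentions it as a future option.

```go
package main

import (
	"flag"
	"fmt"
	"os"
	"sort"
	"strings"
)

// supportedBatchSchedulers is a hypothetical registry of scheduler names.
// "yunikorn" is included only as the not-yet-integrated example from the proposal.
var supportedBatchSchedulers = map[string]bool{
	"volcano":  true,
	"yunikorn": true,
}

// validateBatchScheduler accepts an empty name (batch scheduling disabled)
// or exactly one supported scheduler name, and rejects everything else.
func validateBatchScheduler(name string) error {
	if name == "" || supportedBatchSchedulers[name] {
		return nil
	}
	names := make([]string, 0, len(supportedBatchSchedulers))
	for n := range supportedBatchSchedulers {
		names = append(names, n)
	}
	sort.Strings(names)
	return fmt.Errorf("unsupported batch scheduler %q, supported: %s",
		name, strings.Join(names, ", "))
}

func main() {
	fs := flag.NewFlagSet("operator", flag.ContinueOnError)
	// A string flag instead of a bool: the value names the one scheduler to use.
	batchScheduler := fs.String("batch-scheduler", "",
		"Name of the Kubernetes batch scheduler to integrate with (empty disables it).")
	if err := fs.Parse(os.Args[1:]); err != nil {
		os.Exit(2)
	}
	if err := validateBatchScheduler(*batchScheduler); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	if *batchScheduler == "" {
		fmt.Println("batch scheduling disabled")
		return
	}
	// Only the selected scheduler's APIs would need to be loaded from here on.
	fmt.Printf("using batch scheduler %q\n", *batchScheduler)
}
```

With a flag of this shape, the operator would only need to register and watch the APIs of the selected scheduler, so enabling one scheduler would not require another scheduler's CRDs to be installed.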
The controller startup failure described above is caused by this code: https://github.pie.apple.com/apple-ray/kuberay/blob/a671cbdebded64a6e444c692c2ab9be2ccd6095c/ray-operator/controllers/ray/batchscheduler/volcano/volcano_scheduler.go#L181-L183.