ray-project / kuberay

A toolkit to run Ray applications on Kubernetes
Apache License 2.0

[Bug] "enable-batch-scheduler" bool flag is not working for schedulers other than Volcano #2185

Open yangwwei opened 2 weeks ago

yangwwei commented 2 weeks ago

Search before asking

KubeRay Component

ray-operator

What happened + What you expected to happen

It is great to see that the Ray operator supports the enable-batch-scheduler option and works with Kubernetes batch schedulers. However, a bool flag is not well suited to supporting more than one scheduler. While working on https://github.com/ray-project/kuberay/pull/2184, I found that whenever I set "enable-batch-scheduler: true", the controller fails to start because it cannot load the Volcano APIs. The framework shouldn't assume that Volcano is the only option.

The controller fails with the following error:

{"level":"info","ts":"2024-06-10T05:12:40.428Z","logger":"setup","msg":"Feature flag enable-batch-scheduler is enabled."}
...
{"level":"error","ts":"2024-06-10T05:12:56.218Z","logger":"controller-runtime.source.EventHandler","msg":"failed to get informer from cache","error":"failed to get API group resources: unable to retrieve the complete list of server APIs: scheduling.volcano.sh/v1beta1: the server could not find the requested resource","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind).Start.func1.1\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/source/kind.go:68\nk8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext.func1\n\t/go/pkg/mod/k8s.io/apimachinery@v0.28.4/pkg/util/wait/loop.go:49\nk8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext\n\t/go/pkg/mod/k8s.io/apimachinery@v0.28.4/pkg/util/wait/loop.go:50\nk8s.io/apimachinery/pkg/util/wait.PollUntilContextCancel\n\t/go/pkg/mod/k8s.io/apimachinery@v0.28.4/pkg/util/wait/poll.go:33\nsigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind).Start.func1\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/source/kind.go:56"}
...
{"level":"error","ts":"2024-06-10T05:14:56.329Z","logger":"setup","msg":"problem running manager","error":"failed to wait for raycluster caches to sync: timed out waiting for cache to be synced for Kind *v1beta1.PodGroup","stacktrace":"main.exitOnError\n\t/workspace/main.go:247\nmain.main\n\t/workspace/main.go:230\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:250"}

This is caused by the following code: https://github.pie.apple.com/apple-ray/kuberay/blob/a671cbdebded64a6e444c692c2ab9be2ccd6095c/ray-operator/controllers/ray/batchscheduler/volcano/volcano_scheduler.go#L181-L183.

Reproduction script

Install KubeRay with "enable-batch-scheduler=true" on a cluster where Volcano is not installed.

Anything else

Ray should support other batch scheduler options. It shouldn't assume that the batch scheduler is always Volcano when the flag is enabled.

For example, the option could be changed to something like:

batch-scheduler: Set the name of a Kubernetes batch scheduler so that Ray works with a non-default scheduler. Currently supported values are volcano, yunikorn (once integrated), and ANOTHER_SUPPORTED_SCHEDULER_NAME.

This would make it easy to add more supported batch schedulers in the future. There is also no valid scenario in which someone would use more than one batch scheduler at the same time, so the option should accept exactly one name.
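To make the proposal concrete, here is a minimal sketch of how a name-based option could select exactly one scheduler. It is not KubeRay's actual code: the BatchScheduler interface, the registry map, and selectScheduler are hypothetical names invented for illustration, and the factories are placeholders for real integrations. The only parts taken from this issue are the proposed flag name batch-scheduler and the scheduler names volcano and yunikorn.

```go
package main

import (
	"flag"
	"fmt"
	"os"
	"sort"
	"strings"
)

// BatchScheduler is a hypothetical stand-in for a scheduler integration.
// KubeRay's real interface may look different.
type BatchScheduler interface {
	Name() string
}

// namedScheduler is a placeholder implementation used only in this sketch.
type namedScheduler string

func (n namedScheduler) Name() string { return string(n) }

// registry maps a scheduler name to a factory (hypothetical). A new
// scheduler is supported by adding one entry here.
var registry = map[string]func() (BatchScheduler, error){
	"volcano":  func() (BatchScheduler, error) { return namedScheduler("volcano"), nil },
	"yunikorn": func() (BatchScheduler, error) { return namedScheduler("yunikorn"), nil },
}

// supportedSchedulers returns the registered names in sorted order.
func supportedSchedulers() []string {
	names := make([]string, 0, len(registry))
	for name := range registry {
		names = append(names, name)
	}
	sort.Strings(names)
	return names
}

// selectScheduler returns the scheduler registered under name.
// An empty name means "use the default Kubernetes scheduler" and yields nil.
func selectScheduler(name string) (BatchScheduler, error) {
	if name == "" {
		return nil, nil
	}
	newScheduler, ok := registry[name]
	if !ok {
		return nil, fmt.Errorf("unsupported batch scheduler %q, supported values: %s",
			name, strings.Join(supportedSchedulers(), ", "))
	}
	return newScheduler()
}

func main() {
	fs := flag.NewFlagSet("operator", flag.ExitOnError)
	name := fs.String("batch-scheduler", "",
		"Name of the Kubernetes batch scheduler to integrate with (empty: default scheduler).")
	_ = fs.Parse(os.Args[1:])

	scheduler, err := selectScheduler(*name)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	if scheduler == nil {
		fmt.Println("no batch scheduler selected; using the default Kubernetes scheduler")
		return
	}
	// Only the selected scheduler's resources would be registered and watched.
	fmt.Printf("batch scheduler %q selected\n", scheduler.Name())
}
```

With this shape, the controller would only set up watches for the one scheduler that was named, so selecting yunikorn would never require the Volcano PodGroup API to be present on the cluster. An unknown name fails fast at startup with a list of supported values rather than timing out on cache sync.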

Are you willing to submit a PR?