ray-project / kuberay

A toolkit to run Ray applications on Kubernetes
Apache License 2.0

[Bug] Kuberay can not start when enableBatchScheduler=true #2430

Open KunWuLuan opened 2 days ago

KunWuLuan commented 2 days ago

Search before asking

KubeRay Component

ray-operator

What happened + What you expected to happen

When I start the controller with

            "args": [
                "--kubeconfig=/Users/yueming/.kube/config", 
                "--enable-leader-election=false",
                "--leader-election-namespace=default",
                "--enable-batch-scheduler=true",
            ]

in VS Code, the controller is blocked by cache syncing:

{"level":"error","ts":"2024-10-09T19:35:37.389+0800","logger":"controller-runtime.source.EventHandler","msg":"if kind is a CRD, it should be installed before calling Start","kind":"PodGroup.scheduling.volcano.sh","error":"no matches for kind \"PodGroup\" in version \"scheduling.volcano.sh/v1beta1\"","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind).Start.func1.1\n\t/Users/yueming/go/src/gitlab.alibaba-inc.com/eml/kuberay/ray-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/source/kind.go:63\nk8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext.func2\n\t/Users/yueming/go/src/gitlab.alibaba-inc.com/eml/kuberay/ray-operator/vendor/k8s.io/apimachinery/pkg/util/wait/loop.go:87\nk8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext\n\t/Users/yueming/go/src/gitlab.alibaba-inc.com/eml/kuberay/ray-operator/vendor/k8s.io/apimachinery/pkg/util/wait/loop.go:88\nk8s.io/apimachinery/pkg/util/wait.PollUntilContextCancel\n\t/Users/yueming/go/src/gitlab.alibaba-inc.com/eml/kuberay/ray-operator/vendor/k8s.io/apimachinery/pkg/util/wait/poll.go:33\nsigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind).Start.func1\n\t/Users/yueming/go/src/gitlab.alibaba-inc.com/eml/kuberay/ray-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/source/kind.go:56"}

Finally, the controller failed:

{"level":"error","ts":"2024-10-09T19:35:57.493+0800","logger":"setup","msg":"problem running manager","error":"failed to wait for raycluster caches to sync: timed out waiting for cache to be synced for Kind *v1beta1.PodGroup","stacktrace":"main.exitOnError\n\t/Users/yueming/go/src/gitlab.alibaba-inc.com/eml/kuberay/ray-operator/main.go:270\nmain.main\n\t/Users/yueming/go/src/gitlab.alibaba-inc.com/eml/kuberay/ray-operator/main.go:247\nruntime.main\n\t/opt/homebrew/Cellar/go/1.22.5/libexec/src/runtime/proc.go:271"}

There is no PodGroup CRD in my cluster.

After I remove the code https://github.com/ray-project/kuberay/blob/bf21d2d01cf1c931136d869d2c8168aed07bc68c/ray-operator/controllers/ray/batchscheduler/volcano/volcano_scheduler.go#L184-L186, the controller can be started.

Reproduction script

{
    // Use IntelliSense to learn about possible attributes.
    // Hover to view descriptions of existing attributes.
    // For more information, visit: https://go.microsoft.com/fwlink/?linkid=830387
    "version": "0.2.0",
    "configurations": [
        {
            "name": "Launch Package",
            "type": "go",
            "request": "launch",
            "mode": "auto",
            "program": "ray-operator/main.go",
            "cwd": "${workspaceFolder}",
            "args": [
                "--kubeconfig=xxx", 
                "--enable-leader-election=false",
                "--leader-election-namespace=default",
                "--enable-batch-scheduler=true",
            ]
        }
    ]
}

Anything else

No response

Are you willing to submit a PR?

kevin85421 commented 2 days ago

You should install the Volcano scheduler.

KunWuLuan commented 17 hours ago

I just want to enable the batch scheduler, and I have not submitted any job that uses the Volcano scheduler. Do I still need to install Volcano even if I only want to use Yunikorn?

andrewsykim commented 17 hours ago

If you want to use Yunikorn, you should use --batch-scheduler=yunikorn and not --enable-batch-scheduler. We deprecated --enable-batch-scheduler=true in favor of --batch-scheduler=yunikorn|volcano when we added yunikorn support: https://github.com/ray-project/kuberay/pull/2300
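The relationship between the two flags can be sketched as follows. This is a minimal illustration using only the standard library, not KubeRay's actual code: `resolveBatchScheduler` is a hypothetical helper, and both the mapping of the deprecated flag to Volcano (inferred from the PodGroup error above) and the rule that the two flags are mutually exclusive are assumptions.

```go
package main

import (
	"errors"
	"flag"
	"fmt"
)

// resolveBatchScheduler is a hypothetical helper, not part of KubeRay.
// It folds the deprecated boolean flag and the newer named flag into a
// single scheduler name. Assumption: the deprecated flag means "volcano".
func resolveBatchScheduler(enableDeprecated bool, name string) (string, error) {
	switch name {
	case "":
		if enableDeprecated {
			return "volcano", nil
		}
		return "", nil // no batch scheduler requested
	case "yunikorn", "volcano":
		if enableDeprecated {
			// Assumption: setting both flags is treated as a configuration error.
			return "", errors.New("--enable-batch-scheduler and --batch-scheduler are mutually exclusive")
		}
		return name, nil
	default:
		return "", fmt.Errorf("unsupported batch scheduler %q", name)
	}
}

func main() {
	fs := flag.NewFlagSet("ray-operator-sketch", flag.ContinueOnError)
	enable := fs.Bool("enable-batch-scheduler", false, "deprecated: use --batch-scheduler")
	name := fs.String("batch-scheduler", "", "yunikorn or volcano")

	// Parse a fixed argument list so the sketch runs without real CLI input.
	if err := fs.Parse([]string{"--batch-scheduler=yunikorn"}); err != nil {
		fmt.Println("parse error:", err)
		return
	}

	scheduler, err := resolveBatchScheduler(*enable, *name)
	fmt.Println(scheduler, err)
}
```

Under these assumptions, passing only `--enable-batch-scheduler=true` selects Volcano, which would explain why the operator waits for the PodGroup CRD.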

KunWuLuan commented 17 hours ago

Thanks for the reply, @andrewsykim. If I enable --batch-scheduler=yunikorn when using ray-operator, I can still set ray.io/scheduler-name=volcano on RayJob, right? What will happen in this scenario?

andrewsykim commented 17 hours ago

> If I enable --batch-scheduler=yunikorn when using ray-operator, I can still set ray.io/scheduler-name=volcano on RayJob, right?

No I don't think so, KubeRay only supports one batch scheduler at a time. What is your use-case? Are you trying to use both Yunikorn and Volcano?

kadisi commented 17 hours ago

> > If I enable --batch-scheduler=yunikorn when using ray-operator, I can still set ray.io/scheduler-name=volcano on RayJob, right?
>
> No I don't think so, KubeRay only supports one batch scheduler at a time. What is your use-case? Are you trying to use both Yunikorn and Volcano?

@andrewsykim If KubeRay is started with the --batch-scheduler parameter, the user should not need to set the RayCluster label ray.io/scheduler-name=***. Judging from the latest code, though, it's still the old logic: if we enable --batch-scheduler=yunikorn when using ray-operator, I can still set ray.io/scheduler-name=volcano.

KunWuLuan commented 17 hours ago

We provide a managed ray-operator for our users. Submitting RayJobs is up to our users, and they may make mistakes; no event is sent when the wrong schedulerName is set on a RayJob. If only one batch scheduler is supported at a time, why do we let users choose the scheduler on a RayJob through ray.io/scheduler-name? Maybe they just need a label like ray.io/enable-batch-scheduler on the RayJob. What do you think?
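The kind of check being asked for could look like the sketch below. It is only an illustration of the idea, not existing KubeRay behavior: `checkSchedulerLabel` is a hypothetical function, and reporting the mismatch as an error (which a controller could turn into an event) is an assumption.

```go
package main

import "fmt"

// Label name taken from the discussion above.
const schedulerNameLabel = "ray.io/scheduler-name"

// checkSchedulerLabel is a hypothetical validation helper, not part of KubeRay.
// It returns an error when a resource's scheduler label names a scheduler
// other than the one the operator was configured with.
func checkSchedulerLabel(configured string, labels map[string]string) error {
	requested, ok := labels[schedulerNameLabel]
	if !ok || requested == configured {
		return nil // label absent or consistent with the operator's setting
	}
	return fmt.Errorf("label %s=%q does not match the operator's batch scheduler %q",
		schedulerNameLabel, requested, configured)
}

func main() {
	// A RayJob labeled for Volcano while the operator runs with Yunikorn.
	labels := map[string]string{schedulerNameLabel: "volcano"}
	if err := checkSchedulerLabel("yunikorn", labels); err != nil {
		// A real controller might record a warning event here instead.
		fmt.Println("warning:", err)
	}
}
```

With a check like this, a mislabeled RayJob would surface a visible warning instead of being silently ignored.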