ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

KubeRay integration with Volcano, exception #42870

Open huzhennan opened 9 months ago

huzhennan commented 9 months ago

What happened + What you expected to happen

The kuberay-operator restarts continuously and reports the following panic:
`2024-01-31T10:54:43.616Z INFO controller.raycluster-controller Starting Controller {"reconciler group": "ray.io", "reconciler kind": "RayCluster"}
2024-01-31T10:54:43.879Z INFO controller.raycluster-controller Starting workers {"reconciler group": "ray.io", "reconciler kind": "RayCluster", "worker count": 1}
2024-01-31T10:54:43.879Z INFO controller.rayservice Starting workers {"reconciler group": "ray.io", "reconciler kind": "RayService", "worker count": 1}
2024-01-31T10:54:43.879Z INFO controller.rayjob Starting workers {"reconciler group": "ray.io", "reconciler kind": "RayJob", "worker count": 1}
2024-01-31T10:54:43.879Z INFO controllers.RayCluster reconciling RayCluster {"cluster name": "test-cluster-0"}
2024-01-31T10:54:43.879Z INFO controllers.RayCluster Reconciling Ingress
2024-01-31T10:54:43.879Z INFO controllers.RayCluster reconcileHeadService {"1 head service found": "test-cluster-0-head-svc"}
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x13cfc48]
goroutine 442 [running]:
github.com/ray-project/kuberay/ray-operator/controllers/ray/batchscheduler/volcano.(*VolcanoBatchScheduler).DoBatchSchedulingOnSubmission(0xc0006fe436?, 0xc0003d8000)
    /workspace/controllers/ray/batchscheduler/volcano/volcano_scheduler.go:55 +0xe8
github.com/ray-project/kuberay/ray-operator/controllers/ray.(*RayClusterReconciler).reconcilePods(0xc000140280, {0x1984ed8, 0xc0000010e0}, 0xc0003d8000)
    /workspace/controllers/ray/raycluster_controller.go:557 +0x23f
github.com/ray-project/kuberay/ray-operator/controllers/ray.(*RayClusterReconciler).rayClusterReconcile(0xc000140280, {0x1984ed8, 0xc0000010e0}, {{{0xc0004a7647, 0x7}, {0xc0004a76d0, 0xe}}}, 0xc0003d8000)
    /workspace/controllers/ray/raycluster_controller.go:347 +0xea8
github.com/ray-project/kuberay/ray-operator/controllers/ray.(*RayClusterReconciler).Reconcile(0xc000140280, {0x1984ed8, 0xc0000010e0}, {{{0xc0004a7647, 0x7}, {0xc0004a76d0, 0xe}}})
    /workspace/controllers/ray/raycluster_controller.go:161 +0x21e
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile(0xc000314160, {0x1984ed8, 0xc000000fc0}, {{{0xc0004a7647?, 0x1670ce0?}, {0xc0004a76d0?, 0x408b34?}}})
    /opt/app-root/src/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:114 +0x28b
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc000314160, {0x1984e30, 0xc000166980}, {0x15bda00?, 0xc0004b0700?})
    /opt/app-root/src/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:311 +0x352
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc000314160, {0x1984e30, 0xc000166980})
    /opt/app-root/src/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:266 +0x1d9
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2()
    /opt/app-root/src/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:227 +0x85
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2
    /opt/app-root/src/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:223 +0x31c`
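For readers unfamiliar with this class of Go panic, here is a minimal sketch of how dereferencing an unset optional pointer field produces "invalid memory address or nil pointer dereference", and how a nil check turns it into an ordinary error. The trace does not show which value was nil at volcano_scheduler.go:55, so the `WorkerGroup` type, the `MinReplicas` field, and the helper functions below are all hypothetical and are not taken from the KubeRay source.

```go
package main

import (
	"errors"
	"fmt"
)

// WorkerGroup is a hypothetical stand-in for a spec struct with an
// optional field. It is NOT the KubeRay type.
type WorkerGroup struct {
	Name        string
	MinReplicas *int32 // nil when the user omits the field
}

// minReplicasUnsafe dereferences the pointer without checking it.
// It panics with a nil pointer dereference when MinReplicas is unset.
func minReplicasUnsafe(g WorkerGroup) int32 {
	return *g.MinReplicas
}

// minReplicasSafe checks for nil first and returns an error instead
// of crashing the whole process.
func minReplicasSafe(g WorkerGroup) (int32, error) {
	if g.MinReplicas == nil {
		return 0, errors.New("worker group " + g.Name + ": minReplicas is not set")
	}
	return *g.MinReplicas, nil
}

// panics runs f and reports whether it panicked.
func panics(f func()) (p bool) {
	defer func() {
		if r := recover(); r != nil {
			p = true
		}
	}()
	f()
	return false
}

func main() {
	g := WorkerGroup{Name: "workers"} // MinReplicas left unset

	fmt.Println("unsafe access panics:", panics(func() { _ = minReplicasUnsafe(g) }))

	if _, err := minReplicasSafe(g); err != nil {
		fmt.Println("safe access returns error:", err)
	}
}
```

In a controller, an unrecovered panic like this terminates the operator process, which would be consistent with the continuous restarts described above; returning an error would let the reconcile loop report the problem and retry instead.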

Versions / Dependencies

Ray: 1.0.0
Volcano: 1.6.0/1.8.0
Kubernetes: k3s 1.24

Reproduction script

The panic occurs when creating a RayCluster.

Issue Severity

High: It blocks me from completing my task.

jjyao commented 1 month ago

@MortalHappiness can you take a look at this one?

MortalHappiness commented 1 month ago

Hi @huzhennan, do you still have this problem after upgrading all the components? I cannot reproduce this error with a newer version of Ray when following this guide:

https://docs.ray.io/en/master/cluster/kubernetes/k8s-ecosystem/volcano.html

My tool versions:

I'll update the Volcano doc to use the v1.2.1 YAML files instead of v1.0.0, because the Ray version in them is too old (2.7.0).