Open huzhennan opened 9 months ago
@MortalHappiness can you take a look at this one?
Hi @huzhennan, do you still have this problem after upgrading all the components? I cannot reproduce this error with a newer version of Ray when following this guide:
https://docs.ray.io/en/master/cluster/kubernetes/k8s-ecosystem/volcano.html
My tool versions:
I'll update the volcano doc to use v1.2.1 yaml files instead of v1.0.0 because the Ray version in it is too old (2.7.0).
What happened + What you expected to happen
The kuberay-operator is restarting continuously and reporting the following panic:
`2024-01-31T10:54:43.616Z INFO controller.raycluster-controller Starting Controller {"reconciler group": "ray.io", "reconciler kind": "RayCluster"}
2024-01-31T10:54:43.879Z INFO controller.raycluster-controller Starting workers {"reconciler group": "ray.io", "reconciler kind": "RayCluster", "worker count": 1}
2024-01-31T10:54:43.879Z INFO controller.rayservice Starting workers {"reconciler group": "ray.io", "reconciler kind": "RayService", "worker count": 1}
2024-01-31T10:54:43.879Z INFO controller.rayjob Starting workers {"reconciler group": "ray.io", "reconciler kind": "RayJob", "worker count": 1}
2024-01-31T10:54:43.879Z INFO controllers.RayCluster reconciling RayCluster {"cluster name": "test-cluster-0"}
2024-01-31T10:54:43.879Z INFO controllers.RayCluster Reconciling Ingress
2024-01-31T10:54:43.879Z INFO controllers.RayCluster reconcileHeadService {"1 head service found": "test-cluster-0-head-svc"}
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x13cfc48]

goroutine 442 [running]:
github.com/ray-project/kuberay/ray-operator/controllers/ray/batchscheduler/volcano.(*VolcanoBatchScheduler).DoBatchSchedulingOnSubmission(0xc0006fe436?, 0xc0003d8000)
	/workspace/controllers/ray/batchscheduler/volcano/volcano_scheduler.go:55 +0xe8
github.com/ray-project/kuberay/ray-operator/controllers/ray.(*RayClusterReconciler).reconcilePods(0xc000140280, {0x1984ed8, 0xc0000010e0}, 0xc0003d8000)
	/workspace/controllers/ray/raycluster_controller.go:557 +0x23f
github.com/ray-project/kuberay/ray-operator/controllers/ray.(*RayClusterReconciler).rayClusterReconcile(0xc000140280, {0x1984ed8, 0xc0000010e0}, {{{0xc0004a7647, 0x7}, {0xc0004a76d0, 0xe}}}, 0xc0003d8000)
	/workspace/controllers/ray/raycluster_controller.go:347 +0xea8
github.com/ray-project/kuberay/ray-operator/controllers/ray.(*RayClusterReconciler).Reconcile(0xc000140280, {0x1984ed8, 0xc0000010e0}, {{{0xc0004a7647, 0x7}, {0xc0004a76d0, 0xe}}})
	/workspace/controllers/ray/raycluster_controller.go:161 +0x21e
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile(0xc000314160, {0x1984ed8, 0xc000000fc0}, {{{0xc0004a7647?, 0x1670ce0?}, {0xc0004a76d0?, 0x408b34?}}})
	/opt/app-root/src/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:114 +0x28b
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc000314160, {0x1984e30, 0xc000166980}, {0x15bda00?, 0xc0004b0700?})
	/opt/app-root/src/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:311 +0x352
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc000314160, {0x1984e30, 0xc000166980})
	/opt/app-root/src/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:266 +0x1d9
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2()
	/opt/app-root/src/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:227 +0x85
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2
	/opt/app-root/src/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:223 +0x31c`
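For readers unfamiliar with this class of panic: the trace shows a nil pointer dereference (`addr=0x0`) raised inside `DoBatchSchedulingOnSubmission`. The code at `volcano_scheduler.go:55` is not shown in this issue, so the following is only a generic Go sketch of how such a panic can arise from an unset optional field and how a nil guard avoids it. The `groupSpec` type, its `Count` field, and all function names are hypothetical and are not taken from KubeRay.

```go
package main

import (
	"errors"
	"fmt"
)

// groupSpec is a hypothetical stand-in for a resource spec with an optional
// field. It is NOT a real KubeRay or Volcano type.
type groupSpec struct {
	// Count is optional, so it is modeled as a pointer that may be nil.
	Count *int32
}

// totalUnsafe dereferences the optional field without checking it.
// If Count is nil, this panics with
// "invalid memory address or nil pointer dereference".
func totalUnsafe(spec *groupSpec) int32 {
	return *spec.Count + 1
}

// totalSafe returns an error instead of panicking when the field is unset.
func totalSafe(spec *groupSpec) (int32, error) {
	if spec == nil || spec.Count == nil {
		return 0, errors.New("count is not set")
	}
	return *spec.Count + 1, nil
}

// panics runs f and reports whether it panicked.
func panics(f func()) (didPanic bool) {
	defer func() {
		if r := recover(); r != nil {
			didPanic = true
		}
	}()
	f()
	return false
}

func main() {
	spec := &groupSpec{} // Count is left nil on purpose

	fmt.Println("unsafe version panics:", panics(func() { totalUnsafe(spec) }))

	_, err := totalSafe(spec)
	fmt.Println("safe version error:", err)
}
```

In a controller, an unrecovered panic like the one in `totalUnsafe` terminates the process, which would be consistent with the continuous restarts described above. Returning an error instead lets the reconcile loop log the problem and retry.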
Versions / Dependencies
Ray: 1.0.0
Volcano: 1.6.0/1.8.0
Kubernetes: k3s 1.24
Reproduction script
The panic occurs when creating a RayCluster.
Issue Severity
High: It blocks me from completing my task.