ray-project / kuberay

A toolkit to run Ray applications on Kubernetes
Apache License 2.0

[Bug] RayJob Volcano integration #1580

Open zhiyi57 opened 1 year ago

zhiyi57 commented 1 year ago

Search before asking

KubeRay Component

ray-operator

What happened + What you expected to happen

When a RayJob is labeled with the Volcano scheduler and a queue, the Ray operator crashes. The RayCluster object is created with no status, and none of the cluster's pods are created. The RayJob is the sample job provided in the examples. This happens in both v0.6.0 and v1.0.0. With the same setup for a RayCluster, everything works as expected.

The log from ray operator is the following:

INFO    controllers.RayCluster  reconcileHeadService    {"1 head service found": "rayjob-sample-raycluster-klhw9-head-svc"}
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x13ad868]

goroutine 786 [running]:
github.com/ray-project/kuberay/ray-operator/controllers/ray/batchscheduler/volcano.(*VolcanoBatchScheduler).DoBatchSchedulingOnSubmission(0xc001758368?, 0xc008176500)
        /workspace/controllers/ray/batchscheduler/volcano/volcano_scheduler.go:55 +0xe8
github.com/ray-project/kuberay/ray-operator/controllers/ray.(*RayClusterReconciler).reconcilePods(0xc0003054f0, {0x194d558, 0xc008536ed0}, 0xc008176500)
        /workspace/controllers/ray/raycluster_controller.go:550 +0x23f
github.com/ray-project/kuberay/ray-operator/controllers/ray.(*RayClusterReconciler).rayClusterReconcile(0xc0003054f0, {0x194d558, 0xc008536ed0}, {{{0xc00812ee68, 0x8}, {0xc00814ab00, 0x1e}}}, 0xc008176500)
        /workspace/controllers/ray/raycluster_controller.go:340 +0xea8
github.com/ray-project/kuberay/ray-operator/controllers/ray.(*RayClusterReconciler).Reconcile(0xc0003054f0, {0x194d558, 0xc008536ed0}, {{{0xc00812ee68, 0x8}, {0xc00814ab00, 0x1e}}})
        /workspace/controllers/ray/raycluster_controller.go:158 +0x21e
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile(0xc00025edc0, {0x194d558, 0xc008536e10}, {{{0xc00812ee68?, 0x163d900?}, {0xc00814ab00?, 0x4045d4?}}})
        /opt/app-root/src/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:114 +0x28b
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc00025edc0, {0x194d4b0, 0xc000d2f040}, {0x158c6a0?, 0xc006dccf00?})
        /opt/app-root/src/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:311 +0x352
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc00025edc0, {0x194d4b0, 0xc000d2f040})
        /opt/app-root/src/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:266 +0x1d9
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2()
        /opt/app-root/src/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:227 +0x85
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2
        /opt/app-root/src/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:223 +0x31c

Reproduced consistently. Volcano works fine with RayCluster; it is only problematic with RayJob (both v0.6.0 and v1.0.0).

Reproduction script

apiVersion: ray.io/v1alpha1
kind: RayJob
metadata:
  name: rayjob-sample
  labels:
    ray.io/scheduler-name: volcano
    volcano.sh/queue-name: kuberay-test-queue
spec:
  entrypoint: python /home/ray/samples/sample_code.py
  rayClusterSpec:
    rayVersion: '2.7.0' # should match the Ray version in the image of the containers
    # Ray head pod template
    headGroupSpec:
    ...

Anything else

The immediate fix is pretty trivial. Not sure whether a further refactor is desired.

Are you willing to submit a PR?

zhiyi57 commented 1 year ago

@Jeffwan FYI

architkulkarni commented 1 year ago

@zhiyi57

The immediate fix is pretty trivial

What fix is this? Would you be willing to open a PR with the fix?

zhiyi57 commented 1 year ago

The basic fix is to check whether this field is set before using it. A better fix would be to set a default value on initialization. Not sure whether we would want a PR down this road, though.
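
For illustration only, the two options above might look roughly like the sketch below. The thread does not say which field is nil at volcano_scheduler.go:55, so `GroupSpec`, `Replicas`, `replicasOrDefault`, and `setDefaults` are hypothetical stand-ins and not the actual KubeRay types or functions; the default value of 1 is likewise an assumption.

```go
package main

import "fmt"

// GroupSpec is a hypothetical stand-in for a spec with an optional
// pointer field. It is NOT the real KubeRay type.
type GroupSpec struct {
	Replicas *int32 // nil when the user did not set it
}

// replicasOrDefault sketches the "check before use" option: it never
// dereferences a nil pointer and falls back to def instead.
func replicasOrDefault(g GroupSpec, def int32) int32 {
	if g.Replicas == nil {
		return def
	}
	return *g.Replicas
}

// setDefaults sketches the "default on initialization" option: it fills
// the field once so later code can dereference it safely.
// The default of 1 is an assumption made for this example.
func setDefaults(g *GroupSpec) {
	if g.Replicas == nil {
		one := int32(1)
		g.Replicas = &one
	}
}

func main() {
	unset := GroupSpec{}

	// Option 1: guard at the point of use.
	fmt.Println(replicasOrDefault(unset, 1))

	// Option 2: default once, then dereference freely.
	setDefaults(&unset)
	fmt.Println(*unset.Replicas)
}
```

The guard keeps the change local to the crashing call site, while defaulting up front protects every later reader of the field at the cost of touching the initialization path.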

hylent commented 8 months ago

Any updates on this issue?