ray-project / kuberay

A toolkit to run Ray applications on Kubernetes

[Bug] Ray operator crashes when specifying RayCluster with `resources.limits` but no `resources.requests` #2076

Closed kwohlfahrt closed 5 months ago

kwohlfahrt commented 5 months ago

Search before asking

KubeRay Component

ray-operator

What happened + What you expected to happen

I upgraded the ray operator (and CRDs) from version 1.0.0 to 1.1.0. After the upgrade, the ray operator pod was crash-looping, with errors like the following (1.0.0 worked as expected):

{"level":"info","ts":"2024-04-11T07:43:43.385Z","logger":"setup","msg":"Flag watchNamespace is not set. Watch custom resources in all namespaces."}
{"level":"info","ts":"2024-04-11T07:43:43.385Z","logger":"setup","msg":"Setup manager"}
{"level":"info","ts":"2024-04-11T07:43:43.785Z","logger":"setup","msg":"starting manager"}
{"level":"info","ts":"2024-04-11T07:43:43.785Z","logger":"controller-runtime.metrics","msg":"Starting metrics server"}
{"level":"info","ts":"2024-04-11T07:43:43.785Z","msg":"starting server","kind":"health probe","addr":"[::]:8082"}
{"level":"info","ts":"2024-04-11T07:43:43.785Z","logger":"controller-runtime.metrics","msg":"Serving metrics server","bindAddress":":8080","secure":false}
I0411 07:43:44.182159       1 leaderelection.go:250] attempting to acquire leader lease ray-system/ray-operator-leader...
I0411 07:44:00.183063       1 leaderelection.go:260] successfully acquired lease ray-system/ray-operator-leader
{"level":"info","ts":"2024-04-11T07:44:00.183Z","logger":"controllers.RayCluster","msg":"Starting EventSource","source":"kind source: *v1.RayCluster"}
{"level":"info","ts":"2024-04-11T07:44:00.184Z","logger":"controllers.RayCluster","msg":"Starting EventSource","source":"kind source: *v1.Pod"}
{"level":"info","ts":"2024-04-11T07:44:00.184Z","logger":"controllers.RayCluster","msg":"Starting EventSource","source":"kind source: *v1.Service"}
{"level":"info","ts":"2024-04-11T07:44:00.184Z","logger":"controllers.RayJob","msg":"Starting EventSource","source":"kind source: *v1.RayJob"}
{"level":"info","ts":"2024-04-11T07:44:00.184Z","logger":"controllers.RayJob","msg":"Starting EventSource","source":"kind source: *v1.RayCluster"}
{"level":"info","ts":"2024-04-11T07:44:00.184Z","logger":"controllers.RayJob","msg":"Starting EventSource","source":"kind source: *v1.Service"}
{"level":"info","ts":"2024-04-11T07:44:00.184Z","logger":"controllers.RayJob","msg":"Starting EventSource","source":"kind source: *v1.Job"}
{"level":"info","ts":"2024-04-11T07:44:00.184Z","logger":"controllers.RayJob","msg":"Starting Controller"}
{"level":"info","ts":"2024-04-11T07:44:00.184Z","logger":"controllers.RayService","msg":"Starting EventSource","source":"kind source: *v1.RayService"}
{"level":"info","ts":"2024-04-11T07:44:00.184Z","logger":"controllers.RayService","msg":"Starting EventSource","source":"kind source: *v1.RayCluster"}
{"level":"info","ts":"2024-04-11T07:44:00.184Z","logger":"controllers.RayService","msg":"Starting EventSource","source":"kind source: *v1.Service"}
{"level":"info","ts":"2024-04-11T07:44:00.184Z","logger":"controllers.RayService","msg":"Starting EventSource","source":"kind source: *v1.Ingress"}
{"level":"info","ts":"2024-04-11T07:44:00.184Z","logger":"controllers.RayService","msg":"Starting Controller"}
{"level":"info","ts":"2024-04-11T07:44:00.184Z","logger":"controllers.RayCluster","msg":"Starting Controller"}
{"level":"info","ts":"2024-04-11T07:44:00.385Z","logger":"controllers.RayCluster","msg":"Starting workers","worker count":1}
{"level":"info","ts":"2024-04-11T07:44:00.385Z","logger":"controllers.RayJob","msg":"Starting workers","worker count":1}
{"level":"info","ts":"2024-04-11T07:44:00.385Z","logger":"controllers.RayCluster","msg":"Read request instance not found error!","RayCluster":{"name":"raycluster-kuberay","namespace":"research"},"reconcileID":"431a795d-85b4-42fe-9d3b-f2904257f31f"}
{"level":"info","ts":"2024-04-11T07:44:00.482Z","logger":"controllers.RayService","msg":"Starting workers","worker count":1}
{"level":"info","ts":"2024-04-11T07:44:00.740Z","logger":"controllers.RayCluster","msg":"Read request instance not found error!","RayCluster":{"name":"raycluster-kuberay","namespace":"research"},"reconcileID":"450aa34e-f9d3-4787-bb20-940c00305ff5"}
{"level":"info","ts":"2024-04-11T07:44:00.770Z","logger":"controllers.RayCluster","msg":"Read request instance not found error!","RayCluster":{"name":"raycluster-kuberay","namespace":"research"},"reconcileID":"06cd7b6d-c69a-4f7b-bb4f-1209539b7afd"}
{"level":"info","ts":"2024-04-11T07:44:00.776Z","logger":"controllers.RayCluster","msg":"Read request instance not found error!","RayCluster":{"name":"raycluster-kuberay","namespace":"research"},"reconcileID":"7074d943-31b8-472a-8e1a-0bb10c9fb461"}
{"level":"info","ts":"2024-04-11T07:44:19.708Z","logger":"controllers.RayCluster","msg":"Read request instance not found error!","RayCluster":{"name":"raycluster-kuberay","namespace":"research"},"reconcileID":"a9e9e875-26d9-49a9-a2f7-837ffcd14145"}
{"level":"info","ts":"2024-04-11T07:44:20.604Z","logger":"controllers.RayCluster","msg":"Read request instance not found error!","RayCluster":{"name":"raycluster-kuberay","namespace":"research"},"reconcileID":"04fdf58e-012a-4cf7-a6b4-785fda3660b4"}
{"level":"info","ts":"2024-04-11T07:44:20.635Z","logger":"controllers.RayCluster","msg":"Read request instance not found error!","RayCluster":{"name":"raycluster-kuberay","namespace":"research"},"reconcileID":"ed5b0ba1-c424-4cf1-8cc5-34ef9067a605"}
{"level":"info","ts":"2024-04-11T07:44:20.650Z","logger":"controllers.RayCluster","msg":"Read request instance not found error!","RayCluster":{"name":"raycluster-kuberay","namespace":"research"},"reconcileID":"4bc4cf50-d943-4e22-9b82-e703069e451c"}
{"level":"info","ts":"2024-04-11T07:44:20.657Z","logger":"controllers.RayCluster","msg":"Read request instance not found error!","RayCluster":{"name":"raycluster-kuberay","namespace":"research"},"reconcileID":"9dda009f-2e0a-424c-927d-557c37d7048c"}
{"level":"info","ts":"2024-04-11T07:45:49.426Z","logger":"controllers.RayCluster","msg":"Reconciling Ingress","RayCluster":{"name":"raycluster-kuberay","namespace":"research"},"reconcileID":"ff76f4ce-535a-430c-b166-3bcd12496047"}
{"level":"info","ts":"2024-04-11T07:45:49.447Z","logger":"controllers.RayCluster","msg":"Pod Service created successfully","RayCluster":{"name":"raycluster-kuberay","namespace":"research"},"reconcileID":"ff76f4ce-535a-430c-b166-3bcd12496047","service name":"raycluster-kuberay-head-svc"}
{"level":"info","ts":"2024-04-11T07:45:49.447Z","logger":"controllers.RayCluster","msg":"reconcilePods","RayCluster":{"name":"raycluster-kuberay","namespace":"research"},"reconcileID":"ff76f4ce-535a-430c-b166-3bcd12496047","Found 0 head Pods; creating a head Pod for the RayCluster.":"raycluster-kuberay"}
{"level":"info","ts":"2024-04-11T07:45:49.447Z","logger":"controllers.RayCluster","msg":"head pod labels","RayCluster":{"name":"raycluster-kuberay","namespace":"research"},"reconcileID":"ff76f4ce-535a-430c-b166-3bcd12496047","labels":{"app.kubernetes.io/created-by":"kuberay-operator","app.kubernetes.io/name":"kuberay","ray.io/cluster":"raycluster-kuberay","ray.io/group":"headgroup","ray.io/identifier":"raycluster-kuberay-head","ray.io/is-ray-node":"yes","ray.io/node-type":"head"}}
{"level":"info","ts":"2024-04-11T07:45:49.447Z","logger":"controllers.RayCluster","msg":"generateRayStartCommand","RayCluster":{"name":"raycluster-kuberay","namespace":"research"},"reconcileID":"ff76f4ce-535a-430c-b166-3bcd12496047","nodeType":"head","rayStartParams":{"block":"true","dashboard-agent-listen-port":"52365","dashboard-host":"0.0.0.0","metrics-export-port":"8080"},"Ray container resource":{"limits":{"cpu":"1","memory":"1G"}}}
{"level":"info","ts":"2024-04-11T07:45:49.447Z","logger":"controllers.RayCluster","msg":"generateRayStartCommand","RayCluster":{"name":"raycluster-kuberay","namespace":"research"},"reconcileID":"ff76f4ce-535a-430c-b166-3bcd12496047","rayStartCmd":"ray start --head  --dashboard-agent-listen-port=52365  --num-cpus=1  --memory=1000000000  --dashboard-host=0.0.0.0  --metrics-export-port=8080  --block "}
{"level":"info","ts":"2024-04-11T07:45:49.447Z","logger":"controllers.RayCluster","msg":"BuildPod","RayCluster":{"name":"raycluster-kuberay","namespace":"research"},"reconcileID":"ff76f4ce-535a-430c-b166-3bcd12496047","rayNodeType":"head","generatedCmd":"ulimit -n 65536; ray start --head  --dashboard-agent-listen-port=52365  --num-cpus=1  --memory=1000000000  --dashboard-host=0.0.0.0  --metrics-export-port=8080  --block "}
{"level":"info","ts":"2024-04-11T07:45:49.447Z","logger":"controllers.RayCluster","msg":"Probes injection feature flag","RayCluster":{"name":"raycluster-kuberay","namespace":"research"},"reconcileID":"ff76f4ce-535a-430c-b166-3bcd12496047","enabled":true}
{"level":"info","ts":"2024-04-11T07:45:49.447Z","logger":"controllers.RayCluster","msg":"createHeadPod","RayCluster":{"name":"raycluster-kuberay","namespace":"research"},"reconcileID":"ff76f4ce-535a-430c-b166-3bcd12496047","head pod with name":"raycluster-kuberay-head-"}
{"level":"info","ts":"2024-04-11T07:45:49.510Z","logger":"controllers.RayCluster","msg":"Observed a panic in reconciler: assignment to entry in nil map","RayCluster":{"name":"raycluster-kuberay","namespace":"research"},"reconcileID":"ff76f4ce-535a-430c-b166-3bcd12496047"}
panic: assignment to entry in nil map [recovered]
    panic: assignment to entry in nil map

goroutine 252 [running]:
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile.func1()
    /home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:116 +0x1fa
panic({0x1845e80, 0x1cd5560})
    /opt/hostedtoolcache/go/1.20.14/x64/src/runtime/panic.go:884 +0x213
github.com/ray-project/kuberay/ray-operator/controllers/ray/utils.calculatePodResource({{0xc000bd3400, 0x1, 0x1}, {0x0, 0x0, 0x0}, {0xc000a64300, 0x1, 0x1}, {0x0, ...}, ...})
    /home/runner/work/kuberay/kuberay/ray-operator/controllers/ray/utils/util.go:343 +0x24c
github.com/ray-project/kuberay/ray-operator/controllers/ray/utils.CalculateDesiredResources(0xc00080c000)
    /home/runner/work/kuberay/kuberay/ray-operator/controllers/ray/utils/util.go:311 +0x9f
github.com/ray-project/kuberay/ray-operator/controllers/ray.(*RayClusterReconciler).calculateStatus(0xc0003e2700, {0x1cf04f8, 0xc000aff770}, 0xc0003c2c00?)
    /home/runner/work/kuberay/kuberay/ray-operator/controllers/ray/raycluster_controller.go:1262 +0x525
github.com/ray-project/kuberay/ray-operator/controllers/ray.(*RayClusterReconciler).rayClusterReconcile(0xc0003e2700, {0x1cf04f8, 0xc000aff770}, {{{0xc000a8f5d8?, 0x8?}, {0xc000f188b8?, 0x12?}}}, 0xc0003c2c00)
    /home/runner/work/kuberay/kuberay/ray-operator/controllers/ray/raycluster_controller.go:363 +0x1ed7
github.com/ray-project/kuberay/ray-operator/controllers/ray.(*RayClusterReconciler).Reconcile(0xc0003e2700, {0x1cf04f8, 0xc000aff770}, {{{0xc000a8f5d8?, 0x5?}, {0xc000f188b8?, 0xc000070d48?}}})
    /home/runner/work/kuberay/kuberay/ray-operator/controllers/ray/raycluster_controller.go:169 +0x225
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile(0x1cf2940?, {0x1cf04f8?, 0xc000aff770?}, {{{0xc000a8f5d8?, 0xb?}, {0xc000f188b8?, 0x0?}}})
    /home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:119 +0xc8
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc0003ea000, {0x1cf0450, 0xc00020dc20}, {0x18c8340?, 0xc000e31820?})
    /home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:316 +0x3ca
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc0003ea000, {0x1cf0450, 0xc00020dc20})
    /home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:266 +0x1d9
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2()
    /home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:227 +0x85
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2
    /home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:223 +0x587
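
The panic comes from writing into a nil map. Below is a minimal, self-contained Go sketch of the failure mode (an illustration only, not the actual KubeRay source at `util.go:343`): when a container sets only `resources.limits`, `Resources.Requests` is a nil `ResourceList`, and any code that assigns into it while merging limits into requests panics exactly as in the trace above.

package main

import (
    corev1 "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/api/resource"
)

func main() {
    // Container analogous to the RayCluster head group below: limits only, no requests.
    container := corev1.Container{
        Resources: corev1.ResourceRequirements{
            Limits: corev1.ResourceList{
                corev1.ResourceCPU:    resource.MustParse("1"),
                corev1.ResourceMemory: resource.MustParse("1G"),
            },
            // Requests is deliberately left nil, as in the manifest below.
        },
    }

    // Merging limits into the uninitialized Requests map panics with
    // "assignment to entry in nil map".
    for name, quantity := range container.Resources.Limits {
        container.Resources.Requests[name] = quantity
    }
}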

Reproduction script

  1. Install the ray operator from the Helm chart (no values specified)
  2. Apply the RayCluster YAML below
  3. Observe that the ray operator starts crash-looping
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: raycluster-kuberay
spec:
  headGroupSpec:
    serviceType: ClusterIP
    rayStartParams:
        dashboard-host: "0.0.0.0"
    template:
      spec:
        containers:
          - volumeMounts:
            - mountPath: /tmp/ray
              name: log-volume
            name: ray-head
            image: rayproject/ray:2.9.3
            resources:
              limits:
                cpu: "1"
                memory: 1G
        volumes:
          - emptyDir: {}
            name: log-volume

Anything else

Changing the YAML to the following (note the added `spec.headGroupSpec.template.spec.containers[0].resources.requests`) makes the operator work as expected.

apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: raycluster-kuberay
spec:
  headGroupSpec:
    serviceType: ClusterIP
    rayStartParams:
        dashboard-host: "0.0.0.0"
    template:
      spec:
        containers:
          - volumeMounts:
            - mountPath: /tmp/ray
              name: log-volume
            name: ray-head
            image: rayproject/ray:2.9.3
            resources:
              requests:
                cpu: "1"
                memory: 1G
              limits:
                cpu: "1"
                memory: 1G
        volumes:
          - emptyDir: {}
            name: log-volume

The crash persists even if worker groups are added as normal; this is just the minimal config that reproduces the bug.

I think specifying `resources.limits` without `resources.requests` should be allowed; this is documented Kubernetes behaviour, where the requests default to the specified limits.
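
For illustration, here is a hedged sketch of one way the operator could tolerate limits-only containers, mirroring that Kubernetes defaulting (the helper name and shape are my own, not the fix that was actually merged):

package ray

import corev1 "k8s.io/api/core/v1"

// effectiveRequests returns the container's requests, falling back to its limits
// for any resource without an explicit request, mirroring how Kubernetes defaults
// requests to limits. It never writes into a possibly-nil map.
func effectiveRequests(container corev1.Container) corev1.ResourceList {
    result := corev1.ResourceList{}
    for name, quantity := range container.Resources.Requests {
        result[name] = quantity
    }
    for name, quantity := range container.Resources.Limits {
        if _, ok := result[name]; !ok {
            result[name] = quantity
        }
    }
    return result
}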

On a more fundamental level, a single misconfigured RayCluster should probably not be able to crash the operator for the entire Kubernetes cluster.
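
As one possible hardening (an assumption on my part, not necessarily what the maintainers will choose), controller-runtime can recover panics per reconcile, so a panic is logged and surfaced as a reconcile error for that one resource instead of killing the operator process. A sketch of wiring that up:

package ray

import (
    ctrl "sigs.k8s.io/controller-runtime"
    "sigs.k8s.io/controller-runtime/pkg/controller"
    "sigs.k8s.io/controller-runtime/pkg/reconcile"

    rayv1 "github.com/ray-project/kuberay/ray-operator/apis/ray/v1"
)

// setupRayClusterController registers the reconciler with panic recovery enabled.
func setupRayClusterController(mgr ctrl.Manager, r reconcile.Reconciler) error {
    recoverPanic := true
    return ctrl.NewControllerManagedBy(mgr).
        For(&rayv1.RayCluster{}).
        WithOptions(controller.Options{RecoverPanic: &recoverPanic}).
        Complete(r)
}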

Are you willing to submit a PR?

kevin85421 commented 5 months ago

Thank you for reporting the issue!

kwohlfahrt commented 5 months ago

Thanks for the quick fix! @kevin85421, would it be possible to publish a 1.1.1 patch release with this change included?