ray-project / kuberay

A toolkit to run Ray applications on Kubernetes

[Bug] RayService not deploying when enableInTreeAutoscaling is true #643

Closed. gariepyalex closed this issue 1 year ago.

gariepyalex commented 1 year ago

KubeRay Component

ray-operator

What happened + What you expected to happen

When deploying a RayService, the Ray cluster's pods are not started if enableInTreeAutoscaling is true, even though the RayCluster and RayService resources are created in the Kubernetes cluster.

Here are the logs of the operator:

sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:114
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:311
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:227
2022-10-17T20:13:18.415Z    DEBUG    events    Normal    {"object": {"kind":"RayService","namespace":"ray-serve-prototype-v2","name":"rayservice-xgboost-autoscaling-model","uid":"dcfd592d-208f-4f69-baca-87b0000627c7","apiVersion":"ray.io/v1alpha1","resourceVersion":"62464606"}, "reason": "WaitForDashboard", "message": "Service \"t-autoscaling-model-raycluster-9rtjq-dashboard-svc\" not found"}
2022-10-17T20:13:18.415Z    INFO    controllers.RayService    Reconciling the cluster component.    {"ServiceName": "ray-serve-prototype-v2/rayservice-xgboost-autoscaling-model"}
2022-10-17T20:13:18.416Z    INFO    controllers.RayService    Reconciling the Serve component.    {"ServiceName": "ray-serve-prototype-v2/rayservice-xgboost-autoscaling-model"}
pod name is too long: len = 67, we will shorten it by offset = 17
2022-10-17T20:13:18.490Z    ERROR    controllers.RayService    Fail to reconcileServe.    {"ServiceName": "ray-serve-prototype-v2/rayservice-xgboost-autoscaling-model", "error": "Service \"t-autoscaling-model-raycluster-9rtjq-dashboard-svc\" not found"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:114
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:311
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:227
2022-10-17T20:13:18.490Z    DEBUG    events    Normal    {"object": {"kind":"RayService","namespace":"ray-serve-prototype-v2","name":"rayservice-xgboost-autoscaling-model","uid":"dcfd592d-208f-4f69-baca-87b0000627c7","apiVersion":"ray.io/v1alpha1","resourceVersion":"62464606"}, "reason": "WaitForDashboard", "message": "Service \"t-autoscaling-model-raycluster-9rtjq-dashboard-svc\" not found"}
... [the same WaitForDashboard event, reconcile messages, and stack trace repeat every ~2 seconds; only the timestamps and resourceVersion change] ...

Reproduction script

Note that the following RayService deploys successfully when enableInTreeAutoscaling: true is removed.

apiVersion: ray.io/v1alpha1
kind: RayService
metadata:
  name: rayservice-xgboost-autoscaling-model
spec:
  serviceUnhealthySecondThreshold: 300 # Config for the health check threshold for service. Default value is 60.
  deploymentUnhealthySecondThreshold: 300 # Config for the health check threshold for deployments. Default value is 60.
  serveConfig:
    importPath: example_xgboost.model
  rayClusterConfig:
    rayVersion: '2.0.0' # should match the Ray version in the image of the containers
    enableInTreeAutoscaling: true
    ######################headGroupSpecs#################################
    # head group template and specs, (perhaps 'group' is not needed in the name)
    headGroupSpec:
      # Kubernetes Service Type, valid values are 'ClusterIP', 'NodePort' and 'LoadBalancer'
      serviceType: ClusterIP
      # the pod replicas in this group typed head (assuming there could be more than 1 in the future)
      replicas: 1
      # logical group name, for this called head-group, also can be functional
      # pod type head or worker
      # rayNodeType: head # Not needed since it is under the headgroup
      # the following params are used to complete the ray start: ray start --head --block --redis-port=6379 ...
      rayStartParams:
        port: '6379' # should match container port named gcs-server
        #include_webui: 'true'
        object-store-memory: '100000000'
        # webui_host: "10.1.2.60"
        dashboard-host: '0.0.0.0'
        num-cpus: '2' # can be auto-completed from the limits
        node-ip-address: $MY_POD_IP # auto-completed as the head pod IP
        block: 'true'
      #pod template
      template:
        metadata:
          labels:
            # custom labels. NOTE: do not define custom labels that start with `raycluster.`; they may be used by the controller.
            # Refer to https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/
            rayCluster: raycluster-sample # will be injected if missing
            rayNodeType: head # will be injected if missing, must be head or worker
            groupName: headgroup # will be injected if missing
          # annotations for pod
          annotations:
            key: value
        spec:
          nodeSelector:
            node.kubernetes.io/instance-type: n1-standard-8
          containers:
            - name: ray-head
              image:  <custom image based on rayproject/ray:2.0.0-py38-cu102>
              imagePullPolicy: Always
              #image: bonsaidev.azurecr.io/bonsai/lazer-0-9-0-cpu:dev
              env:
                - name: MY_POD_IP
                  valueFrom:
                    fieldRef:
                      fieldPath: status.podIP
                - name: SERVE_DEPLOYMENT_HANDLE_IS_SYNC
                  value: "0"
              resources:
                limits:
                  cpu: 4
                  memory: 4Gi
                requests:
                  cpu: 4
                  memory: 4Gi
              ports:
                - containerPort: 6379
                  name: gcs-server
                - containerPort: 8265 # Ray dashboard
                  name: dashboard
                - containerPort: 10001
                  name: client
                - containerPort: 8000
                  name: serve
    workerGroupSpecs:
      # the pod replicas in this group typed worker
      - replicas: 3
        minReplicas: 2
        maxReplicas: 5
        # logical group name, for this called small-group, also can be functional
        groupName: small-group
        # if worker pods need to be added, we can simply increment the replicas
        # if worker pods need to be removed, we decrement the replicas, and populate the podsToDelete list
        # the operator will remove pods from the list until the number of replicas is satisfied
        # when a pod is confirmed to be deleted, its name will be removed from the list below
        #scaleStrategy:
        #  workersToDelete:
        #  - raycluster-complete-worker-small-group-bdtwh
        #  - raycluster-complete-worker-small-group-hv457
        #  - raycluster-complete-worker-small-group-k8tj7
        # the following params are used to complete the ray start: ray start --block --node-ip-address= ...
        rayStartParams:
          node-ip-address: $MY_POD_IP
          block: 'true'
        #pod template
        template:
          metadata:
            labels:
              key: value
            # annotations for pod
            annotations:
              key: value
          spec:
            nodeSelector:
              node.kubernetes.io/instance-type: n1-standard-8
            initContainers:
              # the env var $RAY_IP is set by the operator if missing, with the value of the head service name
              - name: init-myservice
                image: busybox:1.28
                command: ['sh', '-c', "until nslookup $RAY_IP.$(cat /var/run/secrets/kubernetes.io/serviceaccount/namespace).svc.cluster.local; do echo waiting for myservice; sleep 2; done"]
            containers:
              - name: machine-learning # must consist of lowercase alphanumeric characters or '-', and must start and end with an alphanumeric character (e.g. 'my-name' or '123-abc')
                image: <custom image based on rayproject/ray:2.0.0-py38-cu102>
                imagePullPolicy: Always
                # environment variables to set in the container. Optional.
                # Refer to https://kubernetes.io/docs/tasks/inject-data-application/define-environment-variable-container/
                env:
                  - name:  RAY_DISABLE_DOCKER_CPU_WARNING
                    value: "1"
                  - name: TYPE
                    value: "worker"
                  - name: CPU_REQUEST
                    valueFrom:
                      resourceFieldRef:
                        containerName: machine-learning
                        resource: requests.cpu
                  - name: CPU_LIMITS
                    valueFrom:
                      resourceFieldRef:
                        containerName: machine-learning
                        resource: limits.cpu
                  - name: MEMORY_LIMITS
                    valueFrom:
                      resourceFieldRef:
                        containerName: machine-learning
                        resource: limits.memory
                  - name: MEMORY_REQUESTS
                    valueFrom:
                      resourceFieldRef:
                        containerName: machine-learning
                        resource: requests.memory
                  - name: MY_POD_NAME
                    valueFrom:
                      fieldRef:
                        fieldPath: metadata.name
                  - name: MY_POD_IP
                    valueFrom:
                      fieldRef:
                        fieldPath: status.podIP
                  - name: SERVE_DEPLOYMENT_HANDLE_IS_SYNC
                    value: "0"
                ports:
                  - containerPort: 80
                    name: client
                lifecycle:
                  preStop:
                    exec:
                      command: ["/bin/sh","-c","ray stop"]
                resources:
                  limits:
                    cpu: "2"
                    memory: "2Gi"
                  requests:
                    cpu: "2"
                    memory: "2Gi"

Anything else

I'm using a namespace-scoped operator and the nightly image of KubeRay.

apiVersion: v1
kind: Namespace
metadata:
  name: ray-serve-prototype-v2
  labels:
    app: ray
---
#############################################################################################################
# Operator-related
#############################################################################################################
apiVersion: v1
kind: ServiceAccount
metadata:
  labels:
    app.kubernetes.io/component: kuberay-operator
    app.kubernetes.io/name: kuberay
    app: ray
  name: kuberay-operator
  namespace: ray-serve-prototype-v2
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  creationTimestamp: null
  labels:
    app.kubernetes.io/component: kuberay-operator
    app.kubernetes.io/name: kuberay
  name: kuberay-operator
  namespace: ray-serve-prototype-v2
rules:
- apiGroups:
  - coordination.k8s.io
  resources:
  - leases
  verbs:
  - create
  - get
  - list
  - update
- apiGroups:
  - ""
  resources:
  - events
  verbs:
  - create
  - delete
  - get
  - list
  - patch
  - update
  - watch
- apiGroups:
  - ""
  resources:
  - pods
  verbs:
  - create
  - delete
  - get
  - list
  - patch
  - update
  - watch
- apiGroups:
  - ""
  resources:
  - pods/status
  verbs:
  - create
  - delete
  - get
  - list
  - patch
  - update
  - watch
- apiGroups:
  - ""
  resources:
  - serviceaccounts
  verbs:
  - create
  - delete
  - get
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - services
  verbs:
  - create
  - delete
  - get
  - list
  - patch
  - update
  - watch
- apiGroups:
  - ""
  resources:
  - services/status
  verbs:
  - get
  - patch
  - update
- apiGroups:
  - extensions
  resources:
  - ingresses
  verbs:
  - create
  - delete
  - get
  - list
  - patch
  - update
  - watch
- apiGroups:
  - networking.k8s.io
  resources:
  - ingressclasses
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - networking.k8s.io
  resources:
  - ingresses
  verbs:
  - create
  - delete
  - get
  - list
  - patch
  - update
  - watch
- apiGroups:
  - ray.io
  resources:
  - rayclusters
  verbs:
  - create
  - delete
  - get
  - list
  - patch
  - update
  - watch
- apiGroups:
  - ray.io
  resources:
  - rayclusters/finalizer
  verbs:
  - update
- apiGroups:
  - ray.io
  resources:
  - rayclusters/status
  verbs:
  - get
  - patch
  - update
- apiGroups:
  - ray.io
  resources:
  - rayjobs
  verbs:
  - create
  - delete
  - get
  - list
  - patch
  - update
  - watch
- apiGroups:
  - ray.io
  resources:
  - rayjobs/finalizer
  verbs:
  - update
- apiGroups:
  - ray.io
  resources:
  - rayjobs/status
  verbs:
  - get
  - patch
  - update
- apiGroups:
  - ray.io
  resources:
  - rayservices
  verbs:
  - create
  - delete
  - get
  - list
  - patch
  - update
  - watch
- apiGroups:
  - ray.io
  resources:
  - rayservices/finalizers
  verbs:
  - update
- apiGroups:
  - ray.io
  resources:
  - rayservices/status
  verbs:
  - get
  - patch
  - update
- apiGroups:
  - rbac.authorization.k8s.io
  resources:
  - rolebindings
  verbs:
  - create
  - delete
  - get
  - list
  - watch
- apiGroups:
  - rbac.authorization.k8s.io
  resources:
  - roles
  verbs:
  - create
  - delete
  - get
  - list
  - update
  - watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  labels:
    app.kubernetes.io/component: kuberay-operator
    app.kubernetes.io/name: kuberay
  name: kuberay-operator-leader-election
  namespace: ray-serve-prototype-v2
rules:
- apiGroups:
  - ""
  resources:
  - configmaps
  verbs:
  - get
  - list
  - watch
  - create
  - update
  - patch
  - delete
- apiGroups:
  - ""
  resources:
  - configmaps/status
  verbs:
  - get
  - update
  - patch
- apiGroups:
  - ""
  resources:
  - events
  verbs:
  - create
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  labels:
    app.kubernetes.io/component: kuberay-operator
    app.kubernetes.io/name: kuberay
  name: kuberay-operator
  namespace: ray-serve-prototype-v2
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: kuberay-operator
subjects:
- kind: ServiceAccount
  name: kuberay-operator
  namespace: ray-serve-prototype-v2
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  labels:
    app.kubernetes.io/component: kuberay-operator
    app.kubernetes.io/name: kuberay
  name: kuberay-operator-leader-election
  namespace: ray-serve-prototype-v2
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: kuberay-operator-leader-election
subjects:
- kind: ServiceAccount
  name: kuberay-operator
  namespace: ray-serve-prototype-v2
---
apiVersion: v1
kind: Service
metadata:
  annotations:
    prometheus.io/path: /metrics
    prometheus.io/port: "8080"
    prometheus.io/scrape: "true"
  labels:
    app.kubernetes.io/component: kuberay-operator
    app.kubernetes.io/name: kuberay
  name: kuberay-operator
  namespace: ray-serve-prototype-v2
spec:
  ports:
  - name: monitoring-port
    port: 8080
    targetPort: 8080
  selector:
    app.kubernetes.io/component: kuberay-operator
    app.kubernetes.io/name: kuberay
  type: ClusterIP
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app.kubernetes.io/component: kuberay-operator
    app.kubernetes.io/name: kuberay
    app: ray
  name: kuberay-operator
  namespace: ray-serve-prototype-v2
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/component: kuberay-operator
      app.kubernetes.io/name: kuberay
  template:
    metadata:
      labels:
        app.kubernetes.io/component: kuberay-operator
        app.kubernetes.io/name: kuberay
    spec:
      containers:
      - command:
        - /manager
        - -watch-namespace
        - ray-serve-prototype-v2
        image: kuberay/operator:nightly
        livenessProbe:
          failureThreshold: 5
          httpGet:
            path: /metrics
            port: http
          initialDelaySeconds: 10
          periodSeconds: 5
        name: kuberay-operator
        ports:
        - containerPort: 8080
          name: http
          protocol: TCP
        readinessProbe:
          failureThreshold: 5
          httpGet:
            path: /metrics
            port: http
          initialDelaySeconds: 10
          periodSeconds: 5
        resources:
          limits:
            cpu: 100m
            memory: 100Mi
          requests:
            cpu: 100m
            memory: 50Mi
        securityContext:
          allowPrivilegeEscalation: false
      securityContext:
        runAsNonRoot: true
      serviceAccountName: kuberay-operator
      terminationGracePeriodSeconds: 10

gariepyalex commented 1 year ago

cc. @DmitriGekhtman

DmitriGekhtman commented 1 year ago

Thanks for all of the details! I will take a look.

DmitriGekhtman commented 1 year ago

Reproduced the issue with the provided config. Looking into causes.

DmitriGekhtman commented 1 year ago

I've identified the issue -- there's a bug stemming from an inconsistency in how the RayCluster controller names the autoscaler's Role. The bug only occurs when the RayCluster's name is long enough, which is liable to happen with the RayCluster names generated by the RayService controller.
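
Roughly speaking, the failure mode is a mismatch like the one below (illustrative resource names only, not the operator's actual output): when the generated name is long, one code path shortens it before creating the autoscaler's Role, while another path still refers to it under the unshortened name, so the reference points at a Role that was never created.

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  # the Role is created under the shortened name
  name: t-autoscaling-model-raycluster-9rtjq
  namespace: ray-serve-prototype-v2
rules: []  # rules omitted for brevity
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: t-autoscaling-model-raycluster-9rtjq
  namespace: ray-serve-prototype-v2
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  # ...but is referenced under the full, unshortened name, which does not exist
  name: rayservice-xgboost-autoscaling-model-raycluster-9rtjq
subjects:
- kind: ServiceAccount
  name: t-autoscaling-model-raycluster-9rtjq
  namespace: ray-serve-prototype-v2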

I will open a PR fixing the bug.

The short-term workaround is to use a shorter name for your RayService.

DmitriGekhtman commented 1 year ago

I was able to deploy the RayService successfully by shortening its name to "rxam".
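
Concretely, only the metadata.name of the RayService in the reproduction script needs to change; the rest of the spec can stay as-is:

apiVersion: ray.io/v1alpha1
kind: RayService
metadata:
  name: rxam  # a short name keeps the generated RayCluster/Service names within the length limits
spec:
  # ... unchanged from the reproduction script above ...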

shopigarner commented 1 year ago

Not to conflate issues, but we're also seeing the head-svc endpoint names being truncated for names longer than 50 characters. Could this be related?

DmitriGekhtman commented 1 year ago

Truncation is necessary due to Kubernetes name length limits. Let me go back and fix the issue with character length limits -- it slipped my mind.
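
For context, the 67/17 numbers in the logs above line up with this: the dashboard Service name derived from this RayService is 67 characters, which exceeds the 63-character DNS label limit for Service names, so the operator drops the leading 17 characters to get down to 50 (the extra headroom is presumably reserved for generated suffixes):

rayservice-xgboost-autoscaling-model-raycluster-9rtjq-dashboard-svc   # 67 characters: too long
                 t-autoscaling-model-raycluster-9rtjq-dashboard-svc   # leading 17 characters dropped -> 50 characters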

DmitriGekhtman commented 1 year ago

Fix: https://github.com/ray-project/kuberay/pull/689