
[Bug] `RayService` restarts repeatedly when a new config is applied #626

Closed: shrekris-anyscale closed this issue 1 year ago

shrekris-anyscale commented 2 years ago

KubeRay Component

Others

What happened + What you expected to happen

A user had this Serve app:

Python Code
from starlette.requests import Request
import joblib
import ray
from ray import serve
import pandas as pd
import os
import logging
from typing import List
from xgboost import DMatrix

MODEL_PATH = os.environ.get("FRAUD_MODEL_PATH", "./model.joblib")

@serve.deployment(num_replicas=3)
class FraudDetection:
    def __init__(self):
        with open(MODEL_PATH, "rb") as f:
            model = joblib.load(f)
        self.preprocessor = model['preprocessors']
        self.clf = model['clf'].get_booster()

    def predict(self, features: pd.DataFrame) -> str:
        data = self.preprocessor.transform(features)
        predictions = self.clf.predict(DMatrix(data))
        return predictions

    async def __call__(self, http_request: Request) -> str:
        features_json: List[dict] = await http_request.json()
        return self.predict(pd.DataFrame(features_json))

model = FraudDetection.bind()
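
A minimal client sketch (not part of the issue) for sending a prediction request to this deployment once it is reachable, assuming kubectl port-forward exposes the Serve port on localhost:8000; the feature column names below are placeholders rather than the user's real schema:

Example client (illustrative)
import requests

# Placeholder feature rows; the real model expects the columns it was trained on.
features = [
    {"amount": 120.50, "merchant_id": 42},
    {"amount": 9.99, "merchant_id": 7},
]

# The FraudDetection deployment is routed at "/" per the serveConfig below, and its
# __call__ parses the request body as a JSON list of feature dicts.
response = requests.post("http://localhost:8000/", json=features)
print(response.status_code, response.text)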
Docker Image
FROM rayproject/ray:2.0.0-py38-cu102

COPY requirements.txt .

RUN pip install --no-cache-dir --upgrade pip \
  && pip install --no-cache-dir -r requirements.txt \
  && rm requirements.txt

COPY src/* .

ENV FRAUD_MODEL_PATH=/home/ray/model.joblib
Kubernetes Config
#############################################################################################################
# RayService related
#############################################################################################################
# Make sure to increase resource requests and limits before using this example in production.
# For examples with more realistic resource configuration, see
# ray-cluster.complete.large.yaml and
# ray-cluster.autoscaler.large.yaml.
apiVersion: ray.io/v1alpha1
kind: RayService
metadata:
  name: rayservice-xgboost-model
spec:
  serviceUnhealthySecondThreshold: 300 # Config for the health check threshold for service. Default value is 60.
  deploymentUnhealthySecondThreshold: 300 # Config for the health check threshold for deployments. Default value is 60.
  serveConfig:
    importPath: example_xgboost.model
    deployments:
      - name: FraudDetection
        numReplicas: 3
        routePrefix: "/"
  rayClusterConfig:
    rayVersion: '2.0.0' # should match the Ray version in the image of the containers
    ######################headGroupSpecs#################################
    # head group template and specs, (perhaps 'group' is not needed in the name)
    headGroupSpec:
      # Kubernetes Service Type, valid values are 'ClusterIP', 'NodePort' and 'LoadBalancer'
      serviceType: ClusterIP
      # the pod replicas in this group typed head (assuming there could be more than 1 in the future)
      replicas: 1
      # logical group name, for this called head-group, also can be functional
      # pod type head or worker
      # rayNodeType: head # Not needed since it is under the headgroup
      # the following params are used to complete the ray start: ray start --head --block --redis-port=6379 ...
      rayStartParams:
        port: '6379' # should match container port named gcs-server
        #include_webui: 'true'
        object-store-memory: '100000000'
        # webui_host: "10.1.2.60"
        dashboard-host: '0.0.0.0'
        num-cpus: '2' # can be auto-completed from the limits
        node-ip-address: $MY_POD_IP # auto-completed as the head pod IP
        block: 'true'
      #pod template
      template:
        metadata:
          labels:
            # custom labels. NOTE: do not define custom labels starting with `raycluster.`; they may be used by the controller.
            # Refer to https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/
            rayCluster: raycluster-sample # will be injected if missing
            rayNodeType: head # will be injected if missing, must be head or worker
            groupName: headgroup # will be injected if missing
          # annotations for pod
          annotations:
            key: value
        spec:
          nodeSelector:
            node.kubernetes.io/instance-type: n1-standard-8
          containers:
            - name: ray-head
              image: gcr.io/shopify-docker-images/apps/app/ray-protwotype:addc9ad3b27d00ff4b64b4b9800c69d50ece7fa8
              imagePullPolicy: Always
              #image: bonsaidev.azurecr.io/bonsai/lazer-0-9-0-cpu:dev
              env:
                - name: MY_POD_IP
                  valueFrom:
                    fieldRef:
                      fieldPath: status.podIP
                - name: SERVE_DEPLOYMENT_HANDLE_IS_SYNC
                  value: "0"
              resources:
                limits:
                  cpu: 2
                  memory: 2Gi
                requests:
                  cpu: 2
                  memory: 2Gi
              ports:
                - containerPort: 6379
                  name: gcs-server
                - containerPort: 8265 # Ray dashboard
                  name: dashboard
                - containerPort: 10001
                  name: client
                - containerPort: 8000
                  name: serve
    workerGroupSpecs:
      # the pod replicas in this group typed worker
      - replicas: 3
        minReplicas: 3
        maxReplicas: 3
        # logical group name, for this called small-group, also can be functional
        groupName: small-group
        # if worker pods need to be added, we can simply increment the replicas
        # if worker pods need to be removed, we decrement the replicas, and populate the podsToDelete list
        # the operator will remove pods from the list until the number of replicas is satisfied
        # when a pod is confirmed to be deleted, its name will be removed from the list below
        #scaleStrategy:
        #  workersToDelete:
        #  - raycluster-complete-worker-small-group-bdtwh
        #  - raycluster-complete-worker-small-group-hv457
        #  - raycluster-complete-worker-small-group-k8tj7
        # the following params are used to complete the ray start: ray start --block --node-ip-address= ...
        rayStartParams:
          node-ip-address: $MY_POD_IP
          block: 'true'
        #pod template
        template:
          metadata:
            labels:
              key: value
            # annotations for pod
            annotations:
              key: value
          spec:
            nodeSelector:
              node.kubernetes.io/instance-type: n1-standard-8
            initContainers:
              # the env var $RAY_IP is set by the operator if missing, with the value of the head service name
              - name: init-myservice
                image: busybox:1.28
                command: ['sh', '-c', "until nslookup $RAY_IP.$(cat /var/run/secrets/kubernetes.io/serviceaccount/namespace).svc.cluster.local; do echo waiting for myservice; sleep 2; done"]
            containers:
              - name: machine-learning # must consist of lower case alphanumeric characters or '-', and must start and end with an alphanumeric character (e.g. 'my-name' or '123-abc')
                image: gcr.io/shopify-docker-images/apps/app/ray-protwotype:addc9ad3b27d00ff4b64b4b9800c69d50ece7fa8
                imagePullPolicy: Always
                # environment variables to set in the container. Optional.
                # Refer to https://kubernetes.io/docs/tasks/inject-data-application/define-environment-variable-container/
                env:
                  - name:  RAY_DISABLE_DOCKER_CPU_WARNING
                    value: "1"
                  - name: TYPE
                    value: "worker"
                  - name: CPU_REQUEST
                    valueFrom:
                      resourceFieldRef:
                        containerName: machine-learning
                        resource: requests.cpu
                  - name: CPU_LIMITS
                    valueFrom:
                      resourceFieldRef:
                        containerName: machine-learning
                        resource: limits.cpu
                  - name: MEMORY_LIMITS
                    valueFrom:
                      resourceFieldRef:
                        containerName: machine-learning
                        resource: limits.memory
                  - name: MEMORY_REQUESTS
                    valueFrom:
                      resourceFieldRef:
                        containerName: machine-learning
                        resource: requests.memory
                  - name: MY_POD_NAME
                    valueFrom:
                      fieldRef:
                        fieldPath: metadata.name
                  - name: MY_POD_IP
                    valueFrom:
                      fieldRef:
                        fieldPath: status.podIP
                  - name: SERVE_DEPLOYMENT_HANDLE_IS_SYNC
                    value: "0"
                ports:
                  - containerPort: 80
                    name: client
                lifecycle:
                  preStop:
                    exec:
                      command: ["/bin/sh","-c","ray stop"]
                resources:
                  limits:
                    cpu: "2"
                    memory: "2Gi"
                  requests:
                    cpu: "2"
                    memory: "2Gi"

Here's what the user described:

  • I run kubectl apply to create a namespace-scoped KubeRay operator and the RayService configuration above
  • Once the Kubernetes service is up, I port-forward to it. localhost:8000/-/routes returns an empty JSON object; there is no model at / like there should be (see the polling sketch after this list)
  • In the Ray dashboard, we see that a worker node is at 100% CPU usage. On this specific node, many log files are created (logs in next message)
  • After something like 10 minutes, the Ray actors SERVE_REPLICA::FraudDetection#XXXXX boot and the CPU usage returns to normal
  • After this, the model works as expected, and the route is added to /-/routes.
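
A small sketch for watching the routes endpoint while the service comes up, which makes the roughly 10-minute delay visible. The /-/routes path and port 8000 come from the report; everything else (polling interval, timeout) is illustrative and assumes a kubectl port-forward to the Serve port:

Route polling sketch (illustrative)
import time
import requests

# Poll /-/routes until the FraudDetection route shows up or we give up.
for _ in range(120):  # roughly 10 minutes at 5-second intervals
    try:
        routes = requests.get("http://localhost:8000/-/routes", timeout=5).json()
    except requests.RequestException:
        routes = {}
    if routes:
        print("Serve routes registered:", routes)
        break
    time.sleep(5)
else:
    print("No routes registered; the Serve replicas may still be starting.")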

The user saw many log files on the worker node:

(Screenshot of the worker node's log directory, showing many per-worker log files, omitted.)
Sample log file
[2022-10-05 15:01:38,418 I 3038 3038] core_worker_process.cc:107: Constructing CoreWorkerProcess. pid: 3038
[2022-10-05 15:01:38,424 I 3038 3038] grpc_server.cc:105: worker server started, listening on port 10116.
[2022-10-05 15:01:38,431 I 3038 3038] core_worker.cc:185: Initializing worker at address: 10.1.6.38:10116, worker ID afd0c46053c04eb898a1337a53c370b45330533025e991b85ca34380, raylet 4f8d80ae01f97ccd1a499c986c1aaee885dfa8d9bd4b47ec2eec1491
[2022-10-05 15:01:38,436 I 3038 3038] core_worker.cc:521: Adjusted worker niceness to 15
[2022-10-05 15:01:38,436 I 3038 3069] core_worker.cc:476: Event stats:

Global stats: 13 total (8 active)
Queueing time: mean = 56.736 us, max = 454.118 us, min = 51.559 us, total = 737.562 us
Execution time:  mean = 25.138 us, total = 326.790 us
Event stats:
    PeriodicalRunner.RunFnPeriodically - 6 total (3 active, 1 running), CPU time: mean = 8.508 us, total = 51.051 us
    UNKNOWN - 2 total (2 active), CPU time: mean = 0.000 s, total = 0.000 s
    WorkerInfoGcsService.grpc_client.AddWorkerInfo - 1 total (0 active), CPU time: mean = 79.198 us, total = 79.198 us
    InternalPubSubGcsService.grpc_client.GcsSubscriberCommandBatch - 1 total (0 active), CPU time: mean = 196.541 us, total = 196.541 us
    InternalPubSubGcsService.grpc_client.GcsSubscriberPoll - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
    CoreWorker.deadline_timer.flush_profiling_events - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
    NodeInfoGcsService.grpc_client.GetAllNodeInfo - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s

[2022-10-05 15:01:38,436 I 3038 3038] io_service_pool.cc:35: IOServicePool is running with 1 io_service.
[2022-10-05 15:01:38,436 I 3038 3069] accessor.cc:608: Received notification for node id = ea45e47b5d569afcf545ae75d2b5f7d970f5612c3d184a564b1b280d, IsAlive = 1
[2022-10-05 15:01:38,436 I 3038 3069] accessor.cc:608: Received notification for node id = 4abddd9f2921d99254261b560d5198f2ea3dedaeb33bbe71c8118a3e, IsAlive = 1
[2022-10-05 15:01:38,436 I 3038 3069] accessor.cc:608: Received notification for node id = 4f8d80ae01f97ccd1a499c986c1aaee885dfa8d9bd4b47ec2eec1491, IsAlive = 1
[2022-10-05 15:01:38,436 I 3038 3069] accessor.cc:608: Received notification for node id = 63a49bec1c6a8b7f6d1291aeeb8463ca3b310945a88c21809b0638c9, IsAlive = 1
[2022-10-05 15:01:40,212 I 3038 3069] core_worker.cc:2975: Cancelling a running task run_graph() thread id: 79db2a31f887a7bcffffffffffffffffffffffff01000000
[2022-10-05 15:01:40,960 I 3038 3038] core_worker.cc:606: Exit signal received, this process will exit after all outstanding tasks have finished, exit_type=INTENDED_USER_EXIT
[2022-10-05 15:01:40,960 I 3038 3038] core_worker.cc:593: Disconnecting to the raylet.
[2022-10-05 15:01:40,960 I 3038 3038] raylet_client.cc:163: RayletClient::Disconnect, exit_type=INTENDED_USER_EXIT, exit_detail=Worker exits by an user request. max_call has reached, max_calls: 1, has creation_task_exception_pb_bytes=0
[2022-10-05 15:01:40,960 I 3038 3038] core_worker.cc:540: Shutting down a core worker.
[2022-10-05 15:01:40,960 I 3038 3038] core_worker.cc:564: Disconnecting a GCS client.
[2022-10-05 15:01:40,960 I 3038 3038] core_worker.cc:568: Waiting for joining a core worker io thread. If it hangs here, there might be deadlock or a high load in the core worker io service.
[2022-10-05 15:01:40,960 I 3038 3069] core_worker.cc:691: Core worker main io service stopped.
[2022-10-05 15:01:40,960 I 3038 3038] core_worker.cc:577: Core worker ready to be deallocated.
[2022-10-05 15:01:40,960 I 3038 3038] core_worker_process.cc:240: Task execution loop terminated. Removing the global worker.
[2022-10-05 15:01:40,960 I 3038 3038] core_worker.cc:531: Core worker is destructed
[2022-10-05 15:01:41,250 I 3038 3038] core_worker_process.cc:144: Destructing CoreWorkerProcessImpl. pid: 3038
[2022-10-05 15:01:41,250 I 3038 3038] io_service_pool.cc:47: IOServicePool is stopped.

The .err files contained only `:task_name:run_graph`.
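
To quantify the log churn, one could tally the per-worker log files inside the affected worker pod (e.g. via kubectl exec). This is a sketch, not from the issue; it assumes Ray's default per-session log directory at /tmp/ray/session_latest/logs:

Log tally sketch (illustrative)
from collections import Counter
from pathlib import Path

LOG_DIR = Path("/tmp/ray/session_latest/logs")  # assumed default Ray log directory

# Group files by the text before the first '-' (a rough way to see which log type piles up).
counts = Counter(p.name.split("-")[0] for p in LOG_DIR.iterdir() if p.is_file())
for prefix, count in counts.most_common():
    print(f"{prefix}: {count}")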

Reproduction script

See above for code and logs.

Anything else

No response

shrekris-anyscale commented 2 years ago

We had a hypothesis that the user was running into #539, which was fixed by #540. The user upgraded KubeRay to master (which contains #540), but they could still reproduce the error.

  1. They upgraded the CRDs using
kubectl replace -k "github.com/ray-project/kuberay/manifests/cluster-scope-resources?ref=${KUBERAY_VERSION}&timeout=90s"
  2. Then, they created a new namespace-scoped operator using the following config. It uses the image kuberay/operator:nightly:

    Operator Config
    apiVersion: v1
    kind: Namespace
    metadata:
      name: ray-serve-prototype-v2
      labels:
        app: ray
        owner: alexandre.gariepy
    ---
    #############################################################################################################
    # Operator-related
    #############################################################################################################
    apiVersion: v1
    kind: ServiceAccount
    metadata:
      labels:
        app.kubernetes.io/component: kuberay-operator
        app.kubernetes.io/name: kuberay
        app: ray
        owner: alexandre.gariepy
      name: kuberay-operator
      namespace: ray-serve-prototype-v2
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: Role
    metadata:
      creationTimestamp: null
      labels:
        app.kubernetes.io/component: kuberay-operator
        app.kubernetes.io/name: kuberay
        owner: alexandre.gariepy
      name: kuberay-operator
      namespace: ray-serve-prototype-v2
    rules:
    - apiGroups:
      - coordination.k8s.io
      resources:
      - leases
      verbs:
      - create
      - get
      - list
      - update
    - apiGroups:
      - ""
      resources:
      - events
      verbs:
      - create
      - delete
      - get
      - list
      - patch
      - update
      - watch
    - apiGroups:
      - ""
      resources:
      - pods
      verbs:
      - create
      - delete
      - get
      - list
      - patch
      - update
      - watch
    - apiGroups:
      - ""
      resources:
      - pods/status
      verbs:
      - create
      - delete
      - get
      - list
      - patch
      - update
      - watch
    - apiGroups:
      - ""
      resources:
      - serviceaccounts
      verbs:
      - create
      - delete
      - get
      - list
      - watch
    - apiGroups:
      - ""
      resources:
      - services
      verbs:
      - create
      - delete
      - get
      - list
      - patch
      - update
      - watch
    - apiGroups:
      - ""
      resources:
      - services/status
      verbs:
      - get
      - patch
      - update
    - apiGroups:
      - extensions
      resources:
      - ingresses
      verbs:
      - create
      - delete
      - get
      - list
      - patch
      - update
      - watch
    - apiGroups:
      - networking.k8s.io
      resources:
      - ingressclasses
      verbs:
      - get
      - list
      - watch
    - apiGroups:
      - networking.k8s.io
      resources:
      - ingresses
      verbs:
      - create
      - delete
      - get
      - list
      - patch
      - update
      - watch
    - apiGroups:
      - ray.io
      resources:
      - rayclusters
      verbs:
      - create
      - delete
      - get
      - list
      - patch
      - update
      - watch
    - apiGroups:
      - ray.io
      resources:
      - rayclusters/finalizer
      verbs:
      - update
    - apiGroups:
      - ray.io
      resources:
      - rayclusters/status
      verbs:
      - get
      - patch
      - update
    - apiGroups:
      - ray.io
      resources:
      - rayjobs
      verbs:
      - create
      - delete
      - get
      - list
      - patch
      - update
      - watch
    - apiGroups:
      - ray.io
      resources:
      - rayjobs/finalizer
      verbs:
      - update
    - apiGroups:
      - ray.io
      resources:
      - rayjobs/status
      verbs:
      - get
      - patch
      - update
    - apiGroups:
      - ray.io
      resources:
      - rayservices
      verbs:
      - create
      - delete
      - get
      - list
      - patch
      - update
      - watch
    - apiGroups:
      - ray.io
      resources:
      - rayservices/finalizers
      verbs:
      - update
    - apiGroups:
      - ray.io
      resources:
      - rayservices/status
      verbs:
      - get
      - patch
      - update
    - apiGroups:
      - rbac.authorization.k8s.io
      resources:
      - rolebindings
      verbs:
      - create
      - delete
      - get
      - list
      - watch
    - apiGroups:
      - rbac.authorization.k8s.io
      resources:
      - roles
      verbs:
      - create
      - delete
      - get
      - list
      - update
      - watch
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: Role
    metadata:
      labels:
        app.kubernetes.io/component: kuberay-operator
        app.kubernetes.io/name: kuberay
        owner: alexandre.gariepy
      name: kuberay-operator-leader-election
      namespace: ray-serve-prototype-v2
    rules:
    - apiGroups:
      - ""
      resources:
      - configmaps
      verbs:
      - get
      - list
      - watch
      - create
      - update
      - patch
      - delete
    - apiGroups:
      - ""
      resources:
      - configmaps/status
      verbs:
      - get
      - update
      - patch
    - apiGroups:
      - ""
      resources:
      - events
      verbs:
      - create
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: RoleBinding
    metadata:
      labels:
        app.kubernetes.io/component: kuberay-operator
        app.kubernetes.io/name: kuberay
        owner: alexandre.gariepy
      name: kuberay-operator
      namespace: ray-serve-prototype-v2
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: Role
      name: kuberay-operator
    subjects:
    - kind: ServiceAccount
      name: kuberay-operator
      namespace: ray-serve-prototype-v2
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: RoleBinding
    metadata:
      labels:
        app.kubernetes.io/component: kuberay-operator
        app.kubernetes.io/name: kuberay
        owner: alexandre.gariepy
      name: kuberay-operator-leader-election
      namespace: ray-serve-prototype-v2
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: Role
      name: kuberay-operator-leader-election
    subjects:
    - kind: ServiceAccount
      name: kuberay-operator
      namespace: ray-serve-prototype-v2
    ---
    apiVersion: v1
    kind: Service
    metadata:
      annotations:
        prometheus.io/path: /metrics
        prometheus.io/port: "8080"
        prometheus.io/scrape: "true"
      labels:
        app.kubernetes.io/component: kuberay-operator
        app.kubernetes.io/name: kuberay
        owner: alexandre.gariepy
      name: kuberay-operator
      namespace: ray-serve-prototype-v2
    spec:
      ports:
      - name: monitoring-port
        port: 8080
        targetPort: 8080
      selector:
        app.kubernetes.io/component: kuberay-operator
        app.kubernetes.io/name: kuberay
      type: ClusterIP
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      labels:
        app.kubernetes.io/component: kuberay-operator
        app.kubernetes.io/name: kuberay
        app: ray
        owner: alexandre.gariepy
      name: kuberay-operator
      namespace: ray-serve-prototype-v2
    spec:
      replicas: 1
      selector:
        matchLabels:
          app.kubernetes.io/component: kuberay-operator
          app.kubernetes.io/name: kuberay
      template:
        metadata:
          labels:
            app.kubernetes.io/component: kuberay-operator
            app.kubernetes.io/name: kuberay
            owner: alexandre.gariepy
        spec:
          containers:
          - command:
            - /manager
            - -watch-namespace
            - ray-serve-prototype-v2
            image: kuberay/operator:nightly
            livenessProbe:
              failureThreshold: 5
              httpGet:
                path: /metrics
                port: http
              initialDelaySeconds: 10
              periodSeconds: 5
            name: kuberay-operator
            ports:
            - containerPort: 8080
              name: http
              protocol: TCP
            readinessProbe:
              failureThreshold: 5
              httpGet:
                path: /metrics
                port: http
              initialDelaySeconds: 10
              periodSeconds: 5
            resources:
              limits:
                cpu: 100m
                memory: 100Mi
              requests:
                cpu: 100m
                memory: 50Mi
            securityContext:
              allowPrivilegeEscalation: false
          securityContext:
            runAsNonRoot: true
          serviceAccountName: kuberay-operator
          terminationGracePeriodSeconds: 10
    
  3. Then, they deployed a RayService using the exact same config as the issue body:

    (The RayService config is omitted here; it is identical to the RayService config shown in the issue description above.)

Note: In this case, the issue occurred the first time the user deployed the RayService, not when they upgraded an existing RayService.

kevin85421 commented 1 year ago

Hi @shrekris-anyscale, does this issue still persist?

shrekris-anyscale commented 1 year ago

@sihanwang41 is this issue resolved by your change in #1014?

sihanwang41 commented 1 year ago

> @sihanwang41 is this issue resolved by your change in #1014?

yes!

shrekris-anyscale commented 1 year ago

Great! @kevin85421 this issue should be resolved, so I'm closing it.