
[Bug] `RayService` restarts repeatedly when a new config is applied #626

Closed: shrekris-anyscale closed this issue 1 year ago

shrekris-anyscale commented 2 years ago

KubeRay Component

Others

What happened + What you expected to happen

A user had this Serve app:

Python Code
from starlette.requests import Request
import joblib
import ray
from ray import serve
import pandas as pd
import os
import logging
from typing import List
from xgboost import DMatrix

MODEL_PATH = os.environ.get("FRAUD_MODEL_PATH", "./model.joblib")

@serve.deployment(num_replicas=3)
class FraudDetection:
    def __init__(self):
        with open(MODEL_PATH, "rb") as f:
            model = joblib.load(f)
        self.preprocessor = model['preprocessors']
        self.clf = model['clf'].get_booster()

    def predict(self, features: pd.DataFrame) -> str:
        data = self.preprocessor.transform(features)
        predictions = self.clf.predict(DMatrix(data))
        return predictions

    async def __call__(self, http_request: Request) -> str:
        features_json: List[dict] = await http_request.json()
        return self.predict(pd.DataFrame(features_json))

model = FraudDetection.bind()
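
A minimal client sketch (not part of the issue) for sending a prediction request to this deployment once it is reachable, assuming kubectl port-forward exposes the Serve port on localhost:8000; the feature column names below are placeholders rather than the user's real schema:

Example client (illustrative)
import requests

# Placeholder feature rows; the real model expects the columns it was trained on.
features = [
    {"amount": 120.50, "merchant_id": 42},
    {"amount": 9.99, "merchant_id": 7},
]

# The FraudDetection deployment is routed at "/" per the serveConfig below, and its
# __call__ parses the request body as a JSON list of feature dicts.
response = requests.post("http://localhost:8000/", json=features)
print(response.status_code, response.text)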
Docker Image
FROM rayproject/ray:2.0.0-py38-cu102

COPY requirements.txt .

RUN pip install --no-cache-dir --upgrade pip \
  && pip install --no-cache-dir -r requirements.txt \
  && rm requirements.txt

COPY src/* .

ENV FRAUD_MODEL_PATH=/home/ray/model.joblib
Kubernetes Config
#############################################################################################################
# RayService related
#############################################################################################################
# Make sure to increase resource requests and limits before using this example in production.
# For examples with more realistic resource configuration, see
# ray-cluster.complete.large.yaml and
# ray-cluster.autoscaler.large.yaml.
apiVersion: ray.io/v1alpha1
kind: RayService
metadata:
  name: rayservice-xgboost-model
spec:
  serviceUnhealthySecondThreshold: 300 # Config for the health check threshold for service. Default value is 60.
  deploymentUnhealthySecondThreshold: 300 # Config for the health check threshold for deployments. Default value is 60.
  serveConfig:
    importPath: example_xgboost.model
    deployments:
      - name: FraudDetection
        numReplicas: 3
        routePrefix: "/"
  rayClusterConfig:
    rayVersion: '2.0.0' # should match the Ray version in the image of the containers
    ######################headGroupSpecs#################################
    # head group template and specs, (perhaps 'group' is not needed in the name)
    headGroupSpec:
      # Kubernetes Service Type, valid values are 'ClusterIP', 'NodePort' and 'LoadBalancer'
      serviceType: ClusterIP
      # the pod replicas in this group typed head (assuming there could be more than 1 in the future)
      replicas: 1
      # logical group name, for this called head-group, also can be functional
      # pod type head or worker
      # rayNodeType: head # Not needed since it is under the headgroup
      # the following params are used to complete the ray start: ray start --head --block --redis-port=6379 ...
      rayStartParams:
        port: '6379' # should match container port named gcs-server
        #include_webui: 'true'
        object-store-memory: '100000000'
        # webui_host: "10.1.2.60"
        dashboard-host: '0.0.0.0'
        num-cpus: '2' # can be auto-completed from the limits
        node-ip-address: $MY_POD_IP # auto-completed as the head pod IP
        block: 'true'
      #pod template
      template:
        metadata:
          labels:
            # custom labels. NOTE: do not define custom labels starting with `raycluster.`; they may be used by the controller.
            # Refer to https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/
            rayCluster: raycluster-sample # will be injected if missing
            rayNodeType: head # will be injected if missing, must be head or worker
            groupName: headgroup # will be injected if missing
          # annotations for pod
          annotations:
            key: value
        spec:
          nodeSelector:
            node.kubernetes.io/instance-type: n1-standard-8
          containers:
            - name: ray-head
              image: gcr.io/shopify-docker-images/apps/app/ray-protwotype:addc9ad3b27d00ff4b64b4b9800c69d50ece7fa8
              imagePullPolicy: Always
              #image: bonsaidev.azurecr.io/bonsai/lazer-0-9-0-cpu:dev
              env:
                - name: MY_POD_IP
                  valueFrom:
                    fieldRef:
                      fieldPath: status.podIP
                - name: SERVE_DEPLOYMENT_HANDLE_IS_SYNC
                  value: "0"
              resources:
                limits:
                  cpu: 2
                  memory: 2Gi
                requests:
                  cpu: 2
                  memory: 2Gi
              ports:
                - containerPort: 6379
                  name: gcs-server
                - containerPort: 8265 # Ray dashboard
                  name: dashboard
                - containerPort: 10001
                  name: client
                - containerPort: 8000
                  name: serve
    workerGroupSpecs:
      # the pod replicas in this group typed worker
      - replicas: 3
        minReplicas: 3
        maxReplicas: 3
        # logical group name, for this called small-group, also can be functional
        groupName: small-group
        # if worker pods need to be added, we can simply increment the replicas
        # if worker pods need to be removed, we decrement the replicas, and populate the podsToDelete list
        # the operator will remove pods from the list until the number of replicas is satisfied
        # when a pod is confirmed to be deleted, its name will be removed from the list below
        #scaleStrategy:
        #  workersToDelete:
        #  - raycluster-complete-worker-small-group-bdtwh
        #  - raycluster-complete-worker-small-group-hv457
        #  - raycluster-complete-worker-small-group-k8tj7
        # the following params are used to complete the ray start: ray start --block --node-ip-address= ...
        rayStartParams:
          node-ip-address: $MY_POD_IP
          block: 'true'
        #pod template
        template:
          metadata:
            labels:
              key: value
            # annotations for pod
            annotations:
              key: value
          spec:
            nodeSelector:
              node.kubernetes.io/instance-type: n1-standard-8
            initContainers:
              # the env var $RAY_IP is set by the operator if missing, with the value of the head service name
              - name: init-myservice
                image: busybox:1.28
                command: ['sh', '-c', "until nslookup $RAY_IP.$(cat /var/run/secrets/kubernetes.io/serviceaccount/namespace).svc.cluster.local; do echo waiting for myservice; sleep 2; done"]
            containers:
              - name: machine-learning # must consist of lower case alphanumeric characters or '-', and must start and end with an alphanumeric character (e.g. 'my-name' or '123-abc')
                image: gcr.io/shopify-docker-images/apps/app/ray-protwotype:addc9ad3b27d00ff4b64b4b9800c69d50ece7fa8
                imagePullPolicy: Always
                # environment variables to set in the container. Optional.
                # Refer to https://kubernetes.io/docs/tasks/inject-data-application/define-environment-variable-container/
                env:
                  - name:  RAY_DISABLE_DOCKER_CPU_WARNING
                    value: "1"
                  - name: TYPE
                    value: "worker"
                  - name: CPU_REQUEST
                    valueFrom:
                      resourceFieldRef:
                        containerName: machine-learning
                        resource: requests.cpu
                  - name: CPU_LIMITS
                    valueFrom:
                      resourceFieldRef:
                        containerName: machine-learning
                        resource: limits.cpu
                  - name: MEMORY_LIMITS
                    valueFrom:
                      resourceFieldRef:
                        containerName: machine-learning
                        resource: limits.memory
                  - name: MEMORY_REQUESTS
                    valueFrom:
                      resourceFieldRef:
                        containerName: machine-learning
                        resource: requests.memory
                  - name: MY_POD_NAME
                    valueFrom:
                      fieldRef:
                        fieldPath: metadata.name
                  - name: MY_POD_IP
                    valueFrom:
                      fieldRef:
                        fieldPath: status.podIP
                  - name: SERVE_DEPLOYMENT_HANDLE_IS_SYNC
                    value: "0"
                ports:
                  - containerPort: 80
                    name: client
                lifecycle:
                  preStop:
                    exec:
                      command: ["/bin/sh","-c","ray stop"]
                resources:
                  limits:
                    cpu: "2"
                    memory: "2Gi"
                  requests:
                    cpu: "2"
                    memory: "2Gi"

Here's what the user described:

  • I run kubectl apply to create a namespace-scoped KubeRay operator and the RayService configuration above
  • Once the Kubernetes service is up, I port-forward to it. localhost:8000/-/routes returns an empty JSON object; there is no model at / like there should be (see the polling sketch after this list)
  • In the Ray dashboard, we see that a worker node is at 100% CPU usage. On this specific node, many log files are created (logs in next message)
  • After something like 10 minutes, the Ray actors SERVE_REPLICA::FraudDetection#XXXXX boot and the CPU usage returns to normal
  • After this, the model works as expected, and the route is added to /-/routes.
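
A small sketch for watching the routes endpoint while the service comes up, which makes the roughly 10-minute delay visible. The /-/routes path and port 8000 come from the report; everything else (polling interval, timeout) is illustrative and assumes a kubectl port-forward to the Serve port:

Route polling sketch (illustrative)
import time
import requests

# Poll /-/routes until the FraudDetection route shows up or we give up.
for _ in range(120):  # roughly 10 minutes at 5-second intervals
    try:
        routes = requests.get("http://localhost:8000/-/routes", timeout=5).json()
    except requests.RequestException:
        routes = {}
    if routes:
        print("Serve routes registered:", routes)
        break
    time.sleep(5)
else:
    print("No routes registered; the Serve replicas may still be starting.")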

The user saw many log files on the worker node:

(Screenshot of the worker node's log directory, showing many per-worker log files, omitted.)
Sample log file
[2022-10-05 15:01:38,418 I 3038 3038] core_worker_process.cc:107: Constructing CoreWorkerProcess. pid: 3038
[2022-10-05 15:01:38,424 I 3038 3038] grpc_server.cc:105: worker server started, listening on port 10116.
[2022-10-05 15:01:38,431 I 3038 3038] core_worker.cc:185: Initializing worker at address: 10.1.6.38:10116, worker ID afd0c46053c04eb898a1337a53c370b45330533025e991b85ca34380, raylet 4f8d80ae01f97ccd1a499c986c1aaee885dfa8d9bd4b47ec2eec1491
[2022-10-05 15:01:38,436 I 3038 3038] core_worker.cc:521: Adjusted worker niceness to 15
[2022-10-05 15:01:38,436 I 3038 3069] core_worker.cc:476: Event stats:

Global stats: 13 total (8 active)
Queueing time: mean = 56.736 us, max = 454.118 us, min = 51.559 us, total = 737.562 us
Execution time:  mean = 25.138 us, total = 326.790 us
Event stats:
    PeriodicalRunner.RunFnPeriodically - 6 total (3 active, 1 running), CPU time: mean = 8.508 us, total = 51.051 us
    UNKNOWN - 2 total (2 active), CPU time: mean = 0.000 s, total = 0.000 s
    WorkerInfoGcsService.grpc_client.AddWorkerInfo - 1 total (0 active), CPU time: mean = 79.198 us, total = 79.198 us
    InternalPubSubGcsService.grpc_client.GcsSubscriberCommandBatch - 1 total (0 active), CPU time: mean = 196.541 us, total = 196.541 us
    InternalPubSubGcsService.grpc_client.GcsSubscriberPoll - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
    CoreWorker.deadline_timer.flush_profiling_events - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
    NodeInfoGcsService.grpc_client.GetAllNodeInfo - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s

[2022-10-05 15:01:38,436 I 3038 3038] io_service_pool.cc:35: IOServicePool is running with 1 io_service.
[2022-10-05 15:01:38,436 I 3038 3069] accessor.cc:608: Received notification for node id = ea45e47b5d569afcf545ae75d2b5f7d970f5612c3d184a564b1b280d, IsAlive = 1
[2022-10-05 15:01:38,436 I 3038 3069] accessor.cc:608: Received notification for node id = 4abddd9f2921d99254261b560d5198f2ea3dedaeb33bbe71c8118a3e, IsAlive = 1
[2022-10-05 15:01:38,436 I 3038 3069] accessor.cc:608: Received notification for node id = 4f8d80ae01f97ccd1a499c986c1aaee885dfa8d9bd4b47ec2eec1491, IsAlive = 1
[2022-10-05 15:01:38,436 I 3038 3069] accessor.cc:608: Received notification for node id = 63a49bec1c6a8b7f6d1291aeeb8463ca3b310945a88c21809b0638c9, IsAlive = 1
[2022-10-05 15:01:40,212 I 3038 3069] core_worker.cc:2975: Cancelling a running task run_graph() thread id: 79db2a31f887a7bcffffffffffffffffffffffff01000000
[2022-10-05 15:01:40,960 I 3038 3038] core_worker.cc:606: Exit signal received, this process will exit after all outstanding tasks have finished, exit_type=INTENDED_USER_EXIT
[2022-10-05 15:01:40,960 I 3038 3038] core_worker.cc:593: Disconnecting to the raylet.
[2022-10-05 15:01:40,960 I 3038 3038] raylet_client.cc:163: RayletClient::Disconnect, exit_type=INTENDED_USER_EXIT, exit_detail=Worker exits by an user request. max_call has reached, max_calls: 1, has creation_task_exception_pb_bytes=0
[2022-10-05 15:01:40,960 I 3038 3038] core_worker.cc:540: Shutting down a core worker.
[2022-10-05 15:01:40,960 I 3038 3038] core_worker.cc:564: Disconnecting a GCS client.
[2022-10-05 15:01:40,960 I 3038 3038] core_worker.cc:568: Waiting for joining a core worker io thread. If it hangs here, there might be deadlock or a high load in the core worker io service.
[2022-10-05 15:01:40,960 I 3038 3069] core_worker.cc:691: Core worker main io service stopped.
[2022-10-05 15:01:40,960 I 3038 3038] core_worker.cc:577: Core worker ready to be deallocated.
[2022-10-05 15:01:40,960 I 3038 3038] core_worker_process.cc:240: Task execution loop terminated. Removing the global worker.
[2022-10-05 15:01:40,960 I 3038 3038] core_worker.cc:531: Core worker is destructed
[2022-10-05 15:01:41,250 I 3038 3038] core_worker_process.cc:144: Destructing CoreWorkerProcessImpl. pid: 3038
[2022-10-05 15:01:41,250 I 3038 3038] io_service_pool.cc:47: IOServicePool is stopped.

The .err files contained only `:task_name:run_graph`.
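
To quantify the log churn, one could tally the per-worker log files inside the affected worker pod (e.g. via kubectl exec). This is a sketch, not from the issue; it assumes Ray's default per-session log directory at /tmp/ray/session_latest/logs:

Log tally sketch (illustrative)
from collections import Counter
from pathlib import Path

LOG_DIR = Path("/tmp/ray/session_latest/logs")  # assumed default Ray log directory

# Group files by the text before the first '-' (a rough way to see which log type piles up).
counts = Counter(p.name.split("-")[0] for p in LOG_DIR.iterdir() if p.is_file())
for prefix, count in counts.most_common():
    print(f"{prefix}: {count}")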

Reproduction script

See above for code and logs.

Anything else

No response

shrekris-anyscale commented 2 years ago

We had a hypothesis that the user was running into #539, which was fixed by #540. The user upgraded KubeRay to master (which contains #540), but they could still reproduce the error.

  1. They upgraded the CRDs using
kubectl replace -k "github.com/ray-project/kuberay/manifests/cluster-scope-resources?ref=${KUBERAY_VERSION}&timeout=90s"
  2. Then, they created a new namespace-scoped operator using the following config. It uses the image kuberay/operator:nightly:

    Operator Config
    apiVersion: v1
    kind: Namespace
    metadata:
      name: ray-serve-prototype-v2
      labels:
        app: ray
        owner: alexandre.gariepy
    ---
    #############################################################################################################
    # Operator-related
    #############################################################################################################
    apiVersion: v1
    kind: ServiceAccount
    metadata:
      labels:
        app.kubernetes.io/component: kuberay-operator
        app.kubernetes.io/name: kuberay
        app: ray
        owner: alexandre.gariepy
      name: kuberay-operator
      namespace: ray-serve-prototype-v2
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: Role
    metadata:
      creationTimestamp: null
      labels:
        app.kubernetes.io/component: kuberay-operator
        app.kubernetes.io/name: kuberay
        owner: alexandre.gariepy
      name: kuberay-operator
      namespace: ray-serve-prototype-v2
    rules:
    - apiGroups:
      - coordination.k8s.io
      resources:
      - leases
      verbs:
      - create
      - get
      - list
      - update
    - apiGroups:
      - ""
      resources:
      - events
      verbs:
      - create
      - delete
      - get
      - list
      - patch
      - update
      - watch
    - apiGroups:
      - ""
      resources:
      - pods
      verbs:
      - create
      - delete
      - get
      - list
      - patch
      - update
      - watch
    - apiGroups:
      - ""
      resources:
      - pods/status
      verbs:
      - create
      - delete
      - get
      - list
      - patch
      - update
      - watch
    - apiGroups:
      - ""
      resources:
      - serviceaccounts
      verbs:
      - create
      - delete
      - get
      - list
      - watch
    - apiGroups:
      - ""
      resources:
      - services
      verbs:
      - create
      - delete
      - get
      - list
      - patch
      - update
      - watch
    - apiGroups:
      - ""
      resources:
      - services/status
      verbs:
      - get
      - patch
      - update
    - apiGroups:
      - extensions
      resources:
      - ingresses
      verbs:
      - create
      - delete
      - get
      - list
      - patch
      - update
      - watch
    - apiGroups:
      - networking.k8s.io
      resources:
      - ingressclasses
      verbs:
      - get
      - list
      - watch
    - apiGroups:
      - networking.k8s.io
      resources:
      - ingresses
      verbs:
      - create
      - delete
      - get
      - list
      - patch
      - update
      - watch
    - apiGroups:
      - ray.io
      resources:
      - rayclusters
      verbs:
      - create
      - delete
      - get
      - list
      - patch
      - update
      - watch
    - apiGroups:
      - ray.io
      resources:
      - rayclusters/finalizer
      verbs:
      - update
    - apiGroups:
      - ray.io
      resources:
      - rayclusters/status
      verbs:
      - get
      - patch
      - update
    - apiGroups:
      - ray.io
      resources:
      - rayjobs
      verbs:
      - create
      - delete
      - get
      - list
      - patch
      - update
      - watch
    - apiGroups:
      - ray.io
      resources:
      - rayjobs/finalizer
      verbs:
      - update
    - apiGroups:
      - ray.io
      resources:
      - rayjobs/status
      verbs:
      - get
      - patch
      - update
    - apiGroups:
      - ray.io
      resources:
      - rayservices
      verbs:
      - create
      - delete
      - get
      - list
      - patch
      - update
      - watch
    - apiGroups:
      - ray.io
      resources:
      - rayservices/finalizers
      verbs:
      - update
    - apiGroups:
      - ray.io
      resources:
      - rayservices/status
      verbs:
      - get
      - patch
      - update
    - apiGroups:
      - rbac.authorization.k8s.io
      resources:
      - rolebindings
      verbs:
      - create
      - delete
      - get
      - list
      - watch
    - apiGroups:
      - rbac.authorization.k8s.io
      resources:
      - roles
      verbs:
      - create
      - delete
      - get
      - list
      - update
      - watch
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: Role
    metadata:
      labels:
        app.kubernetes.io/component: kuberay-operator
        app.kubernetes.io/name: kuberay
        owner: alexandre.gariepy
      name: kuberay-operator-leader-election
      namespace: ray-serve-prototype-v2
    rules:
    - apiGroups:
      - ""
      resources:
      - configmaps
      verbs:
      - get
      - list
      - watch
      - create
      - update
      - patch
      - delete
    - apiGroups:
      - ""
      resources:
      - configmaps/status
      verbs:
      - get
      - update
      - patch
    - apiGroups:
      - ""
      resources:
      - events
      verbs:
      - create
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: RoleBinding
    metadata:
      labels:
        app.kubernetes.io/component: kuberay-operator
        app.kubernetes.io/name: kuberay
        owner: alexandre.gariepy
      name: kuberay-operator
      namespace: ray-serve-prototype-v2
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: Role
      name: kuberay-operator
    subjects:
    - kind: ServiceAccount
      name: kuberay-operator
      namespace: ray-serve-prototype-v2
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: RoleBinding
    metadata:
      labels:
        app.kubernetes.io/component: kuberay-operator
        app.kubernetes.io/name: kuberay
        owner: alexandre.gariepy
      name: kuberay-operator-leader-election
      namespace: ray-serve-prototype-v2
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: Role
      name: kuberay-operator-leader-election
    subjects:
    - kind: ServiceAccount
      name: kuberay-operator
      namespace: ray-serve-prototype-v2
    ---
    apiVersion: v1
    kind: Service
    metadata:
      annotations:
        prometheus.io/path: /metrics
        prometheus.io/port: "8080"
        prometheus.io/scrape: "true"
      labels:
        app.kubernetes.io/component: kuberay-operator
        app.kubernetes.io/name: kuberay
        owner: alexandre.gariepy
      name: kuberay-operator
      namespace: ray-serve-prototype-v2
    spec:
      ports:
      - name: monitoring-port
        port: 8080
        targetPort: 8080
      selector:
        app.kubernetes.io/component: kuberay-operator
        app.kubernetes.io/name: kuberay
      type: ClusterIP
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      labels:
        app.kubernetes.io/component: kuberay-operator
        app.kubernetes.io/name: kuberay
        app: ray
        owner: alexandre.gariepy
      name: kuberay-operator
      namespace: ray-serve-prototype-v2
    spec:
      replicas: 1
      selector:
        matchLabels:
          app.kubernetes.io/component: kuberay-operator
          app.kubernetes.io/name: kuberay
      template:
        metadata:
          labels:
            app.kubernetes.io/component: kuberay-operator
            app.kubernetes.io/name: kuberay
            owner: alexandre.gariepy
        spec:
          containers:
          - command:
            - /manager
            - -watch-namespace
            - ray-serve-prototype-v2
            image: kuberay/operator:nightly
            livenessProbe:
              failureThreshold: 5
              httpGet:
                path: /metrics
                port: http
              initialDelaySeconds: 10
              periodSeconds: 5
            name: kuberay-operator
            ports:
            - containerPort: 8080
              name: http
              protocol: TCP
            readinessProbe:
              failureThreshold: 5
              httpGet:
                path: /metrics
                port: http
              initialDelaySeconds: 10
              periodSeconds: 5
            resources:
              limits:
                cpu: 100m
                memory: 100Mi
              requests:
                cpu: 100m
                memory: 50Mi
            securityContext:
              allowPrivilegeEscalation: false
          securityContext:
            runAsNonRoot: true
          serviceAccountName: kuberay-operator
          terminationGracePeriodSeconds: 10
    
  3. Then, they deployed a RayService using the exact same config as the issue body:

    (The RayService config is omitted here; it is identical to the RayService config shown in the issue description above.)

Note: In this case, the issue occurred the first time the user deployed the RayService, not when they upgraded an existing RayService.

kevin85421 commented 1 year ago

Hi @shrekris-anyscale, does this issue still persist?

shrekris-anyscale commented 1 year ago

@sihanwang41 is this issue resolved by your change in #1014?

sihanwang41 commented 1 year ago

> @sihanwang41 is this issue resolved by your change in #1014?

yes!

shrekris-anyscale commented 1 year ago

Great! @kevin85421 this issue should be resolved, so I'm closing it.