ray-project / kuberay

A toolkit to run Ray applications on Kubernetes
Apache License 2.0

[Bug] Adding GCS FT breaks incremental RayService deployments #1296

Closed: smit-kiri closed this issue 1 year ago

smit-kiri commented 1 year ago

Search before asking

KubeRay Component

ray-operator

What happened + What you expected to happen

When a new custom image is deployed to a RayService, KubeRay spins up a new RayCluster. Because GCS FT is enabled, the new RayCluster connects to the same Redis as the currently running RayCluster, sees the existing Serve state, and concludes that all deployments are already running and healthy. KubeRay therefore cuts all traffic over to the new RayCluster and terminates the old one, even though no deployments were actually started in the new cluster. The new cluster then marks everything unhealthy and redeploys from scratch, so there is downtime until everything spins up again.
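To make the mechanism concrete, these are the fields from the reproduction config below that cause both clusters to share state (the values shown are the illustrative ones from that config, not a recommendation):

```yaml
metadata:
  annotations:
    ray.io/ft-enabled: 'true'
    # Fixed namespace: the old and the new RayCluster both read and write
    # the Serve state stored under this key prefix in Redis.
    ray.io/external-storage-namespace: rayservice-sample
spec:
  rayClusterConfig:
    headGroupSpec:
      template:
        spec:
          containers:
            - name: ray-head
              env:
                # Both clusters point at the same external Redis instance.
                - name: RAY_REDIS_ADDRESS
                  value: redis://xxxxx.ng.0001.use1.cache.amazonaws.com:6379
```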

Reproduction script

demo.py

```python
import time

from ray import serve
from ray.serve.drivers import DAGDriver


@serve.deployment(name="model1")
class Model1:
    def __init__(self):
        # Simulate an init method
        time.sleep(60)

    async def __call__(self, http_request):
        data: dict = await http_request.json()
        data["model"] = "model1"
        return data


@serve.deployment(name="model2")
class Model2:
    def __init__(self):
        # Simulate an init method
        time.sleep(60)

    async def __call__(self, http_request):
        data: dict = await http_request.json()
        data["model"] = "model2"
        return data


driver = DAGDriver.bind({"/model1": Model1.bind(), "/model2": Model2.bind()})  # type: ignore
```
Dockerfile

```Dockerfile
FROM rayproject/ray:2.6.1-py310 as common

ENV WORKING_DIR /home/ray/models
WORKDIR ${WORKING_DIR}

ADD ./demo.py ${WORKING_DIR}
```
rayservice_config.yaml

```yaml
apiVersion: ray.io/v1alpha1
kind: RayService
metadata:
  name: rayservice-sample
  annotations:
    ray.io/ft-enabled: 'true'
    ray.io/external-storage-namespace: rayservice-sample
spec:
  serviceUnhealthySecondThreshold: 900 # Config for the health check threshold for service. Default value is 60.
  deploymentUnhealthySecondThreshold: 900 # Config for the health check threshold for deployments. Default value is 60.
  serveConfig:
    importPath: demo:driver
    deployments:
      - name: model1
        numReplicas: 1
      - name: model2
        numReplicas: 1
  rayClusterConfig:
    rayVersion: 2.6.1 # should match the Ray version in the image of the containers
    enableInTreeAutoscaling: true
    ###################### headGroupSpecs #################################
    # Ray head pod template.
    headGroupSpec:
      # the following params are used to complete the ray start: ray start --head --block --redis-port=6379 ...
      rayStartParams:
        port: '6379' # should match container port named gcs-server
        dashboard-host: 0.0.0.0
      # pod template
      template:
        spec:
          containers:
            - name: ray-head
              image: DOCKER_IMAGE_URL
              imagePullPolicy: Always
              env:
                - name: RAY_LOG_TO_STDERR
                  value: '1'
                - name: RAY_REDIS_ADDRESS
                  value: redis://xxxxx.ng.0001.use1.cache.amazonaws.com:6379
              resources:
                limits:
                  cpu: 2
                  memory: 8Gi
                requests:
                  cpu: 2
                  memory: 8Gi
              ports:
                - containerPort: 6379
                  name: gcs-server
                - containerPort: 8265 # Ray dashboard
                  name: dashboard
                - containerPort: 10001
                  name: client
                - containerPort: 8000
                  name: serve
    workerGroupSpecs:
      # the pod replicas in this group typed worker
      - replicas: 1
        minReplicas: 1
        maxReplicas: 15
        # logical group name, for this called small-group, also can be functional
        groupName: small-group
        rayStartParams: {}
        # pod template
        template:
          spec:
            containers:
              - name: ray-worker # must consist of lower case alphanumeric characters or '-', and must start and end with an alphanumeric character (e.g. 'my-name', or '123-abc')
                image: DOCKER_IMAGE_URL
                imagePullPolicy: Always
                lifecycle:
                  preStop:
                    exec:
                      command: [/bin/sh, -c, ray stop]
                resources:
                  limits:
                    cpu: 2
                    memory: 8Gi
                  requests:
                    cpu: 2
                    memory: 8Gi
```

Build the Docker image and set DOCKER_IMAGE_URL in rayservice_config.yaml to point to it. Apply the changes with kubectl apply -f rayservice_config.yaml. Once everything spins up, view the deployment statuses on the dashboard using kubectl port-forward service/rayservice-sample-head-svc 8265:8265.

Make a small change to demo.py, such as adding a comment. Build a new image and apply the same rayservice_config.yaml with the new image URL. Monitor the Serve dashboard while the changes roll out. When the old Ray pods start Terminating, the deployments turn unhealthy; the entire application then disappears and is redeployed from scratch.

(This was tested on AWS EKS, with the image hosted on ECR and a non-clustered ElastiCache Redis.)

Anything else

Using Ray 2.6.1 and KubeRay 0.6.0

Are you willing to submit a PR?

smit-kiri commented 1 year ago

A workaround is to either not use ray.io/external-storage-namespace or to change it on every deploy. I will keep this issue open, since the documentation is a little misleading and should call out this caveat: https://ray-project.github.io/kuberay/guidance/gcs-ft/#external-storage-namespace
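A sketch of the second workaround, assuming the namespace is suffixed with a value that changes on every deploy (for example the image tag; any unique per-deploy value would work):

```yaml
metadata:
  name: rayservice-sample
  annotations:
    ray.io/ft-enabled: 'true'
    # Hypothetical per-deploy suffix: changing this value on each deploy means
    # the new RayCluster starts with a fresh GCS namespace instead of adopting
    # the previous cluster's Serve state from Redis.
    ray.io/external-storage-namespace: rayservice-sample-<image-tag>
```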

kevin85421 commented 1 year ago

Update the doc in https://github.com/ray-project/ray/pull/39450.

kevin85421 commented 1 year ago

I set torch in runtime_env, so the Ray Serve applications need several minutes to become ready after the RayCluster is created, because torch takes several minutes to install. That is, "Experiment 1" terminates the old RayCluster without waiting for the new RayCluster to be ready.
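For reference, a minimal sketch of what adding torch to the serve config's runtime environment could look like (this is an assumption about the test setup, not the exact config used in that experiment):

```yaml
serveConfig:
  importPath: demo:driver
  # Assumed runtime_env: installing torch at deployment start-up delays
  # readiness by several minutes, which exposes the premature cutover.
  runtimeEnv: |
    pip:
      - torch
  deployments:
    - name: model1
      numReplicas: 1
    - name: model2
      numReplicas: 1
```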

kevin85421 commented 1 year ago

I will close this issue after https://github.com/ray-project/ray/pull/39525 is merged.