Closed: smit-kiri closed this issue 1 year ago.
A workaround here is to either not use `ray.io/external-storage-namespace` or change it at every deploy.
I will still keep this issue open, since the documentation is a little misleading and should call out this caveat: https://ray-project.github.io/kuberay/guidance/gcs-ft/#external-storage-namespace
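For example, the second option just means bumping the annotation value on every deploy. A minimal sketch (the `-v2` suffix is an illustrative value, not from the original manifest):

```yaml
apiVersion: ray.io/v1alpha1
kind: RayService
metadata:
  name: rayservice-sample
  annotations:
    ray.io/ft-enabled: 'true'
    # Change this value on every deploy so the new RayCluster does not
    # pick up the previous cluster's GCS state from Redis.
    ray.io/external-storage-namespace: rayservice-sample-v2
```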
Update the doc in https://github.com/ray-project/ray/pull/39450.
Experiment 1: Create a RayService with `ray.io/external-storage-namespace`
```sh
# Step 0: Install KubeRay v0.6.0
# Step 1: Create a RayService with `ray.io/external-storage-namespace`
#         https://gist.github.com/kevin85421/0da88f8a2df905cc2bfbde0b9897878f
# Step 2: Update `rayVersion` to `2.6.100` to trigger zero-downtime upgrade.
```
Experiment 2: Create a RayService without `ray.io/external-storage-namespace`
```sh
# Step 0: Install KubeRay v0.6.0
# Step 1: Create a RayService without `ray.io/external-storage-namespace`
#         https://gist.github.com/kevin85421/97d17bce92b5ca08902064205654667a
# Step 2: Update `rayVersion` to `2.6.100` to trigger zero-downtime upgrade.
```
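Step 2 in both experiments is the same edit: bump `rayVersion` in the RayService spec and re-apply it. Roughly (a sketch of just the changed field, matching the manifest in the reproduction script below):

```yaml
spec:
  rayClusterConfig:
    # Bumping the Ray version triggers KubeRay's zero-downtime upgrade:
    # a new RayCluster is created and traffic is switched over to it.
    rayVersion: 2.6.100
```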
I set `torch` in `runtime_env`, so the Ray Serve applications need several minutes to become ready after the RayCluster is created, because `torch` takes several minutes to install. That is, "Experiment 1" terminates the old RayCluster without waiting for the new RayCluster to be ready.
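For context, the `torch` dependency was declared via `runtime_env`. A sketch of what that looks like in a `serveConfigV2` string (the application name and exact layout here are assumptions; the gists above are the source of truth):

```yaml
serveConfigV2: |
  applications:
    - name: default
      import_path: demo:driver
      runtime_env:
        # pip-installing torch takes several minutes, which is what
        # delays readiness of the Serve apps on the new RayCluster.
        pip:
          - torch
```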
I will close this issue after https://github.com/ray-project/ray/pull/39525 is merged.
Search before asking
KubeRay Component
ray-operator
What happened + What you expected to happen
When deploying a new custom image in RayService, KubeRay spins up a new RayCluster. This RayCluster then connects to the same Redis as the currently running RayCluster. It thinks that all deployments are already running and everything is healthy, so it cuts all traffic over to the new RayCluster and terminates the old one. However, it never spun up any new deployments in the new RayCluster, so now it marks everything unhealthy and spins everything back up. We have downtime until everything spins up again.
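Concretely, the new RayCluster shares GCS state with the old one because both point at the same Redis instance and the same fixed storage namespace. The two relevant settings (excerpted from the full manifest in the reproduction script below) are:

```yaml
metadata:
  annotations:
    ray.io/ft-enabled: 'true'
    # Fixed value: the new RayCluster reads the old cluster's keys from Redis.
    ray.io/external-storage-namespace: rayservice-sample
# ... and on the head container:
env:
  - name: RAY_REDIS_ADDRESS
    value: redis://xxxxx.ng.0001.use1.cache.amazonaws.com:6379
```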
Reproduction script
demo.py
```python
import time

from ray import serve
from ray.serve.drivers import DAGDriver


@serve.deployment(name="model1")
class Model1:
    def __init__(self):
        # Simulate an init method
        time.sleep(60)

    async def __call__(self, http_request):
        data: dict = await http_request.json()
        data["model"] = "model1"
        return data


@serve.deployment(name="model2")
class Model2:
    def __init__(self):
        # Simulate an init method
        time.sleep(60)

    async def __call__(self, http_request):
        data: dict = await http_request.json()
        data["model"] = "model2"
        return data


driver = DAGDriver.bind({"/model1": Model1.bind(), "/model2": Model2.bind()})  # type: ignore
```
Dockerfile
```Dockerfile
FROM rayproject/ray:2.6.1-py310 as common

ENV WORKING_DIR /home/ray/models
WORKDIR ${WORKING_DIR}

ADD ./demo.py ${WORKING_DIR}
```
rayservice_config.yaml
```yaml
apiVersion: ray.io/v1alpha1
kind: RayService
metadata:
  name: rayservice-sample
  annotations:
    ray.io/ft-enabled: 'true'
    ray.io/external-storage-namespace: rayservice-sample
spec:
  serviceUnhealthySecondThreshold: 900 # Config for the health check threshold for service. Default value is 60.
  deploymentUnhealthySecondThreshold: 900 # Config for the health check threshold for deployments. Default value is 60.
  serveConfig:
    importPath: demo:driver
    deployments:
      - name: model1
        numReplicas: 1
      - name: model2
        numReplicas: 1
  rayClusterConfig:
    rayVersion: 2.6.1 # should match the Ray version in the image of the containers
    enableInTreeAutoscaling: true
    ###################### headGroupSpecs ######################
    # Ray head pod template.
    headGroupSpec:
      # the following params are used to complete the ray start: ray start --head --block --redis-port=6379 ...
      rayStartParams:
        port: '6379' # should match container port named gcs-server
        dashboard-host: 0.0.0.0
      # pod template
      template:
        spec:
          containers:
            - name: ray-head
              image: DOCKER_IMAGE_URL
              imagePullPolicy: Always
              env:
                - name: RAY_LOG_TO_STDERR
                  value: '1'
                - name: RAY_REDIS_ADDRESS
                  value: redis://xxxxx.ng.0001.use1.cache.amazonaws.com:6379
              resources:
                limits:
                  cpu: 2
                  memory: 8Gi
                requests:
                  cpu: 2
                  memory: 8Gi
              ports:
                - containerPort: 6379
                  name: gcs-server
                - containerPort: 8265 # Ray dashboard
                  name: dashboard
                - containerPort: 10001
                  name: client
                - containerPort: 8000
                  name: serve
    workerGroupSpecs:
      # the pod replicas in this group typed worker
      - replicas: 1
        minReplicas: 1
        maxReplicas: 15
        # logical group name, for this called small-group, also can be functional
        groupName: small-group
        rayStartParams: {}
        # pod template
        template:
          spec:
            containers:
              - name: ray-worker # must consist of lower case alphanumeric characters or '-', and must start and end with an alphanumeric character (e.g. 'my-name', or '123-abc')
                image: DOCKER_IMAGE_URL
                imagePullPolicy: Always
                lifecycle:
                  preStop:
                    exec:
                      command: [/bin/sh, -c, ray stop]
                resources:
                  limits:
                    cpu: 2
                    memory: 8Gi
                  requests:
                    cpu: 2
                    memory: 8Gi
```
Build the Docker image and modify `DOCKER_IMAGE_URL` to point to the image in `rayservice_config.yaml`. Apply the changes with `kubectl apply -f rayservice_config.yaml`. Once everything spins up, see the deployment statuses on the dashboard using `kubectl port-forward service/rayservice-sample-head-svc 8265:8265`.

Make a small change to `demo.py`, like adding a comment to the code. Build the new image and apply the same `rayservice_config.yaml` but with the new image URL. Monitor the Serve dashboard while the new changes are deployed. When the old Ray pods start `Terminating`, you will notice the deployments turn unhealthy. Then the entire application disappears and a new application is deployed.

(This script was tested on AWS EKS, with the image present on ECR and using non-clustered ElastiCache Redis.)
Anything else
Using Ray `2.6.1` and KubeRay `0.6.0`
Are you willing to submit a PR?