ray-project / kuberay

A toolkit to run Ray applications on Kubernetes

[Bug] [RayService] Cannot move from single app to multi-app without downtime if using GCS FT #1297

Closed · smit-kiri closed this issue 1 year ago

smit-kiri commented 1 year ago

Search before asking

KubeRay Component

ray-operator

What happened + What you expected to happen

I'm trying to move all of our workloads from a single-application to a multi-application RayService with the release of KubeRay v0.6.0, and it does not seem possible to do so without downtime when GCS FT is enabled. I see the following error:

```
ray.serve.exceptions.RayServeException: You are trying to deploy a multi-application config, however a single-application
config has been deployed to the current Serve instance already. Mixing single-app and multi-app is not allowed. Please either
redeploy using the single-application config format `ServeApplicationSchema`, or shutdown and restart Serve to submit a
multi-app config of format `ServeDeploySchema`. If you are using the REST API, you can submit a multi-app config to the
multi-app API endpoint `/api/serve/applications/`.
```

Reproduction script

Single application:

demo.py

```python
import time

from ray import serve
from ray.serve.drivers import DAGDriver


@serve.deployment(name="model1")
class Model1:
    def __init__(self):
        # Simulate an init method
        time.sleep(60)

    async def __call__(self, http_request):
        data: dict = await http_request.json()
        data["model"] = "model1"
        return data


@serve.deployment(name="model2")
class Model2:
    def __init__(self):
        # Simulate an init method
        time.sleep(60)

    async def __call__(self, http_request):
        data: dict = await http_request.json()
        data["model"] = "model2"
        return data


driver = DAGDriver.bind({"/model1": Model1.bind(), "/model2": Model2.bind()})  # type: ignore
```
Dockerfile

```Dockerfile
FROM rayproject/ray:2.6.1-py310 as common

ENV WORKING_DIR /home/ray/models
WORKDIR ${WORKING_DIR}

ADD ./demo.py ${WORKING_DIR}
```
rayservice_config.yaml

```yaml
apiVersion: ray.io/v1alpha1
kind: RayService
metadata:
  name: rayservice-sample
  annotations:
    ray.io/ft-enabled: 'true'
    ray.io/external-storage-namespace: rayservice-sample
spec:
  serviceUnhealthySecondThreshold: 900 # Config for the health check threshold for service. Default value is 60.
  deploymentUnhealthySecondThreshold: 900 # Config for the health check threshold for deployments. Default value is 60.
  serveConfig:
    importPath: demo:driver
    deployments:
      - name: model1
        numReplicas: 1
      - name: model2
        numReplicas: 1
  rayClusterConfig:
    rayVersion: 2.6.1 # should match the Ray version in the image of the containers
    enableInTreeAutoscaling: true
    ######################headGroupSpecs#################################
    # Ray head pod template.
    headGroupSpec:
      # the following params are used to complete the ray start: ray start --head --block --redis-port=6379 ...
      rayStartParams:
        port: '6379' # should match container port named gcs-server
        dashboard-host: 0.0.0.0
      # pod template
      template:
        spec:
          containers:
            - name: ray-head
              image: DOCKER_IMAGE_URL
              imagePullPolicy: Always
              env:
                - name: RAY_LOG_TO_STDERR
                  value: '1'
                - name: RAY_REDIS_ADDRESS
                  value: redis://xxxxx.ng.0001.use1.cache.amazonaws.com:6379
              resources:
                limits:
                  cpu: 2
                  memory: 8Gi
                requests:
                  cpu: 2
                  memory: 8Gi
              ports:
                - containerPort: 6379
                  name: gcs-server
                - containerPort: 8265 # Ray dashboard
                  name: dashboard
                - containerPort: 10001
                  name: client
                - containerPort: 8000
                  name: serve
    workerGroupSpecs:
      # the pod replicas in this group typed worker
      - replicas: 1
        minReplicas: 1
        maxReplicas: 15
        # logical group name, for this called small-group, also can be functional
        groupName: small-group
        rayStartParams: {}
        # pod template
        template:
          spec:
            containers:
              - name: ray-worker # must consist of lower case alphanumeric characters or '-', and must start and end with an alphanumeric character (e.g. 'my-name', or '123-abc')
                image: DOCKER_IMAGE_URL
                imagePullPolicy: Always
                lifecycle:
                  preStop:
                    exec:
                      command: [/bin/sh, -c, ray stop]
                resources:
                  limits:
                    cpu: 2
                    memory: 8Gi
                  requests:
                    cpu: 2
                    memory: 8Gi
```

Multi-application:

demo1.py

```python
import time

from ray import serve


@serve.deployment(name="model1")
class Model1:
    def __init__(self):
        # Simulate an init method
        time.sleep(60)

    async def __call__(self, http_request):
        # Another dummy change
        data: dict = await http_request.json()
        data["model"] = "model1"
        return data


model1 = Model1.bind()  # type: ignore
```
demo2.py

```python
import time

from ray import serve


@serve.deployment(name="model2")
class Model2:
    def __init__(self):
        # Simulate an init method
        time.sleep(60)

    async def __call__(self, http_request):
        # Dummy change
        data: dict = await http_request.json()
        data["model"] = "model2"
        return data


model2 = Model2.bind()  # type: ignore
```
Dockerfile

```Dockerfile
FROM rayproject/ray:2.6.1-py310 as common

ENV WORKING_DIR /home/ray/models
WORKDIR ${WORKING_DIR}

ADD ./model_deployments/demo1.py ${WORKING_DIR}
ADD ./model_deployments/demo2.py ${WORKING_DIR}
```
rayservice_config.yaml

```yaml
apiVersion: ray.io/v1alpha1
kind: RayService
metadata:
  name: rayservice-sample
  annotations:
    ray.io/ft-enabled: 'true'
    ray.io/external-storage-namespace: rayservice-sample
spec:
  serviceUnhealthySecondThreshold: 900 # Config for the health check threshold for service. Default value is 60.
  deploymentUnhealthySecondThreshold: 900 # Config for the health check threshold for deployments. Default value is 60.
  serveConfigV2: |
    applications:
      - name: app1
        route_prefix: "/model1"
        import_path: "demo1:model1"
        deployments:
          - name: "model1"
            num_replicas: 1
      - name: app2
        route_prefix: "/model2"
        import_path: "demo2:model2"
        deployments:
          - name: "model2"
            num_replicas: 1
  rayClusterConfig:
    rayVersion: 2.6.1 # should match the Ray version in the image of the containers
    enableInTreeAutoscaling: true
    ######################headGroupSpecs#################################
    # Ray head pod template.
    headGroupSpec:
      # the following params are used to complete the ray start: ray start --head --block --redis-port=6379 ...
      rayStartParams:
        port: '6379' # should match container port named gcs-server
        dashboard-host: 0.0.0.0
      # pod template
      template:
        spec:
          containers:
            - name: ray-head
              image: DOCKER_IMAGE_URL
              imagePullPolicy: Always
              env:
                - name: RAY_LOG_TO_STDERR
                  value: '1'
                - name: RAY_REDIS_ADDRESS
                  value: redis://xxxxx.ng.0001.use1.cache.amazonaws.com:6379
              resources:
                limits:
                  cpu: 2
                  memory: 8Gi
                requests:
                  cpu: 2
                  memory: 8Gi
              ports:
                - containerPort: 6379
                  name: gcs-server
                - containerPort: 8265 # Ray dashboard
                  name: dashboard
                - containerPort: 10001
                  name: client
                - containerPort: 8000
                  name: serve
    workerGroupSpecs:
      # the pod replicas in this group typed worker
      - replicas: 1
        minReplicas: 1
        maxReplicas: 15
        # logical group name, for this called small-group, also can be functional
        groupName: small-group
        rayStartParams: {}
        # pod template
        template:
          spec:
            containers:
              - name: ray-worker # must consist of lower case alphanumeric characters or '-', and must start and end with an alphanumeric character (e.g. 'my-name', or '123-abc')
                image: DOCKER_IMAGE_URL
                imagePullPolicy: Always
                lifecycle:
                  preStop:
                    exec:
                      command: [/bin/sh, -c, ray stop]
                resources:
                  limits:
                    cpu: 2
                    memory: 8Gi
                  requests:
                    cpu: 2
                    memory: 8Gi
```

Deploy the single-application code first, then try to deploy the multi-application code. You should see the error above.
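For quick comparison, this is the Serve config block condensed from the two full specs above; it is the API V1 to API V2 switch that the error message refers to:

```yaml
# Before: single-application config (API V1)
spec:
  serveConfig:
    importPath: demo:driver
    deployments:
      - name: model1
        numReplicas: 1
      - name: model2
        numReplicas: 1
---
# After: multi-application config (API V2)
spec:
  serveConfigV2: |
    applications:
      - name: app1
        route_prefix: "/model1"
        import_path: "demo1:model1"
        deployments:
          - name: "model1"
            num_replicas: 1
      - name: app2
        route_prefix: "/model2"
        import_path: "demo2:model2"
        deployments:
          - name: "model2"
            num_replicas: 1
```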

Anything else

A workaround here is to temporarily remove GCS FT, deploy the multi-application config, and then add GCS FT back in. If you don't reboot the Redis instance, you run into the same error again when trying to add GCS FT back in.
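As a rough sketch of that GCS FT toggle, using the annotations from the configs above (the exact values here are illustrative):

```yaml
# Sketch only: disable GCS FT on the RayService while switching to the
# multi-application config, then restore the original annotations after
# the Redis instance has been rebooted.
metadata:
  name: rayservice-sample
  annotations:
    ray.io/ft-enabled: 'false'  # was 'true'; set back to 'true' after the migration
    # ray.io/external-storage-namespace: rayservice-sample
```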

Are you willing to submit a PR?

smit-kiri commented 1 year ago

This might be related to setting `ray.io/external-storage-namespace` explicitly.

kevin85421 commented 1 year ago

Thanks @smit-kiri for reporting this issue! This does not seem to be related to GCS FT to me. I can reproduce the issue by:

Ray Serve does not seem to allow in-place upgrades between API V1 (single-app) and API V2 (multi-app). A workaround is to update not only serveConfig / serveConfigV2 but also rayVersion, which has no effect when the Ray version is 2.0.0 or later, for example by changing it to 2.100.0. This triggers the preparation of a new RayCluster instead of an in-place update.
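A minimal sketch of that workaround against the sample config in this issue (2.100.0 is just a dummy value used to force a new cluster, as described above):

```yaml
spec:
  # Switch from the single-app serveConfig to the multi-app serveConfigV2 ...
  serveConfigV2: |
    applications:
      - name: app1
        route_prefix: "/model1"
        import_path: "demo1:model1"
  rayClusterConfig:
    # ... and change rayVersion at the same time. The field has no effect for
    # Ray >= 2.0.0, but changing it makes the operator prepare a new RayCluster
    # instead of updating the Serve config in place.
    rayVersion: 2.100.0  # was 2.6.1
```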

smit-kiri commented 1 year ago

Thanks @kevin85421! I was able to get around it by setting a different `ray.io/external-storage-namespace`.
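For reference, this is roughly what that change looks like in the RayService metadata (`rayservice-sample-v2` is only an example value, not taken from the original report):

```yaml
metadata:
  name: rayservice-sample
  annotations:
    ray.io/ft-enabled: 'true'
    # Pointing the upgraded cluster at a fresh storage namespace avoids reusing
    # the single-app Serve state stored in Redis under the old namespace.
    ray.io/external-storage-namespace: rayservice-sample-v2  # example value; was rayservice-sample
```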

kevin85421 commented 1 year ago

> Thanks @kevin85421! I was able to get around it by setting a different `ray.io/external-storage-namespace`.

Cool. I am still a bit confused. Do you only update serveConfig / serveConfigV2, or do you also update other fields? In the former case, it will only update the serve configurations in-place, while the latter case will trigger a zero-downtime upgrade. In my understanding, the former case will always report the exception ray.serve.exceptions.RayServeException whenever you upgrade from API V1 to API V2. If you trigger a zero-downtime upgrade, the different ray.io/external-storage-namespace solution makes sense to me.

smit-kiri commented 1 year ago

We triggered a zero-downtime upgrade by updating the Docker image.
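For example, a new image under rayClusterConfig is enough to make the operator prepare a pending RayCluster and switch traffic once it is ready (DOCKER_IMAGE_URL and the tag are placeholders, as in the configs above):

```yaml
# Sketch: bumping the image in the head and worker pod templates triggers a
# zero-downtime upgrade (a new RayCluster is prepared, then traffic switches).
rayClusterConfig:
  headGroupSpec:
    template:
      spec:
        containers:
          - name: ray-head
            image: DOCKER_IMAGE_URL:new-tag  # placeholder tag
  workerGroupSpecs:
    - groupName: small-group
      template:
        spec:
          containers:
            - name: ray-worker
              image: DOCKER_IMAGE_URL:new-tag  # placeholder tag
```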

kevin85421 commented 1 year ago

Updated the doc: https://github.com/ray-project/ray/pull/38647/commits/ec19d1556f06587f82b7d554ad71cfc6cfda566a.

kevin85421 commented 1 year ago

https://github.com/ray-project/ray/pull/38647 has been merged. Closing this issue.