ray-project / kuberay

A toolkit to run Ray applications on Kubernetes

[Bug] [RayService] Cannot move from single app to multi-app without downtime if using GCS FT #1297

Closed · smit-kiri closed this issue 1 year ago

smit-kiri commented 1 year ago

Search before asking

KubeRay Component

ray-operator

What happened + What you expected to happen

I'm trying to move all of our workloads from a single-application to a multi-application RayService with the release of KubeRay v0.6.0, and it does not seem possible to do so without downtime when GCS FT is enabled. I see the following error:

```
ray.serve.exceptions.RayServeException: You are trying to deploy a multi-application config, however a single-application
config has been deployed to the current Serve instance already. Mixing single-app and multi-app is not allowed. Please either
redeploy using the single-application config format `ServeApplicationSchema`, or shutdown and restart Serve to submit a
multi-app config of format `ServeDeploySchema`. If you are using the REST API, you can submit a multi-app config to the
multi-app API endpoint `/api/serve/applications/`.
```

Reproduction script

Single application:

demo.py

```python
import time

from ray import serve
from ray.serve.drivers import DAGDriver


@serve.deployment(name="model1")
class Model1:
    def __init__(self):
        # Simulate an init method
        time.sleep(60)

    async def __call__(self, http_request):
        data: dict = await http_request.json()
        data["model"] = "model1"
        return data


@serve.deployment(name="model2")
class Model2:
    def __init__(self):
        # Simulate an init method
        time.sleep(60)

    async def __call__(self, http_request):
        data: dict = await http_request.json()
        data["model"] = "model2"
        return data


driver = DAGDriver.bind({"/model1": Model1.bind(), "/model2": Model2.bind()})  # type: ignore
```
Dockerfile

```Dockerfile
FROM rayproject/ray:2.6.1-py310 as common

ENV WORKING_DIR /home/ray/models
WORKDIR ${WORKING_DIR}

ADD ./demo.py ${WORKING_DIR}
```
rayservice_config.yaml

```yaml
apiVersion: ray.io/v1alpha1
kind: RayService
metadata:
  name: rayservice-sample
  annotations:
    ray.io/ft-enabled: 'true'
    ray.io/external-storage-namespace: rayservice-sample
spec:
  serviceUnhealthySecondThreshold: 900 # Config for the health check threshold for service. Default value is 60.
  deploymentUnhealthySecondThreshold: 900 # Config for the health check threshold for deployments. Default value is 60.
  serveConfig:
    importPath: demo:driver
    deployments:
      - name: model1
        numReplicas: 1
      - name: model2
        numReplicas: 1
  rayClusterConfig:
    rayVersion: 2.6.1 # should match the Ray version in the image of the containers
    enableInTreeAutoscaling: true
    ######################headGroupSpecs#################################
    # Ray head pod template.
    headGroupSpec:
      # the following params are used to complete the ray start: ray start --head --block --redis-port=6379 ...
      rayStartParams:
        port: '6379' # should match container port named gcs-server
        dashboard-host: 0.0.0.0
      # pod template
      template:
        spec:
          containers:
            - name: ray-head
              image: DOCKER_IMAGE_URL
              imagePullPolicy: Always
              env:
                - name: RAY_LOG_TO_STDERR
                  value: '1'
                - name: RAY_REDIS_ADDRESS
                  value: redis://xxxxx.ng.0001.use1.cache.amazonaws.com:6379
              resources:
                limits:
                  cpu: 2
                  memory: 8Gi
                requests:
                  cpu: 2
                  memory: 8Gi
              ports:
                - containerPort: 6379
                  name: gcs-server
                - containerPort: 8265 # Ray dashboard
                  name: dashboard
                - containerPort: 10001
                  name: client
                - containerPort: 8000
                  name: serve
    workerGroupSpecs:
      # the pod replicas in this group typed worker
      - replicas: 1
        minReplicas: 1
        maxReplicas: 15
        # logical group name, for this called small-group, also can be functional
        groupName: small-group
        rayStartParams: {}
        # pod template
        template:
          spec:
            containers:
              - name: ray-worker # must consist of lower case alphanumeric characters or '-', and must start and end with an alphanumeric character (e.g. 'my-name', or '123-abc')
                image: DOCKER_IMAGE_URL
                imagePullPolicy: Always
                lifecycle:
                  preStop:
                    exec:
                      command: [/bin/sh, -c, ray stop]
                resources:
                  limits:
                    cpu: 2
                    memory: 8Gi
                  requests:
                    cpu: 2
                    memory: 8Gi
```

Multi-application:

demo1.py

```python
import time

from ray import serve


@serve.deployment(name="model1")
class Model1:
    def __init__(self):
        # Simulate an init method
        time.sleep(60)

    async def __call__(self, http_request):
        # Another dummy change
        data: dict = await http_request.json()
        data["model"] = "model1"
        return data


model1 = Model1.bind()  # type: ignore
```
demo2.py

```python
import time

from ray import serve


@serve.deployment(name="model2")
class Model2:
    def __init__(self):
        # Simulate an init method
        time.sleep(60)

    async def __call__(self, http_request):
        # Dummy change
        data: dict = await http_request.json()
        data["model"] = "model2"
        return data


model2 = Model2.bind()  # type: ignore
```
Dockerfile

```Dockerfile
FROM rayproject/ray:2.6.1-py310 as common

ENV WORKING_DIR /home/ray/models
WORKDIR ${WORKING_DIR}

ADD ./model_deployments/demo1.py ${WORKING_DIR}
ADD ./model_deployments/demo2.py ${WORKING_DIR}
```
rayservice_config.yaml

```yaml
apiVersion: ray.io/v1alpha1
kind: RayService
metadata:
  name: rayservice-sample
  annotations:
    ray.io/ft-enabled: 'true'
    ray.io/external-storage-namespace: rayservice-sample
spec:
  serviceUnhealthySecondThreshold: 900 # Config for the health check threshold for service. Default value is 60.
  deploymentUnhealthySecondThreshold: 900 # Config for the health check threshold for deployments. Default value is 60.
  serveConfigV2: |
    applications:
      - name: app1
        route_prefix: "/model1"
        import_path: "demo1:model1"
        deployments:
          - name: "model1"
            num_replicas: 1
      - name: app2
        route_prefix: "/model2"
        import_path: "demo2:model2"
        deployments:
          - name: "model2"
            num_replicas: 1
  rayClusterConfig:
    rayVersion: 2.6.1 # should match the Ray version in the image of the containers
    enableInTreeAutoscaling: true
    ######################headGroupSpecs#################################
    # Ray head pod template.
    headGroupSpec:
      # the following params are used to complete the ray start: ray start --head --block --redis-port=6379 ...
      rayStartParams:
        port: '6379' # should match container port named gcs-server
        dashboard-host: 0.0.0.0
      # pod template
      template:
        spec:
          containers:
            - name: ray-head
              image: DOCKER_IMAGE_URL
              imagePullPolicy: Always
              env:
                - name: RAY_LOG_TO_STDERR
                  value: '1'
                - name: RAY_REDIS_ADDRESS
                  value: redis://xxxxx.ng.0001.use1.cache.amazonaws.com:6379
              resources:
                limits:
                  cpu: 2
                  memory: 8Gi
                requests:
                  cpu: 2
                  memory: 8Gi
              ports:
                - containerPort: 6379
                  name: gcs-server
                - containerPort: 8265 # Ray dashboard
                  name: dashboard
                - containerPort: 10001
                  name: client
                - containerPort: 8000
                  name: serve
    workerGroupSpecs:
      # the pod replicas in this group typed worker
      - replicas: 1
        minReplicas: 1
        maxReplicas: 15
        # logical group name, for this called small-group, also can be functional
        groupName: small-group
        rayStartParams: {}
        # pod template
        template:
          spec:
            containers:
              - name: ray-worker # must consist of lower case alphanumeric characters or '-', and must start and end with an alphanumeric character (e.g. 'my-name', or '123-abc')
                image: DOCKER_IMAGE_URL
                imagePullPolicy: Always
                lifecycle:
                  preStop:
                    exec:
                      command: [/bin/sh, -c, ray stop]
                resources:
                  limits:
                    cpu: 2
                    memory: 8Gi
                  requests:
                    cpu: 2
                    memory: 8Gi
```

Deploy the single-application code first, then try to deploy the multi-application code. You should see the error above.
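For quick comparison, this is the Serve config block condensed from the two full specs above; it is the API V1 to API V2 switch that the error message refers to:

```yaml
# Before: single-application config (API V1)
spec:
  serveConfig:
    importPath: demo:driver
    deployments:
      - name: model1
        numReplicas: 1
      - name: model2
        numReplicas: 1
---
# After: multi-application config (API V2)
spec:
  serveConfigV2: |
    applications:
      - name: app1
        route_prefix: "/model1"
        import_path: "demo1:model1"
        deployments:
          - name: "model1"
            num_replicas: 1
      - name: app2
        route_prefix: "/model2"
        import_path: "demo2:model2"
        deployments:
          - name: "model2"
            num_replicas: 1
```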

Anything else

A workaround here is to temporarily remove GCS FT, deploy the multi-application config, and then add GCS FT back in. If you don't reboot the Redis instance, you run into the same error again when trying to add GCS FT back in.
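As a rough sketch of that GCS FT toggle, using the annotations from the configs above (the exact values here are illustrative):

```yaml
# Sketch only: disable GCS FT on the RayService while switching to the
# multi-application config, then restore the original annotations after
# the Redis instance has been rebooted.
metadata:
  name: rayservice-sample
  annotations:
    ray.io/ft-enabled: 'false'  # was 'true'; set back to 'true' after the migration
    # ray.io/external-storage-namespace: rayservice-sample
```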

Are you willing to submit a PR?

smit-kiri commented 1 year ago

This might be related to setting `ray.io/external-storage-namespace` explicitly.

kevin85421 commented 1 year ago

Thanks @smit-kiri for reporting this issue! This does not seem to be related to GCS FT to me. I can reproduce the issue by:

Ray Serve does not seem to allow in-place upgrades between API V1 (single-app) and API V2 (multi-app). A workaround is to update not only serveConfig / serveConfigV2 but also rayVersion, which has no effect when the Ray version is 2.0.0 or later, for example by changing it to 2.100.0. This triggers the preparation of a new RayCluster instead of an in-place update.
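A minimal sketch of that workaround against the sample config in this issue (2.100.0 is just a dummy value used to force a new cluster, as described above):

```yaml
spec:
  # Switch from the single-app serveConfig to the multi-app serveConfigV2 ...
  serveConfigV2: |
    applications:
      - name: app1
        route_prefix: "/model1"
        import_path: "demo1:model1"
  rayClusterConfig:
    # ... and change rayVersion at the same time. The field has no effect for
    # Ray >= 2.0.0, but changing it makes the operator prepare a new RayCluster
    # instead of updating the Serve config in place.
    rayVersion: 2.100.0  # was 2.6.1
```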

smit-kiri commented 1 year ago

Thanks @kevin85421! I was able to get around it by setting a different `ray.io/external-storage-namespace`.
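For reference, this is roughly what that change looks like in the RayService metadata (`rayservice-sample-v2` is only an example value, not taken from the original report):

```yaml
metadata:
  name: rayservice-sample
  annotations:
    ray.io/ft-enabled: 'true'
    # Pointing the upgraded cluster at a fresh storage namespace avoids reusing
    # the single-app Serve state stored in Redis under the old namespace.
    ray.io/external-storage-namespace: rayservice-sample-v2  # example value; was rayservice-sample
```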

kevin85421 commented 1 year ago

> Thanks @kevin85421! I was able to get around it by setting a different `ray.io/external-storage-namespace`.

Cool. I am still a bit confused. Do you only update serveConfig / serveConfigV2, or do you also update other fields? In the former case, it will only update the serve configurations in-place, while the latter case will trigger a zero-downtime upgrade. In my understanding, the former case will always report the exception ray.serve.exceptions.RayServeException whenever you upgrade from API V1 to API V2. If you trigger a zero-downtime upgrade, the different ray.io/external-storage-namespace solution makes sense to me.

smit-kiri commented 1 year ago

We triggered a zero-downtime upgrade by updating the Docker image.
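For example, a new image under rayClusterConfig is enough to make the operator prepare a pending RayCluster and switch traffic once it is ready (DOCKER_IMAGE_URL and the tag are placeholders, as in the configs above):

```yaml
# Sketch: bumping the image in the head and worker pod templates triggers a
# zero-downtime upgrade (a new RayCluster is prepared, then traffic switches).
rayClusterConfig:
  headGroupSpec:
    template:
      spec:
        containers:
          - name: ray-head
            image: DOCKER_IMAGE_URL:new-tag  # placeholder tag
  workerGroupSpecs:
    - groupName: small-group
      template:
        spec:
          containers:
            - name: ray-worker
              image: DOCKER_IMAGE_URL:new-tag  # placeholder tag
```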

kevin85421 commented 1 year ago

Updated the doc: https://github.com/ray-project/ray/pull/38647/commits/ec19d1556f06587f82b7d554ad71cfc6cfda566a.

kevin85421 commented 1 year ago

https://github.com/ray-project/ray/pull/38647 has been merged. Closing this issue.