[Closed] smit-kiri closed this issue 1 year ago
This might be related to setting `ray.io/external-storage-namespace` explicitly.
Thanks @smit-kiri for reporting this issue! This seems to have no relationship with GCS FT to me. I can reproduce the issue by:

1. Deploying a RayService with `serveConfig` (Ray Serve API V1) using this YAML.
2. Commenting out `serveConfig` and uncommenting `serveConfigV2` in the YAML, then running `kubectl apply` to update the RayService.
3. Checking the KubeRay operator logs:

```
2023-08-24T07:23:44.201Z ERROR controllers.RayService fail to update deployment {"error": "UpdateDeployments fail: 400 Bad Request \u001b[36mray::ServeController.deploy_apps()\u001b[39m (pid=302, ip=10.244.0.6, actor_id=26b13b037565cbf4d5afcb5701000000, repr=<ray.serve.controller.ServeController object at 0x7f50291c2490>)\n File \"/home/ray/anaconda3/lib/python3.7/concurrent/futures/_base.py\", line 428, in result\n return self.__get_result()\n File \"/home/ray/anaconda3/lib/python3.7/concurrent/futures/_base.py\", line 384, in __get_result\n raise self._exception\n File \"/home/ray/anaconda3/lib/python3.7/site-packages/ray/serve/controller.py\", line 538, in deploy_apps\n \"You are trying to deploy a multi-application config, however \"\nray.serve.exceptions.RayServeException: You are trying to deploy a multi-application config, however a single-application config has been deployed to the current Serve instance already. Mixing single-app and multi-app is not allowed. Please either redeploy using the single-application config format `ServeApplicationSchema`, or shutdown and restart Serve to submit a multi-app config of format `ServeDeploySchema`. If you are using the REST API, you can submit a multi-app config to the the multi-app API endpoint `/api/serve/applications/`."}
```
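For intuition, the controller-side check behaves roughly like the stdlib-only sketch below. The class and method names here are invented stand-ins, not Ray's actual internals:

```python
# Hypothetical sketch of the guard in Serve's controller that produces the
# error above: once a config of one format has been deployed, submitting the
# other format is rejected until Serve is shut down and restarted.
class RayServeException(Exception):
    pass


class FakeServeController:
    """Illustrative stand-in for ray.serve.controller.ServeController."""

    def __init__(self):
        self.deployed_format = None  # "single-app" (V1) or "multi-app" (V2)

    def deploy_config(self, config_format: str):
        if self.deployed_format is not None and self.deployed_format != config_format:
            raise RayServeException(
                f"You are trying to deploy a {config_format} config, however "
                f"a {self.deployed_format} config has been deployed to the "
                "current Serve instance already. Mixing single-app and "
                "multi-app is not allowed."
            )
        self.deployed_format = config_format


controller = FakeServeController()
controller.deploy_config("single-app")  # initial serveConfig deployment
controller.deploy_config("single-app")  # in-place update, same format: fine
try:
    controller.deploy_config("multi-app")  # switching to serveConfigV2 in place
except RayServeException as exc:
    print(f"rejected: {exc}")
```

This is why only an in-place update fails: a fresh Serve instance (i.e. a new RayCluster) has no previously deployed format to conflict with.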
Ray Serve does not seem to allow in-place upgrades between API V1 (single app) and API V2 (multi app). A workaround is to update not only `serveConfig` / `serveConfigV2`, but also `rayVersion` (which has no effect when the Ray version is 2.0.0 or later), e.g. to `2.100.0`. This will trigger the preparation of a new RayCluster instead of an in-place update.
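As a sketch, the spec change amounts to something like the following. Only the relevant fields are shown, and the `2.100.0` value is arbitrary, since `rayVersion` is ignored for Ray 2.0.0+:

```yaml
# Sketch of the workaround: besides swapping serveConfig -> serveConfigV2,
# bump rayVersion so KubeRay prepares a new RayCluster (zero-downtime
# upgrade) instead of attempting an in-place Serve config update.
spec:
  serveConfigV2:           # was: serveConfig (API V1)
    applications:
      - name: app1
        route_prefix: "/model1"
        import_path: "demo1:model1"
  rayClusterConfig:
    rayVersion: 2.100.0    # was: 2.6.1; the value is ignored by Ray >= 2.0.0,
                           # but the change itself triggers a new cluster rollout
```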
Thanks @kevin85421! I was able to get around it by setting a different `ray.io/external-storage-namespace`.
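For reference, that change is just the annotation below; the new namespace value is illustrative:

```yaml
metadata:
  annotations:
    ray.io/ft-enabled: 'true'
    # Point the upgraded cluster at a fresh storage namespace in Redis so it
    # does not pick up the old single-app Serve state. Value is illustrative.
    ray.io/external-storage-namespace: rayservice-sample-v2
```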
> Thanks @kevin85421! I was able to get around it by setting a different `ray.io/external-storage-namespace`
Cool. I am still a bit confused. Do you only update `serveConfig` / `serveConfigV2`, or do you also update other fields? The former will only update the Serve configurations in place, while the latter will trigger a zero-downtime upgrade. In my understanding, the former case will always report the exception `ray.serve.exceptions.RayServeException` whenever you upgrade from API V1 to API V2. If you trigger a zero-downtime upgrade, the different `ray.io/external-storage-namespace` solution makes sense to me.
We triggered a zero-downtime upgrade by updating the Docker image.
https://github.com/ray-project/ray/pull/38647 is merged. Closing this issue.
### Search before asking
### KubeRay Component
ray-operator
### What happened + What you expected to happen
I'm trying to move all our workloads from single-application to multi-application RayService with the release of KubeRay v0.6.0, and it does not seem possible to do it without downtime if we're using GCS FT. I see the following error:

### Reproduction script
Single application:
demo.py
```python
import time

from ray import serve
from ray.serve.drivers import DAGDriver


@serve.deployment(name="model1")
class Model1:
    def __init__(self):
        # Simulate an init method
        time.sleep(60)

    async def __call__(self, http_request):
        data: dict = await http_request.json()
        data["model"] = "model1"
        return data


@serve.deployment(name="model2")
class Model2:
    def __init__(self):
        # Simulate an init method
        time.sleep(60)

    async def __call__(self, http_request):
        data: dict = await http_request.json()
        data["model"] = "model2"
        return data


driver = DAGDriver.bind({"/model1": Model1.bind(), "/model2": Model2.bind()})  # type: ignore
```

Dockerfile
```Dockerfile
FROM rayproject/ray:2.6.1-py310 as common

ENV WORKING_DIR /home/ray/models
WORKDIR ${WORKING_DIR}

ADD ./demo.py ${WORKING_DIR}
```

rayservice_config.yaml
```yaml
apiVersion: ray.io/v1alpha1
kind: RayService
metadata:
  name: rayservice-sample
  annotations:
    ray.io/ft-enabled: 'true'
    ray.io/external-storage-namespace: rayservice-sample
spec:
  serviceUnhealthySecondThreshold: 900 # Config for the health check threshold for service. Default value is 60.
  deploymentUnhealthySecondThreshold: 900 # Config for the health check threshold for deployments. Default value is 60.
  serveConfig:
    importPath: demo:driver
    deployments:
      - name: model1
        numReplicas: 1
      - name: model2
        numReplicas: 1
  rayClusterConfig:
    rayVersion: 2.6.1 # should match the Ray version in the image of the containers
    enableInTreeAutoscaling: true
    ######################headGroupSpecs#################################
    # Ray head pod template.
    headGroupSpec:
      # the following params are used to complete the ray start: ray start --head --block --redis-port=6379 ...
      rayStartParams:
        port: '6379' # should match container port named gcs-server
        dashboard-host: 0.0.0.0
      # pod template
      template:
        spec:
          containers:
            - name: ray-head
              image: DOCKER_IMAGE_URL
              imagePullPolicy: Always
              env:
                - name: RAY_LOG_TO_STDERR
                  value: '1'
                - name: RAY_REDIS_ADDRESS
                  value: redis://xxxxx.ng.0001.use1.cache.amazonaws.com:6379
              resources:
                limits:
                  cpu: 2
                  memory: 8Gi
                requests:
                  cpu: 2
                  memory: 8Gi
              ports:
                - containerPort: 6379
                  name: gcs-server
                - containerPort: 8265 # Ray dashboard
                  name: dashboard
                - containerPort: 10001
                  name: client
                - containerPort: 8000
                  name: serve
    workerGroupSpecs:
      # the pod replicas in this group typed worker
      - replicas: 1
        minReplicas: 1
        maxReplicas: 15
        # logical group name, for this called small-group, also can be functional
        groupName: small-group
        rayStartParams: {}
        # pod template
        template:
          spec:
            containers:
              - name: ray-worker # must consist of lower case alphanumeric characters or '-', and must start and end with an alphanumeric character (e.g. 'my-name', or '123-abc')
                image: DOCKER_IMAGE_URL
                imagePullPolicy: Always
                lifecycle:
                  preStop:
                    exec:
                      command: [/bin/sh, -c, ray stop]
                resources:
                  limits:
                    cpu: 2
                    memory: 8Gi
                  requests:
                    cpu: 2
                    memory: 8Gi
```

Multi-application:
demo1.py
```python
import time

from ray import serve


@serve.deployment(name="model1")
class Model1:
    def __init__(self):
        # Simulate an init method
        time.sleep(60)

    async def __call__(self, http_request):
        # Another dummy change
        data: dict = await http_request.json()
        data["model"] = "model1"
        return data


model1 = Model1.bind()  # type: ignore
```

demo2.py
```python
import time

from ray import serve


@serve.deployment(name="model2")
class Model2:
    def __init__(self):
        # Simulate an init method
        time.sleep(60)

    async def __call__(self, http_request):
        # Dummy change
        data: dict = await http_request.json()
        data["model"] = "model2"
        return data


model2 = Model2.bind()  # type: ignore
```

Dockerfile
```Dockerfile
FROM rayproject/ray:2.6.1-py310 as common

ENV WORKING_DIR /home/ray/models
WORKDIR ${WORKING_DIR}

ADD ./model_deployments/demo1.py ${WORKING_DIR}
ADD ./model_deployments/demo2.py ${WORKING_DIR}
```

rayservice_config.yaml
```yaml
apiVersion: ray.io/v1alpha1
kind: RayService
metadata:
  name: rayservice-sample
  annotations:
    ray.io/ft-enabled: 'true'
    ray.io/external-storage-namespace: rayservice-sample
spec:
  serviceUnhealthySecondThreshold: 900 # Config for the health check threshold for service. Default value is 60.
  deploymentUnhealthySecondThreshold: 900 # Config for the health check threshold for deployments. Default value is 60.
  serveConfigV2:
    applications:
      - name: app1
        route_prefix: "/model1"
        import_path: "demo1:model1"
        deployments:
          - name: "model1"
            num_replicas: 1
      - name: app2
        route_prefix: "/model2"
        import_path: "demo2:model2"
        deployments:
          - name: "model2"
            num_replicas: 1
  rayClusterConfig:
    rayVersion: 2.6.1 # should match the Ray version in the image of the containers
    enableInTreeAutoscaling: true
    ######################headGroupSpecs#################################
    # Ray head pod template.
    headGroupSpec:
      # the following params are used to complete the ray start: ray start --head --block --redis-port=6379 ...
      rayStartParams:
        port: '6379' # should match container port named gcs-server
        dashboard-host: 0.0.0.0
      # pod template
      template:
        spec:
          containers:
            - name: ray-head
              image: DOCKER_IMAGE_URL
              imagePullPolicy: Always
              env:
                - name: RAY_LOG_TO_STDERR
                  value: '1'
                - name: RAY_REDIS_ADDRESS
                  value: redis://xxxxx.ng.0001.use1.cache.amazonaws.com:6379
              resources:
                limits:
                  cpu: 2
                  memory: 8Gi
                requests:
                  cpu: 2
                  memory: 8Gi
              ports:
                - containerPort: 6379
                  name: gcs-server
                - containerPort: 8265 # Ray dashboard
                  name: dashboard
                - containerPort: 10001
                  name: client
                - containerPort: 8000
                  name: serve
    workerGroupSpecs:
      # the pod replicas in this group typed worker
      - replicas: 1
        minReplicas: 1
        maxReplicas: 15
        # logical group name, for this called small-group, also can be functional
        groupName: small-group
        rayStartParams: {}
        # pod template
        template:
          spec:
            containers:
              - name: ray-worker # must consist of lower case alphanumeric characters or '-', and must start and end with an alphanumeric character (e.g. 'my-name', or '123-abc')
                image: DOCKER_IMAGE_URL
                imagePullPolicy: Always
                lifecycle:
                  preStop:
                    exec:
                      command: [/bin/sh, -c, ray stop]
                resources:
                  limits:
                    cpu: 2
                    memory: 8Gi
                  requests:
                    cpu: 2
                    memory: 8Gi
```

Deploy the single-application code first. Then try to deploy the multi-application code. You should see an error.
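Assuming the manifests above are saved as `rayservice_config.yaml` (first the single-app version, then edited to the multi-app version), the reproduction amounts to the following commands. The operator deployment name is an assumption and depends on how KubeRay was installed:

```shell
# Deploy the single-application (serveConfig / API V1) RayService first.
kubectl apply -f rayservice_config.yaml

# Wait until the Serve applications are healthy, then switch the file to the
# multi-application (serveConfigV2 / API V2) version and re-apply it.
kubectl apply -f rayservice_config.yaml

# The operator logs should show the RayServeException about mixing
# single-app and multi-app configs.
kubectl logs deploy/kuberay-operator | grep RayServeException
```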
### Anything else
A workaround here is to temporarily disable GCS FT, deploy the multi-application config, reboot the Redis instance, and then add GCS FT back. If you don't reboot the Redis instance, you run into the same error again when trying to add GCS FT back in.
### Are you willing to submit a PR?