I'm using Ray Serve on GCP for model serving. The problem occurs intermittently during scale-down, when the autoscaler tries to delete idle nodes (GCP instances in this case). When I check 'monitor.log', it has stopped at the following status and is no longer being updated.
Usage:
28.0/64.0 CPU
1.0/4.0 GPU
1.0/1.0 Head
0B/241.34GiB memory
0B/104.08GiB object_store_memory
1.0/4.0 realistic
Demands:
(no resource demands)
2023-08-01 09:42:48,526 INFO autoscaler.py:588 -- StandardAutoscaler: Terminating the node with id ray-stable-diffusion-beta-worker-e73d42cc-compute and ip 10.178.15.226. (idle)
2023-08-01 09:42:48,526 INFO autoscaler.py:542 -- Node last used: Tue Aug 1 09:41:45 2023.
2023-08-01 09:42:48,526 INFO autoscaler.py:588 -- StandardAutoscaler: Terminating the node with id ray-stable-diffusion-beta-worker-b179692d-compute and ip 10.178.15.192. (idle)
2023-08-01 09:42:48,526 INFO autoscaler.py:542 -- Node last used: Tue Aug 1 09:41:45 2023.
2023-08-01 09:42:48,526 INFO autoscaler.py:588 -- StandardAutoscaler: Terminating the node with id ray-stable-diffusion-beta-worker-62bfab74-compute and ip 10.178.15.208. (idle)
2023-08-01 09:42:48,526 INFO autoscaler.py:542 -- Node last used: Tue Aug 1 09:41:45 2023.
2023-08-01 09:42:48,526 INFO autoscaler.py:676 -- Draining 3 raylet(s).
2023-08-01 09:42:48,528 INFO node_provider.py:186 -- NodeProvider: ray-stable-diffusion-beta-worker-e73d42cc-compute: Terminating node
2023-08-01 09:42:48,529 INFO node_provider.py:186 -- NodeProvider: ray-stable-diffusion-beta-worker-b179692d-compute: Terminating node
2023-08-01 09:42:48,540 INFO node_provider.py:186 -- NodeProvider: ray-stable-diffusion-beta-worker-62bfab74-compute: Terminating node
2023-08-01 09:42:50,062 INFO node.py:311 -- wait_for_compute_zone_operation: Waiting for operation operation-1690882969327-601d95ebe55b0-1c5f5aa0-b7200b69 to finish...
2023-08-01 09:42:50,187 INFO node.py:311 -- wait_for_compute_zone_operation: Waiting for operation operation-1690882969327-601d95ebe55b0-1c5f5aa0-b7200b69 to finish...
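One way to notice this state automatically is to compare the timestamp of the last log line with the current time. The following is a minimal sketch, assuming the timestamp prefix shown in the log lines above; the function names and the 5-minute threshold are hypothetical choices for illustration, not part of Ray.

```python
import re
from datetime import datetime, timedelta

# Matches the timestamp prefix seen in the log lines above,
# e.g. "2023-08-01 09:42:50,187 INFO node.py:311 -- ..."
TIMESTAMP_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),(\d{3}) ")


def last_log_time(lines):
    """Return the timestamp of the last timestamped line, or None if there is none."""
    last = None
    for line in lines:
        match = TIMESTAMP_RE.match(line)
        if match:
            base = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
            last = base + timedelta(milliseconds=int(match.group(2)))
    return last


def is_stalled(lines, now, max_silence=timedelta(minutes=5)):
    """True if the newest timestamped line is older than max_silence (assumed threshold)."""
    last = last_log_time(lines)
    return last is not None and now - last > max_silence
```

The caller reads the lines of 'monitor.log' and passes them in together with the current time. Lines without a timestamp, such as the "Usage:" block, are skipped.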
When I check the last pending GCP operation (operation-1690882969327-601d95ebe55b0-1c5f5aa0-b7200b69) with the gcloud CLI, it is already done (status code 404). After this bug happens, the autoscaler stops working and no longer scales up or down.
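The cause of the endless wait is not confirmed here. For illustration only, the following sketch shows what a bounded wait could look like: it polls with a deadline and treats an operation that can no longer be found (the 404 case) as finished. This is not Ray's `wait_for_compute_zone_operation`; `fetch_status` is a hypothetical callable supplied by the caller, and the `"NOT_FOUND"` label, the timeout, and the poll interval are assumptions.

```python
import time


class OperationTimeout(Exception):
    """Raised when the operation does not finish before the deadline."""


def wait_for_operation(fetch_status, timeout_s=300.0, poll_interval_s=1.0,
                       sleep=time.sleep, clock=time.monotonic):
    """Poll fetch_status() until the operation finishes or the deadline passes.

    fetch_status is a hypothetical callable returning a string such as
    "RUNNING", "DONE", or "NOT_FOUND" (an assumed label for a 404 response).
    """
    deadline = clock() + timeout_s
    while True:
        status = fetch_status()
        # Treat a vanished operation like a finished one, so a 404
        # cannot keep the loop spinning forever.
        if status in ("DONE", "NOT_FOUND"):
            return status
        if clock() >= deadline:
            raise OperationTimeout(f"operation still {status} after {timeout_s}s")
        sleep(poll_interval_s)
```

With a deadline in place, a stuck operation would surface as an exception that the caller can log and retry, instead of blocking every later scale-up or scale-down.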
Versions / Dependencies
Ray == 2.6.1
Python == 3.10
Reproduction script
My Serve config is as follows:
Issue Severity
High: It blocks me from completing my task.