[Core][Autoscaler] Autoscaler stops while waiting for GCP compute operation

astron8t-voyagerx commented 1 year ago

What happened + What you expected to happen

I'm using Ray Serve on GCP for model serving. Intermittently, when a scale-down happens and the autoscaler tries to delete idle nodes (GCP instances in this case), the autoscaler hangs. When I check 'monitor.log', it has stopped at the following status and is no longer being updated.

Usage:
 28.0/64.0 CPU
 1.0/4.0 GPU
 1.0/1.0 Head
 0B/241.34GiB memory
 0B/104.08GiB object_store_memory
 1.0/4.0 realistic

Demands:
 (no resource demands)
2023-08-01 09:42:48,526 INFO autoscaler.py:588 -- StandardAutoscaler: Terminating the node with id ray-stable-diffusion-beta-worker-e73d42cc-compute and ip 10.178.15.226. (idle)
2023-08-01 09:42:48,526 INFO autoscaler.py:542 -- Node last used: Tue Aug  1 09:41:45 2023.
2023-08-01 09:42:48,526 INFO autoscaler.py:588 -- StandardAutoscaler: Terminating the node with id ray-stable-diffusion-beta-worker-b179692d-compute and ip 10.178.15.192. (idle)
2023-08-01 09:42:48,526 INFO autoscaler.py:542 -- Node last used: Tue Aug  1 09:41:45 2023.
2023-08-01 09:42:48,526 INFO autoscaler.py:588 -- StandardAutoscaler: Terminating the node with id ray-stable-diffusion-beta-worker-62bfab74-compute and ip 10.178.15.208. (idle)
2023-08-01 09:42:48,526 INFO autoscaler.py:542 -- Node last used: Tue Aug  1 09:41:45 2023.
2023-08-01 09:42:48,526 INFO autoscaler.py:676 -- Draining 3 raylet(s).
2023-08-01 09:42:48,528 INFO node_provider.py:186 -- NodeProvider: ray-stable-diffusion-beta-worker-e73d42cc-compute: Terminating node
2023-08-01 09:42:48,529 INFO node_provider.py:186 -- NodeProvider: ray-stable-diffusion-beta-worker-b179692d-compute: Terminating node
2023-08-01 09:42:48,540 INFO node_provider.py:186 -- NodeProvider: ray-stable-diffusion-beta-worker-62bfab74-compute: Terminating node
2023-08-01 09:42:50,062 INFO node.py:311 -- wait_for_compute_zone_operation: Waiting for operation operation-1690882969327-601d95ebe55b0-1c5f5aa0-b7200b69 to finish...
2023-08-01 09:42:50,187 INFO node.py:311 -- wait_for_compute_zone_operation: Waiting for operation operation-1690882969327-601d95ebe55b0-1c5f5aa0-b7200b69 to finish...

When I check the last pending GCP operation (operation-1690882969327-601d95ebe55b0-1c5f5aa0-b7200b69) with the gcloud CLI, it's already done, with status code 404. But after this bug happens, the autoscaler stops working and does not scale up or down anymore.
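To illustrate the behavior I would expect instead of an unbounded wait, here is a minimal sketch of a polling loop with a deadline that also treats a missing operation as finished. This is not Ray's actual code: `wait_for_operation`, `OperationNotFound`, and the `get_operation` callable are hypothetical names made up for this example, and the only thing assumed about GCP is that a compute operation resource reports a `status` field that becomes `DONE`.

```python
import time
from typing import Callable, Optional


class OperationNotFound(Exception):
    """Hypothetical error standing in for an HTTP 404 from the operations API."""


def wait_for_operation(
    get_operation: Callable[[], dict],
    timeout_s: float = 300.0,
    poll_interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Optional[dict]:
    """Poll until the operation is DONE, has disappeared, or the deadline passes.

    get_operation is a hypothetical callable that fetches the operation
    resource as a dict and raises OperationNotFound when it no longer exists.
    """
    deadline = clock() + timeout_s
    while True:
        try:
            result = get_operation()
        except OperationNotFound:
            # Assumption: an operation that can no longer be found is not
            # worth waiting on, so stop polling instead of looping forever.
            return None
        if result.get("status") == "DONE":
            if "error" in result:
                raise RuntimeError(f"operation failed: {result['error']}")
            return result
        if clock() >= deadline:
            raise TimeoutError(f"operation not finished after {timeout_s}s")
        sleep(poll_interval_s)
```

With a bound like this, a stuck or vanished operation would surface as a `None` result or a `TimeoutError` that the caller can log and recover from, rather than blocking every later autoscaler update.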

Versions / Dependencies

Ray == 2.6.1
Python == 3.10

Reproduction script

My Serve config is as follows:

  deployments:
  - name: StableDiffusion
    max_concurrent_queries: 20
    autoscaling_config:
      min_replicas: 1
      initial_replicas: 2
      max_replicas: 4
      target_num_ongoing_requests_per_replica: 10.0
      metrics_interval_s: 10.0
      look_back_period_s: 900
      smoothing_factor: 1.0
      downscale_delay_s: 900
      upscale_delay_s: 900
    ray_actor_options:
      num_cpus: 12.0
      num_gpus: 1.0
      resources:
        realistic: 1.0
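As a rough sanity check of what this config can ask the cluster for, the sketch below multiplies the per-replica `ray_actor_options` by the replica bounds. The values are copied from the config above; the helper name `replica_demand` is made up for this example, and the calculation covers only the deployment replicas, not any other actors or workloads on the cluster.

```python
# Per-replica resources and replica bounds, copied from the Serve config above.
per_replica = {"CPU": 12.0, "GPU": 1.0, "realistic": 1.0}
min_replicas = 1
max_replicas = 4


def replica_demand(resources: dict, replicas: int) -> dict:
    """Total resources reserved by `replicas` copies of one deployment replica."""
    return {name: amount * replicas for name, amount in resources.items()}


floor = replica_demand(per_replica, min_replicas)
peak = replica_demand(per_replica, max_replicas)

print("at min_replicas:", floor)  # {'CPU': 12.0, 'GPU': 1.0, 'realistic': 1.0}
print("at max_replicas:", peak)   # {'CPU': 48.0, 'GPU': 4.0, 'realistic': 4.0}
```

At `max_replicas` this comes to 4 GPU and 4 `realistic`, which matches the totals in the status output above (4.0 GPU, 4.0 realistic), and 48 of the 64 CPUs.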

Issue Severity

High: It blocks me from completing my task.

rkooo567 commented 1 year ago

@rickyyx can you triage?

rickyyx commented 1 year ago

cc @architkulkarni do you have any insights that might be helpful here?

architkulkarni commented 1 year ago

Not sure about this one unfortunately :(
