ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[core] autoscaler occasionally goes into exception loop when using preemptible GCP instances #29698

Open neex opened 2 years ago

neex commented 2 years ago

What happened + What you expected to happen

I run a Ray cluster on Google Cloud Platform for my tasks. One thing to note is that I use preemptible instances for workers (thus, Google may stop them at any time).

After a while (about 30-40 minutes of active usage), scaling stops working: no new workers come up, and no idle workers are destroyed after the idle timeout (moreover, some workers are up but never initialized). I've debugged the issue down to what looks like an infinite exception-restart loop in /tmp/ray/session_latest/logs/monitor.log on the head node; the relevant part of the log is:

2022-10-26 13:26:33,018 INFO discovery.py:873 -- URL being requested: GET https://compute.googleapis.com/compute/v1/projects/wunderfund-research/zones/europe-west1-c/instances?filter=%28%28status+%3D+PROVISIONING%29+OR+%28status+%3D+STAGI
NG%29+OR+%28status+%3D+RUNNING%29%29+AND+%28labels.ray-cluster-name+%3D+research%29&alt=json
2022-10-26 13:26:33,136 INFO discovery.py:873 -- URL being requested: GET https://compute.googleapis.com/compute/v1/projects/wunderfund-research/zones/europe-west1-c/instances/ray-research-worker-cbcbb628-compute?alt=json
2022-10-26 13:26:33,195 ERROR autoscaler.py:341 -- StandardAutoscaler: Error during autoscaling.
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ray/autoscaler/_private/autoscaler.py", line 338, in update
    self._update()
  File "/usr/local/lib/python3.10/dist-packages/ray/autoscaler/_private/autoscaler.py", line 397, in _update
    self.process_completed_updates()
  File "/usr/local/lib/python3.10/dist-packages/ray/autoscaler/_private/autoscaler.py", line 732, in process_completed_updates
    self.load_metrics.mark_active(self.provider.internal_ip(node_id))
  File "/usr/local/lib/python3.10/dist-packages/ray/autoscaler/_private/gcp/node_provider.py", line 155, in internal_ip
    node = self._get_cached_node(node_id)
  File "/usr/local/lib/python3.10/dist-packages/ray/autoscaler/_private/gcp/node_provider.py", line 217, in _get_cached_node
    return self._get_node(node_id)
  File "/usr/local/lib/python3.10/dist-packages/ray/autoscaler/_private/gcp/node_provider.py", line 45, in method_with_retries
    return method(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/ray/autoscaler/_private/gcp/node_provider.py", line 209, in _get_node
    instance = resource.get_instance(node_id=node_id)
  File "/usr/local/lib/python3.10/dist-packages/ray/autoscaler/_private/gcp/node.py", line 407, in get_instance
    .execute()
  File "/home/ubuntu/.local/lib/python3.10/site-packages/googleapiclient/_helpers.py", line 130, in positional_wrapper
    return wrapped(*args, **kwargs)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/googleapiclient/http.py", line 851, in execute
    raise HttpError(resp, content, uri=self.uri)
googleapiclient.errors.HttpError: <HttpError 404 when requesting https://compute.googleapis.com/compute/v1/projects/wunderfund-research/zones/europe-west1-c/instances/ray-research-worker-cbcbb628-compute?alt=json returned "The resource 'projects/wunderfund-research/zones/europe-west1-c/instances/ray-research-worker-cbcbb628-compute' was not found">
2022-10-26 13:26:33,196 CRITICAL autoscaler.py:350 -- StandardAutoscaler: Too many errors, abort.

This exception repeats again and again with the same worker id ray-research-worker-cbcbb628-compute.

The ray-research-worker-cbcbb628-compute instance did exist at some point but no longer exists at the time of the exception (thus, the 404 response from GCP is legitimate).

I believe (though I'm not sure) that the situation is something like this:

  1. Ray started setting up the instance for a worker and added it to some internal data structures.
  2. At some point (probably during setup), the instance was preempted, since I use preemptible instances.
  3. Google Cloud Platform immediately forgot about it and started returning 404 for all requests related to the instance.
  4. The autoscaler did not handle this corner case correctly and never removed the instance from its internal data structures.

The expected behavior is that the autoscaler should handle this case and continue to set up other workers, shut down idle ones, etc.
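The expected behavior can be sketched as follows. This is a minimal illustration, not Ray's actual code: `NodeGone`, `Provider`, `forget_node`, and `process_completed_updates` are hypothetical stand-ins for the autoscaler's internals.

```python
# Minimal sketch (not Ray's actual code): a vanished node should be
# dropped from the internal lists instead of aborting the whole loop.

class NodeGone(Exception):
    """The cloud API no longer knows the instance (e.g. a GCP 404)."""

class Provider:
    def __init__(self, nodes):
        self.nodes = dict(nodes)  # node_id -> internal IP

    def internal_ip(self, node_id):
        if node_id not in self.nodes:
            raise NodeGone(node_id)  # preempted instance: GCP returns 404
        return self.nodes[node_id]

    def forget_node(self, node_id):
        self.nodes.pop(node_id, None)

def process_completed_updates(provider, completed, active_ips):
    for node_id in completed:
        try:
            active_ips.append(provider.internal_ip(node_id))
        except NodeGone:
            # Instance vanished mid-setup: forget it and keep scaling.
            provider.forget_node(node_id)

provider = Provider({"worker-1": "10.0.0.2"})
active_ips = []
process_completed_updates(provider, ["worker-1", "worker-2-preempted"], active_ips)
print(active_ips)  # ['10.0.0.2']
```

With handling like this, one preempted worker would not stop the autoscaler from setting up other workers or shutting down idle ones.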

Versions / Dependencies

$ ray --version
ray, version 2.0.1
$ python --version
Python 3.10.6
$ uname -a
Linux ray-research-head-3c5e32a6-compute 5.15.0-1021-gcp #28-Ubuntu SMP Fri Oct 14 15:46:06 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
$ cat /etc/issue
Ubuntu 22.04.1 LTS \n \l

Google Cloud Platform is used, and preemptible instances are used for workers (see config).

Reproduction script

Config:

cluster_name: ray-debug
max_workers: 30

provider:
  type: gcp
  region: europe-west1
  availability_zone: europe-west1-c
  project_id: wunderfund-research

available_node_types:
    head:
        resources: {"CPU": 0}
        node_config:
            machineType: n2-standard-2
            disks:
              - boot: true
                autoDelete: true
                type: PERSISTENT
                initializeParams:
                  diskSizeGb: 50
                  # sourceImage: projects/deeplearning-platform-release/global/images/family/common-cpu
                  sourceImage: projects/ubuntu-os-cloud/global/images/family/ubuntu-2204-lts

                  # ubuntu-2204-jammy-v20220712a
    worker:
        # memory 640 GB =  640*1024*1024*1024 = 687194767360
        resources: {"CPU": 1, "memory": 687194767360}
        node_config:
            machineType: n2-standard-2

            disks:
              - boot: true
                autoDelete: true
                type: PERSISTENT
                initializeParams:
                  diskSizeGb: 50
                  # sourceImage: projects/deeplearning-platform-release/global/images/family/common-cpu
                  sourceImage: projects/ubuntu-os-cloud/global/images/family/ubuntu-2204-lts
            scheduling:
              - preemptible: true
            serviceAccounts:
            - email: "ray-worker@wunderfund-research.iam.gserviceaccount.com"
              scopes:
              - https://www.googleapis.com/auth/cloud-platform

head_node_type: head
idle_timeout_minutes: 1
upscaling_speed: 2

auth:
   ssh_user: ubuntu

setup_commands:
  - sudo apt update
  - sudo DEBIAN_FRONTEND=noninteractive apt install python3-pip python-is-python3 -y
  - sudo pip install -U pip
  - sudo pip install ray[all]

# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
    - ray stop
    - >-
      ulimit -n 65536;
      ray start
      --head
      --port=6379
      --object-manager-port=8076
      --autoscaling-config=~/ray_bootstrap_config.yaml
# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
    - ray stop
    - >-
      ulimit -n 65536;
      ray start
      --address=$RAY_HEAD_IP:6379
      --object-manager-port=8076

Script:

import time
import ray

def test_job(delay):
    time.sleep(delay)
    return f"Waited for {delay} secs"

def run_jobs():
    delays = [i * 10 for i in range(1, 30)]
    jobs = [ray.remote(test_job).options(num_cpus=1).remote(d) for d in delays]

    while jobs:
        done_ids, jobs = ray.wait(jobs)
        for ref in done_ids:
            result = ray.get(ref)
            print(ref, result)

if __name__ == "__main__":
    ray.init(address="auto")  # connect to the existing cluster
    run_jobs()

In order to reproduce the issue, you may have to submit the script to the cluster several times before a preemption is caught in the right state.

Issue Severity

High: It blocks me from completing my task.

cadedaniel commented 2 years ago

@wuisawesome could you help triage this?

cheremovsky commented 2 years ago

Same story as OP 😢

anyscalesam commented 3 months ago

More details: the issue also replicates with TPUs (just a couple is fine; maybe 4-8).

UPDATE: @hongchaodeng can you please take a look at this; rickyx@ can help with some of the context around AS in general. For help on reproing and an environment on GCP please grab thomas@ so you can get a GCP sandbox to proc spot preemptions on GCP if needed.

hongchaodeng commented 3 months ago

The issue is a known bug in the GCP provider of the cluster launcher.

The Ray autoscaler performs two primary functions:

  1. monitoring the current state of instances
  2. making autoscaling decisions.

The resource 'projects/wunderfund-research/zones/europe-west1-c/instances/ray-research-worker-cbcbb628-compute' was not found

The problem arises during the first step. The cluster launcher code assumes that instances remain available once created. However, external actions, such as manual termination or spot preemption, break this assumption. When such disruptions occur, the cluster launcher does not handle the resulting exception and retries the failing operation indefinitely.
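The fix amounts to classifying the error. A sketch of the idea, not a proposed patch: `googleapiclient.errors.HttpError` exposes the HTTP status via `.resp.status`, so a 404 can be treated as "instance already gone" while other statuses remain retryable. `FakeResp`/`FakeHttpError` below stand in for the real classes so the snippet is self-contained.

```python
# Sketch of distinguishing "instance gone" from transient API errors
# (not Ray's actual code).  The .resp.status attribute mirrors
# googleapiclient.errors.HttpError.

class FakeResp:
    def __init__(self, status):
        self.status = status

class FakeHttpError(Exception):
    def __init__(self, status):
        self.resp = FakeResp(status)

def get_instance(api_status, node_id):
    # Simulates the Compute API call; non-200 raises like HttpError does.
    if api_status != 200:
        raise FakeHttpError(api_status)
    return {"name": node_id, "status": "RUNNING"}

def get_node_or_none(api_status, node_id):
    """Return the instance, or None if the API says it no longer exists."""
    try:
        return get_instance(api_status, node_id)
    except FakeHttpError as e:
        if e.resp.status == 404:
            return None   # preempted and deleted: treat as terminated
        raise             # other errors are still worth retrying

print(get_node_or_none(200, "worker-a"))  # instance dict
print(get_node_or_none(404, "worker-b"))  # None
```

Returning a sentinel for 404 lets the caller remove the node from its bookkeeping instead of escalating to the "Too many errors, abort" path.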

This behavior is due to the cluster launcher being designed primarily for bootstrapping and prototyping Ray projects. It is important to note that this issue does not affect the Anyscale platform, which uses a different proprietary autoscaler.

To avoid this problem, you may consider leveraging the autoscaling capabilities of Anyscale. Alternatively, you would need to implement additional steps to manage autoscaling effectively.