dirkweissenborn opened this issue 2 years ago
Support for deploying with GCE VMs is not well-maintained at the moment. I'd recommend looking into deploying on GKE https://docs.ray.io/en/master/cluster/kubernetes/index.html -- Kubernetes support is better maintained, and the GKE team itself is starting to promote Ray on GKE.
One strategy for resolving this --
The particular bit of autoscaler code that encounters this has a notion of "completed updates" and "failed updates". If we fail to extract the node's IP, the update should be regarded as "failed".
Code reference: https://github.com/ray-project/ray/blob/134fa08637f4d0646c0ce442ed16cedbeeb14147/python/ray/autoscaler/_private/autoscaler.py#L733-L735
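As a rough illustration of that idea (this is a hedged sketch, not the actual autoscaler code; `resolve_ip`, `updaters`, and the bucket names are hypothetical), an exception while resolving a node's IP would route the node into the "failed" bucket and still remove it from the updaters dict, so the loop cannot get stuck on it:

```python
# Hypothetical sketch: treat an IP-resolution failure as a failed update so the
# node is always removed from the updaters dict instead of lingering forever.
def process_finished_update(node_id, updaters, completed, failed, resolve_ip):
    try:
        ip = resolve_ip(node_id)  # may raise, e.g. on a 404 from the cloud API
        completed.append((node_id, ip))
    except Exception:
        # Regard the update as "failed" rather than crashing the autoscaler loop.
        failed.append(node_id)
    finally:
        # Either way, drop the node so the update loop can make progress.
        updaters.pop(node_id, None)
```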
To fully resolve this issue without major code changes, we would need to examine each invocation of NodeProvider.internal_ip and ensure that failures are handled correctly.
A better solution would be to cache more information, reduce the number of network calls the autoscaler makes, and simplify the multithreading situation. I'm working on making such improvements for the autoscaler's Kubernetes integration.
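A hedged sketch of the caching idea, wrapping an arbitrary node provider so repeated internal_ip lookups don't each hit the cloud API (the wrapper class and its method names are illustrative, not Ray's API):

```python
# Illustrative caching wrapper around a node provider; names are hypothetical.
class CachingNodeProvider:
    def __init__(self, provider):
        self._provider = provider
        self._ip_cache = {}  # node_id -> internal IP

    def internal_ip(self, node_id):
        # Serve from the cache when possible to cut down on network calls.
        if node_id not in self._ip_cache:
            self._ip_cache[node_id] = self._provider.internal_ip(node_id)
        return self._ip_cache[node_id]

    def invalidate(self, node_id):
        # Drop stale entries, e.g. after a node is terminated or replaced.
        self._ip_cache.pop(node_id, None)
```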
Are there any earlier logs, possibly from the NodeUpdaterThread that was responsible for launching the instance that ultimately caused the 404?
Check /tmp/ray/session_latest/monitor.* for those logs.
Also, are you using spot instances?
Hey, sorry for the late reply. Yes, we are using spot instances. Unfortunately I don't have those logs anymore. I think the worker wasn't down but the raylet crashed. We experience problems with larger workloads; it might have been due to us not setting ulimit on the worker nodes, which themselves spawn new tasks, but it is really hard to dig through the logs. Whatever caused the issue, I think Ray should be robust when nodes go down or raylets crash.
Per Triage Sync: @dirkweissenborn Can you please share more details for repro? Are you still seeing the issue?
Hey, the issue is gone now. It was happening because workers died from OOM, and we fixed the OOM issue. In any case, it feels like the autoscaler should be robust to these 404s.
What happened + What you expected to happen
The autoscaler crashes due to a missing node and doesn't recover; it stays in the following update loop forever. The problem is that there is an updater dict from node_id to thread, and if an error occurs the node is not dropped from that dict. There are many places where the internal_ip call could fail, and I hope there are not more of these possible pitfalls, as they are really challenging to reproduce. These errors have made the overall experience with Ray really painful a couple of times now (e.g., #22537).
Versions / Dependencies
2.0
Reproduction script
Hard to reproduce, but it is probably almost the same as #22537; a toy sketch of the failure mode follows.
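Since the real conditions are hard to reproduce, here is a minimal, self-contained sketch (it does not use Ray's internals; the provider, dict, and function names are made up) of the failure mode described above: if a failed IP lookup does not remove the node from the updaters dict, the loop below spins forever, while with the cleanup in place it terminates.

```python
import threading

# Toy model of the stuck-update loop: a fake provider whose internal_ip lookup
# fails for a node that has disappeared (e.g. a preempted spot instance).
class FakeProvider:
    def internal_ip(self, node_id):
        if node_id == "gone-node":
            raise RuntimeError("404: instance not found")
        return "10.0.0.1"

def drain_updates(provider, updaters, drop_on_failure):
    # Keep polling until every updater has been accounted for.
    while updaters:
        for node_id in list(updaters):
            try:
                provider.internal_ip(node_id)
                updaters.pop(node_id)       # completed update
            except RuntimeError:
                if drop_on_failure:
                    updaters.pop(node_id)   # failed update, but still removed
                # else: the node stays in the dict and the loop never exits

updaters = {"ok-node": threading.Thread(), "gone-node": threading.Thread()}
drain_updates(FakeProvider(), updaters, drop_on_failure=True)  # terminates
print("drained:", updaters)  # {}
```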
Issue Severity
Medium: It is a significant difficulty but I can work around it.