neex opened this issue 2 years ago (status: Open)
@wuisawesome could you help triage this?
Same story as OP 😢
More details: the issue also reproduces with TPUs (just a few are enough; maybe 4-8).
UPDATE: @hongchaodeng, can you please take a look at this? rickyx@ can help with some of the context around autoscaling in general. For help with reproducing and getting an environment on GCP, please grab thomas@ so you can get a GCP sandbox to trigger spot preemptions if needed.
The issue is a known bug in the GCP provider of the cluster launcher.
The Ray autoscaler performs two primary functions: launching and setting up new worker nodes, and shutting down idle ones. The problem arises during the first of these. The cluster launcher code assumes that instances remain available once created; however, external actions such as manual termination or spot preemption break this assumption. When a preempted instance disappears, GCP starts returning a 404 error such as:

The resource 'projects/wunderfund-research/zones/europe-west1-c/instances/ray-research-worker-cbcbb628-compute' was not found

Because the cluster launcher does not handle this unexpected exception, it retries the failing operation indefinitely.
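For illustration only, here is a rough Python sketch of the kind of handling the launcher is missing when it polls an instance: a 404 for a preempted instance should be treated as "this node is gone" rather than retried forever. The project/zone/instance parameters and the helper function are made up for the example; this is not Ray's actual node provider code.

```python
# Illustrative sketch only -- not Ray's actual GCP node provider code.
# Assumes the googleapiclient package and GCP application-default credentials.
import googleapiclient.discovery
from googleapiclient.errors import HttpError

compute = googleapiclient.discovery.build("compute", "v1")


def get_instance_status(project: str, zone: str, name: str):
    """Return the instance's GCP status, or None if it no longer exists.

    The bug described above is essentially the absence of the 404 branch:
    when a preempted instance disappears, the launcher keeps retrying the
    failing call forever instead of treating the node as gone.
    """
    try:
        instance = (
            compute.instances()
            .get(project=project, zone=zone, instance=name)
            .execute()
        )
        return instance.get("status")  # e.g. "RUNNING", "TERMINATED"
    except HttpError as e:
        if e.resp.status == 404:
            # The instance was deleted or preempted externally: report it as
            # gone so a replacement can be launched instead of retrying.
            return None
        raise  # anything else is still an unexpected error
```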
This retry-forever behavior reflects the cluster launcher's design: it is intended primarily for bootstrapping and prototyping Ray projects. Note that this issue does not affect the Anyscale platform, which uses a different, proprietary autoscaler.
To avoid this problem, you may consider using Anyscale's autoscaling capabilities. Alternatively, you will need additional tooling of your own around the open-source cluster launcher to keep autoscaling working when instances are preempted.
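As a stop-gap on the open-source launcher, one possible (unofficial, untested) approach is to watch monitor.log on the head node for the repeated 404 and intervene manually, for example by re-running `ray up` against the cluster config to restart Ray on the head node. A rough sketch, with the log path taken from this issue and the threshold chosen arbitrarily:

```python
# Unofficial stop-gap sketch (an assumption, not a documented workaround):
# run on the head node to detect the retry loop by scanning monitor.log.
import re
import time
from collections import Counter
from pathlib import Path

MONITOR_LOG = Path("/tmp/ray/session_latest/logs/monitor.log")
NOT_FOUND = re.compile(r"The resource '([^']+)' was not found")


def stuck_instances(threshold: int = 20):
    """Return instance paths whose 404 error repeats at least `threshold`
    times, which suggests the autoscaler is stuck retrying on them."""
    counts = Counter(NOT_FOUND.findall(MONITOR_LOG.read_text(errors="ignore")))
    return [name for name, n in counts.items() if n >= threshold]


if __name__ == "__main__":
    while True:
        stuck = stuck_instances()
        if stuck:
            print("Autoscaler looks stuck retrying on missing instances:")
            for name in stuck:
                print("  -", name)
            # At this point, alert a human or re-run `ray up <config>` from
            # the workstation to restart the autoscaler (untested assumption).
        time.sleep(60)
```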
What happened + What you expected to happen
I use a Ray cluster on Google Cloud Platform for my tasks. One thing to note is that I use preemptible instances for workers (so Google may stop them at any time).
After a while (about 30-40 minutes of active usage), scaling stops working: no new workers come up, and no old workers are destroyed after the idle timeout (moreover, some workers are up but not initialized). I've debugged the issue down to what looks like an infinite exception-retry loop in /tmp/ray/session_latest/logs/monitor.log on the head node; the relevant part of the log shows the 404 exception quoted above repeating again and again with the same worker id ray-research-worker-cbcbb628-compute. The ray-research-worker-cbcbb628-compute instance seems to have indeed existed, but it no longer exists at the moment of the exception (so a 404 response from GCP is justified). I believe (though I'm not sure) the situation is roughly this: a worker instance was created, was then preempted by GCP, and the autoscaler keeps retrying operations on the now-missing instance instead of moving on.
The expected behavior is that the autoscaler should handle this case and continue to set up other workers, shut down idle ones, etc.
Versions / Dependencies
Google Cloud Platform is used, and preemptible instances are used for workers (see config).
Reproduction script
Config:
Script:
To reproduce the issue, you may have to submit the script to the cluster several times so that an instance shutdown is caught in the right state.
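The original config and script are not included above; the sketch below is only an assumed example of the kind of workload that forces repeated scale-up onto preemptible workers, not the reporter's actual script.

```python
# Hypothetical reproduction sketch -- the reporter's real script is not shown
# above. It submits enough CPU-bound tasks to force the autoscaler to launch
# preemptible workers, so a preemption can hit a node mid-setup.
import time

import ray

ray.init(address="auto")  # assumes the script is submitted to the cluster


@ray.remote(num_cpus=1)
def busy(seconds: float) -> float:
    """Occupy one CPU for `seconds` to keep worker nodes in demand."""
    time.sleep(seconds)
    return seconds


if __name__ == "__main__":
    # Request more CPUs than the head node provides so workers must be
    # launched; re-running this creates the scale-up/down churn described.
    results = ray.get([busy.remote(60) for _ in range(200)])
    print(f"Completed {len(results)} tasks")
```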
Issue Severity
High: It blocks me from completing my task.