kekulai-fredchang opened this issue 1 month ago (Status: Open)
Ray cluster launcher cannot recover cached stopped nodes when nodes fail (e.g. from overcapacity) on the Azure cloud end.

A cached node comes back reporting a `running` state, but there is a provisioning error. Only with `cache_stopped_nodes: False`, in which the state of the VM is guaranteed, do I see reliable Ray operation using these problematic A100, H100 workers. Otherwise, the cluster launcher persistently retries to communicate with the broken worker.
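For context, `cache_stopped_nodes` is set under the `provider` section of the cluster launcher YAML. Below is a minimal sketch of that section, assuming an Azure provider; the location, resource group, and comments are placeholders and not taken from the attached ray.yml:

```yaml
# Sketch only, not the attached ray.yml.
provider:
  type: azure
  location: westus2          # placeholder
  resource_group: my-ray-rg  # placeholder
  # Always provision fresh VMs instead of reusing stopped ones, so the
  # launcher does not keep retrying a cached node that reports `running`
  # but failed provisioning.
  cache_stopped_nodes: False
```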
What happened + What you expected to happen
When requesting resources on a Ray cluster, if the actor times out (possibly due to an error in the minimal code example), the Ray workers appear to be left in a state where they no longer respond to ssh.
1) `ray up ray.yml`
2) `ray dashboard ray.yml`
3) `seq 6 | parallel -n0 ray job submit --entrypoint-num-gpus 1 --entrypoint-num-cpus 24 --working-dir . -- nvcc --version` (this successfully loads 6 GPU workers)
4) `time ray job submit --runtime-env ./ray_runtime_env.yml --address http://localhost:8265 -- python test.py` (this is the problematic command: the worker times out and control returns to the command line; a rough sketch of this kind of workload is shown after the list)
5) Re-run the command from step 3: `seq 6 | parallel -n0 ray job submit --entrypoint-num-gpus 1 --entrypoint-num-cpus 24 --working-dir . -- nvcc --version`. Although it ran fine at step 3, this time the Ray head node gets stuck and cannot communicate with the workers.
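The actual reproduction is the attached test2.py.txt; the following is only a hypothetical minimal sketch of the kind of workload that can hit the timeout in step 4: an actor that requests a GPU, awaited by the driver under a timeout. The class name, resource numbers, and 60-second timeout are illustrative assumptions, not the contents of the attached script.

```python
# Hypothetical sketch, not the attached test2.py: a GPU actor whose result is
# awaited with a timeout, mirroring the job submitted in step 4.
import os

import ray
from ray.exceptions import GetTimeoutError

ray.init(address="auto")  # connect to the cluster started by `ray up`

@ray.remote(num_gpus=1, num_cpus=24)  # illustrative resource request
class GpuWorker:
    def which_gpu(self):
        # Report which GPU Ray assigned to this actor.
        return os.environ.get("CUDA_VISIBLE_DEVICES", "none")

actor = GpuWorker.remote()
try:
    # If scheduling or the remote call stalls, this raises GetTimeoutError and
    # the job returns to the command line, as described in step 4.
    print(ray.get(actor.which_gpu.remote(), timeout=60))
except GetTimeoutError:
    print("actor call timed out")
```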
Looking at the monitor.out file, I can see that ssh to the workers is not working. However, when I attach to the head node and ssh from there manually, I can reach the workers. (The workers also show as running fine in the Azure console.)
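For reference, a sketch of the commands used for that comparison, assuming the same ray.yml; the worker IP, ssh user, and key path are placeholders/assumptions about this cluster's config:

```bash
# Tail the autoscaler logs (monitor.out / monitor.err) from the local machine;
# this is where the failing ssh attempts to the workers show up.
ray monitor ray.yml

# Attach to the head node through the cluster launcher, then try the same ssh
# manually. 10.0.0.5 stands in for a worker's private IP from the Azure
# console; the user and key path are assumptions, not values from ray.yml.
ray attach ray.yml
ssh -i ~/ray_bootstrap_key.pem ubuntu@10.0.0.5 'nvidia-smi'
```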
Versions / Dependencies
ray 3.0.0.dev
Reproduction script
- ray.yml.txt
- ray-env2.yml.txt
- test2.py.txt
- ray_runtime_env2.yml.txt
Issue Severity
High: It blocks me from completing my task.