ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Bug] "ray up" on GCP is still not working #23361

Closed: infwinston closed this issue 2 years ago

infwinston commented 2 years ago

Search before asking

Ray Component

Ray Clusters

Issue Severity

Medium: It contributes significant difficulty to completing my task, but I can work around it and get it resolved.

What happened + What you expected to happen

ray up failed when launching a head node on GCP during file_mounts processing. This bug was discussed in https://github.com/ray-project/ray/issues/16539 but it seems it has not been fixed yet.

  [2/7] Processing file mounts
    Running `mkdir -p ~/ray`
      Full command is `ssh -tt -i /home/eecs/weichiang/.ssh/ray-autoscaler_gcp_us-west1_intercloud-320520_gcpuser_0.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_2a2051cb7c/dc43e863c1/%C -o ControlPersist=10s -o ConnectTimeout=120s gcpuser@34.82.129.148 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (mkdir -p ~/ray)'`
Shared connection to 34.82.129.148 closed.
    Running `rsync --rsh ssh -i /home/eecs/weichiang/.ssh/ray-autoscaler_gcp_us-west1_intercloud-320520_gcpuser_0.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_2a2051cb7c/dc43e863c1/%C -o ControlPersist=10s -o ConnectTimeout=120s -avz /home/eecs/weichiang/repos/ray/ gcpuser@34.82.129.148:~/ray/`
sending incremental file list

...

doc/source/rllib/images/rllib-stack.png
doc/source/rllib/images/rllib-stack.svg
rsync: [sender] write error: Broken pipe (32)
rsync error: unexplained error (code 255) at io.c(829) [sender=3.1.2]
2022-03-19 17:38:10,320 INFO node.py:295 -- wait_for_compute_zone_operation: Waiting for operation operation-1647736690088-5da9b98189a21-c3f70f37-1d4cd085 to finish...
2022-03-19 17:38:15,588 INFO node.py:307 -- wait_for_compute_zone_operation: Operation operation-1647736690088-5da9b98189a21-c3f70f37-1d4cd085 finished.
  New status: update-failed
  !!!
  {'message': 'SSH command failed.'}
  SSH command failed.
  !!!

  Failed to setup head node.

This error can be reproduced when file_mounts contains a large number of files (such as a Ray repository). A possible reason is that the SSH connection to the VM is unstable during the first few minutes after it is launched. Because the Ray Autoscaler does not handle such SSH connection failures, the whole process fails. I wonder if it would be possible for Ray to retry in this situation?
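
To illustrate the kind of error handling I have in mind, here is a minimal sketch (not Ray's actual autoscaler code; the function name, retry count, and delay are placeholder assumptions) that re-runs a transiently failing rsync command a few times before giving up:

import subprocess
import time

def run_rsync_with_retries(cmd, max_attempts=5, delay_s=10):
    # Retry the rsync/SSH command a few times before giving up, since the
    # VM may still be booting when the first file sync is attempted.
    for attempt in range(1, max_attempts + 1):
        result = subprocess.run(cmd)
        if result.returncode == 0:
            return
        print(f"rsync attempt {attempt} failed with exit code "
              f"{result.returncode}; retrying in {delay_s}s...")
        time.sleep(delay_s)
    raise RuntimeError(f"rsync failed after {max_attempts} attempts")

# Hypothetical invocation mirroring the failing command in the log above
# (the real --rsh string is the long ssh command assembled by the autoscaler):
# run_rsync_with_retries([
#     "rsync", "--rsh", "ssh -i <key> -o ConnectTimeout=120s ...",
#     "-avz", "~/repos/ray/", "gcpuser@34.82.129.148:~/ray/",
# ])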

Versions / Dependencies

ray==1.9.2, python==3.8.11, on Ubuntu 18.04.

Reproduction script

The YAML file I used for ray up:

cluster_name: minimal

max_workers: 1

provider:
    type: gcp
    region: us-west1
    availability_zone: us-west1-a
    project_id: XXX # Globally unique project id
    cache_stopped_nodes: true

auth:
    ssh_user: gcpuser

available_node_types:
  ray_head_default:
    resources: {}
    node_config:
      machineType: n1-highmem-8
      disks:
      - boot: true
        autoDelete: true
        type: PERSISTENT
        initializeParams:
          diskSizeGb: 256
          sourceImage: projects/deeplearning-platform-release/global/images/family/common-cpu-ubuntu-2004
      scheduling:
      - onHostMaintenance: TERMINATE

head_node_type: ray_head_default

file_mounts:
  ~/ray: ~/repos/ray

file_mounts_sync_continuously: false

Anything else

The problem happens very frequently to me.

Are you willing to submit a PR?

DmitriGekhtman commented 2 years ago

Thanks for raising this, @infwinston. Reopened https://github.com/ray-project/ray/issues/16539. Closing this one to deduplicate.