ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[autoscaler] 'ray up' disconnects when setting up head node on AWS #15893

Open · jaredvann opened this issue 3 years ago

jaredvann commented 3 years ago

What is the problem?

Ray 1.3.0, Python 3.8.5, Ubuntu 20.04

Running ray up to deploy a cluster to AWS starts a head node but disconnects during pulling of Docker images.

Repeating multiple times with new clusters leads to the same result.

➜  cf ray up -y rayaws.yaml
/home/ubuntu/.local/lib/python3.8/site-packages/ray/autoscaler/_private/cli_logger.py:57: FutureWarning: Not all Ray CLI dependencies were found. In Ray 1.4+, the Ray CLI, autoscaler, and dashboard will only be usable via `pip install 'ray[default]'`. Please update your install command.
  warnings.warn(
Cluster: test-cluster-2

Checking AWS environment settings
AWS config
  IAM Profile: ray-autoscaler-v1 [default]
  EC2 Key pair (head & workers): ray-autoscaler_us-east-1 [default]
  VPC Subnets (head & workers): subnet-1edc8411, subnet-6dbd0853, subnet-cfbc8a85, subnet-fc3458d2, subnet-3d9bfd5a, subnet-4bec8f17 [default]
  EC2 Security groups (head & workers): sg-03ac5fedb90eae533 [default]
  EC2 AMI (head & workers): ami-029510cec6d69f121 [dlami]

No head node found. Launching a new cluster. Confirm [y/N]: y [automatic, due to --yes]

Acquiring an up-to-date head node
  Launched 1 nodes [subnet_id=subnet-1edc8411]
    Launched instance i-03f1c45a64baa5a23 [state=pending, info=pending]
  Launched a new head node
  Fetching the new head node

<1/1> Setting up head node
  Prepared bootstrap config
  New status: waiting-for-ssh
  [1/7] Waiting for SSH to become available
    Running `uptime` as a test.
    Waiting for IP
      Not yet available, retrying in 5 seconds
      Received: 44.192.14.51
ssh: connect to host 44.192.14.51 port 22: Connection timed out
    SSH still not available (SSH command failed.), retrying in 5 seconds.
ssh: connect to host 44.192.14.51 port 22: Connection timed out
    SSH still not available (SSH command failed.), retrying in 5 seconds.
ssh: connect to host 44.192.14.51 port 22: Connection timed out
    SSH still not available (SSH command failed.), retrying in 5 seconds.
ssh: connect to host 44.192.14.51 port 22: Connection refused
    SSH still not available (SSH command failed.), retrying in 5 seconds.
Warning: Permanently added '44.192.14.51' (ECDSA) to the list of known hosts.
 23:08:03 up 0 min,  1 user,  load average: 1.95, 0.48, 0.16
Shared connection to 44.192.14.51 closed.
    Success.
  Updating cluster configuration. [hash=8f11194d3af8fc498111a245daa5d7935b9f08a9]
  New status: syncing-files
  [2/7] Processing file mounts
Shared connection to 44.192.14.51 closed.
Shared connection to 44.192.14.51 closed.
  [3/7] No worker file mounts to sync
  New status: setting-up
  [4/7] No initialization commands to run.
  [5/7] Initalizing command runner
Shared connection to 44.192.14.51 closed.
latest: Pulling from rayproject/ray-ml
345e3491a907: Pull complete 
57671312ef6f: Pull complete 
5e9250ddb7d0: Pull complete 
eb719956b105: Pull complete 
8cd8a3afa23c: Pull complete 
cd3cf60c1c0b: Pull complete 
11f3db4e7797: Extracting [======================>                            ]  85.79MB/192.1MB
da18343c65c1: Download complete 
2b0be91b4d3a: Download complete 
e6e0d3edc4b8: Download complete 
3ac0c76e7582: Download complete 
f42078a7599c: Download complete 
294cc332726f: Download complete 
1a551f022b27: Downloading [=====================================>             ]  2.484GB/3.355GB
Shared connection to 44.192.14.51 closed.
  New status: update-failed
  !!!
  SSH command failed.
  !!!

  Failed to setup head node.

Reproduction

Config file:

cluster_name: test-cluster-2

max_workers: 2

docker:
    image: "rayproject/ray-ml:latest"
    container_name: "ray_container"

provider:
    type: aws
    region: us-east-1

available_node_types:
    ray.head.default:
        min_workers: 0
        max_workers: 0
        node_config:
            InstanceType: m5.large

    ray.worker.default:
        min_workers: 0
        node_config:
            InstanceType: m5.large
            InstanceMarketOptions:
                MarketType: spot

setup_commands: []

yoavg commented 3 years ago

We observe a similar failure mode on GCP as well, with large-ish Docker images. (This may be related to the 120-second timeout in the command-running component (SSHCommandRunner / DockerCommandRunner).)

jaredvann commented 3 years ago

The issue is still occurring with Ray 1.4.0.

richardliaw commented 3 years ago

cc @ijrsvt can you address this?

ijrsvt commented 3 years ago

@yoavg the 120s timeout only applies to the connection timing out while it is being established, not to the established connection itself.

ijrsvt commented 3 years ago

@jaredvann Can you rerun this with the -vvvv flag? I am not able to reproduce this locally!

jaredvann commented 3 years ago

Unfortunately the flag doesn't appear to display any additional error information.

Something to note is I am trying to deploy this cluster from another AWS EC2 machine. Could this have any bearing on the issue?

ray up -y cfg.yml -vvvv
Cluster: jared-cluster3

Checking AWS environment settings
Created new security group ray-autoscaler-jared-cluster3 [id=sg-0246f6c4777c84307]
AWS config
  IAM Profile: ray-autoscaler-v1 [default]
  EC2 Key pair (all available node types): ray-autoscaler_us-east-2 [default]
  VPC Subnets (all available node types): subnet-03d7d5d51e9ce760f, subnet-0b72f759287232616, subnet-099787f35f4a49cbf [default]
  EC2 Security groups (all available node types): sg-0246f6c4777c84307 [default]
  EC2 AMI (all available node types): ami-08bf49c7b3a0c761e [dlami]

No head node found. Launching a new cluster. Confirm [y/N]: y [automatic, due to --yes]

Acquiring an up-to-date head node
  Launched 1 nodes [subnet_id=subnet-03d7d5d51e9ce760f]
    Launched instance i-0d1985362e724658e [state=pending, info=pending]
  Launched a new head node
  Fetching the new head node

<1/1> Setting up head node
  Prepared bootstrap config
  New status: waiting-for-ssh
  [1/7] Waiting for SSH to become available
    Running `uptime` as a test.
    Waiting for IP
      Not yet available, retrying in 5 seconds
      Received: 3.142.221.13
    Running `uptime`
      Full command is `ssh -tt -i /home/ubuntu/.ssh/ray-autoscaler_us-east-2.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_1d41c853af/4817a4d972/%C -o ControlPersist=10s -o ConnectTimeout=5s ubuntu@3.142.221.13 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (uptime)'`
ssh: connect to host 3.142.221.13 port 22: Connection timed out
    SSH still not available (SSH command failed.), retrying in 5 seconds.
    Running `uptime`
      Full command is `ssh -tt -i /home/ubuntu/.ssh/ray-autoscaler_us-east-2.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_1d41c853af/4817a4d972/%C -o ControlPersist=10s -o ConnectTimeout=5s ubuntu@3.142.221.13 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (uptime)'`
ssh: connect to host 3.142.221.13 port 22: Connection timed out
    SSH still not available (SSH command failed.), retrying in 5 seconds.
    Running `uptime`
      Full command is `ssh -tt -i /home/ubuntu/.ssh/ray-autoscaler_us-east-2.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_1d41c853af/4817a4d972/%C -o ControlPersist=10s -o ConnectTimeout=5s ubuntu@3.142.221.13 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (uptime)'`
ssh: connect to host 3.142.221.13 port 22: Connection timed out
    SSH still not available (SSH command failed.), retrying in 5 seconds.
    Running `uptime`
      Full command is `ssh -tt -i /home/ubuntu/.ssh/ray-autoscaler_us-east-2.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_1d41c853af/4817a4d972/%C -o ControlPersist=10s -o ConnectTimeout=5s ubuntu@3.142.221.13 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (uptime)'`
ssh: connect to host 3.142.221.13 port 22: Connection timed out
    SSH still not available (SSH command failed.), retrying in 5 seconds.
    Running `uptime`
      Full command is `ssh -tt -i /home/ubuntu/.ssh/ray-autoscaler_us-east-2.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_1d41c853af/4817a4d972/%C -o ControlPersist=10s -o ConnectTimeout=5s ubuntu@3.142.221.13 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (uptime)'`
ssh: connect to host 3.142.221.13 port 22: Connection refused
    SSH still not available (SSH command failed.), retrying in 5 seconds.
    Running `uptime`
      Full command is `ssh -tt -i /home/ubuntu/.ssh/ray-autoscaler_us-east-2.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_1d41c853af/4817a4d972/%C -o ControlPersist=10s -o ConnectTimeout=5s ubuntu@3.142.221.13 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (uptime)'`
Warning: Permanently added '3.142.221.13' (ECDSA) to the list of known hosts.
 10:57:52 up 0 min,  1 user,  load average: 2.79, 0.72, 0.24
Shared connection to 3.142.221.13 closed.
    Success.
  Updating cluster configuration. [hash=2ce54bb3a89cdab81fd95a6b11f97c53cc1abe0f]
  New status: syncing-files
  [2/7] Processing file mounts
    Running `mkdir -p /tmp/ray_tmp_mount/jared-cluster3/~ && chown -R ubuntu /tmp/ray_tmp_mount/jared-cluster3/~`
      Full command is `ssh -tt -i /home/ubuntu/.ssh/ray-autoscaler_us-east-2.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_1d41c853af/4817a4d972/%C -o ControlPersist=10s -o ConnectTimeout=120s ubuntu@3.142.221.13 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (mkdir -p /tmp/ray_tmp_mount/jared-cluster3/~ && chown -R ubuntu /tmp/ray_tmp_mount/jared-cluster3/~)'`
Shared connection to 3.142.221.13 closed.
    Running `rsync --rsh ssh -i /home/ubuntu/.ssh/ray-autoscaler_us-east-2.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_1d41c853af/4817a4d972/%C -o ControlPersist=10s -o ConnectTimeout=120s -avz /tmp/ray-bootstrap-oswznkvt ubuntu@3.142.221.13:/tmp/ray_tmp_mount/jared-cluster3/~/ray_bootstrap_config.yaml`
sending incremental file list
ray-bootstrap-oswznkvt

sent 894 bytes  received 35 bytes  1,858.00 bytes/sec
total size is 1,860  speedup is 2.00
    Running `docker inspect -f '{{.State.Running}}' ray_container || true`
      Full command is `ssh -tt -i /home/ubuntu/.ssh/ray-autoscaler_us-east-2.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_1d41c853af/4817a4d972/%C -o ControlPersist=10s -o ConnectTimeout=120s ubuntu@3.142.221.13 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (docker inspect -f '"'"'{{.State.Running}}'"'"' ray_container || true)'`
Shared connection to 3.142.221.13 closed.
    `rsync`ed /tmp/ray-bootstrap-oswznkvt (local) to ~/ray_bootstrap_config.yaml (remote)
    ~/ray_bootstrap_config.yaml from /tmp/ray-bootstrap-oswznkvt
    Running `mkdir -p /tmp/ray_tmp_mount/jared-cluster3/~ && chown -R ubuntu /tmp/ray_tmp_mount/jared-cluster3/~`
      Full command is `ssh -tt -i /home/ubuntu/.ssh/ray-autoscaler_us-east-2.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_1d41c853af/4817a4d972/%C -o ControlPersist=10s -o ConnectTimeout=120s ubuntu@3.142.221.13 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (mkdir -p /tmp/ray_tmp_mount/jared-cluster3/~ && chown -R ubuntu /tmp/ray_tmp_mount/jared-cluster3/~)'`
Shared connection to 3.142.221.13 closed.
    Running `rsync --rsh ssh -i /home/ubuntu/.ssh/ray-autoscaler_us-east-2.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_1d41c853af/4817a4d972/%C -o ControlPersist=10s -o ConnectTimeout=120s -avz /home/ubuntu/.ssh/ray-autoscaler_us-east-2.pem ubuntu@3.142.221.13:/tmp/ray_tmp_mount/jared-cluster3/~/ray_bootstrap_key.pem`
sending incremental file list
ray-autoscaler_us-east-2.pem

sent 1,414 bytes  received 35 bytes  2,898.00 bytes/sec
total size is 1,674  speedup is 1.16
    Running `docker inspect -f '{{.State.Running}}' ray_container || true`
      Full command is `ssh -tt -i /home/ubuntu/.ssh/ray-autoscaler_us-east-2.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_1d41c853af/4817a4d972/%C -o ControlPersist=10s -o ConnectTimeout=120s ubuntu@3.142.221.13 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (docker inspect -f '"'"'{{.State.Running}}'"'"' ray_container || true)'`
Shared connection to 3.142.221.13 closed.
    `rsync`ed /home/ubuntu/.ssh/ray-autoscaler_us-east-2.pem (local) to ~/ray_bootstrap_key.pem (remote)
    ~/ray_bootstrap_key.pem from /home/ubuntu/.ssh/ray-autoscaler_us-east-2.pem
  [3/7] No worker file mounts to sync
  New status: setting-up
  [4/7] No initialization commands to run.
  [5/7] Initalizing command runner
    Running `command -v docker || echo 'NoExist'`
      Full command is `ssh -tt -i /home/ubuntu/.ssh/ray-autoscaler_us-east-2.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_1d41c853af/4817a4d972/%C -o ControlPersist=10s -o ConnectTimeout=120s ubuntu@3.142.221.13 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (command -v docker || echo '"'"'NoExist'"'"')'`
Shared connection to 3.142.221.13 closed.
    Running `docker pull rayproject/ray-ml:latest`
      Full command is `ssh -tt -i /home/ubuntu/.ssh/ray-autoscaler_us-east-2.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_1d41c853af/4817a4d972/%C -o ControlPersist=10s -o ConnectTimeout=120s ubuntu@3.142.221.13 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (docker pull rayproject/ray-ml:latest)'`
latest: Pulling from rayproject/ray-ml
c549ccf8d472: Pull complete 
4037933347b0: Pull complete 
3893765d06c1: Pull complete 
80dd2646b97c: Pull complete 
2d7a8f4b8186: Pull complete 
14991ca1c4f1: Pull complete 
8ebf2c29d653: Pull complete 
31aec75da759: Pull complete 
24442a832953: Pull complete 
d91380edd589: Pull complete 
eeac7b3a6ee2: Pull complete 
171c6f068f3d: Downloading [=========================================>         ]  2.899GB/3.484GB
Shared connection to 3.142.221.13 closed.
  New status: update-failed
  !!!
  {'message': 'SSH command failed.'}
  SSH command failed.
  !!!

  Failed to setup head node.

ijrsvt commented 3 years ago

@jaredvann I don't think so :/ (I tried repro'ing from an EC2 machine). Do you know if it stalls and then stops? That is, does it freeze at:

171c6f068f3d: Downloading [=========================================>         ]  2.899GB/3.484GB

and then close, or does it just close abruptly?

jaredvann commented 3 years ago

The error still occurs with version 1.5.1.

@ijrsvt the download bar freezes for about 10-15 seconds before the error message is shown and the process exits.

ijrsvt commented 3 years ago

@jaredvann Hmm, not really sure why this keeps breaking for you. Could you try adding the following initialization_commands entry?

initialization_commands:
  - docker pull rayproject/ray-ml:latest

jaredvann commented 3 years ago

@ijrsvt no change.

AmeerHajAli commented 3 years ago

@jaredvann, I ran the same repro you provided and everything worked fine. Maybe try a smaller image (rayproject/ray:latest)?
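
For reference, trying the smaller image just means swapping the image field in the docker section of the repro config above, e.g. (a sketch reusing the original container_name):

docker:
    image: "rayproject/ray:latest"
    container_name: "ray_container"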

  [5/7] Initalizing command runner
Shared connection to 107.23.178.9 closed.
latest: Pulling from rayproject/ray-ml
c549ccf8d472: Pull complete 
4037933347b0: Pull complete 
3893765d06c1: Pull complete 
80dd2646b97c: Pull complete 
2d7a8f4b8186: Pull complete 
14991ca1c4f1: Pull complete 
8ebf2c29d653: Pull complete 
31aec75da759: Pull complete 
24442a832953: Pull complete 
d91380edd589: Pull complete 
eeac7b3a6ee2: Pull complete 
171c6f068f3d: Pull complete 
Digest: sha256:4cd2e4233ba1891a0e907b33e337b6818c70806b5e4e75916a2d2fd7532ba86d
Status: Downloaded newer image for rayproject/ray-ml:latest
docker.io/rayproject/ray-ml:latest
Shared connection to 107.23.178.9 closed.
Shared connection to 107.23.178.9 closed.
Shared connection to 107.23.178.9 closed.
Shared connection to 107.23.178.9 closed.
Shared connection to 107.23.178.9 closed.
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

Shared connection to 107.23.178.9 closed.
2021-08-13 03:54:21,377 WARNING command_runner.py:904 -- Nvidia Container Runtime is present, but no GPUs found.
dc9e6e8ce6ac404221e89d56d2a7d131cf5ceca5273705658876099f200ebf43
Shared connection to 107.23.178.9 closed.
Shared connection to 107.23.178.9 closed.
Shared connection to 107.23.178.9 closed.
Shared connection to 107.23.178.9 closed.
Shared connection to 107.23.178.9 closed.
Shared connection to 107.23.178.9 closed.
  [6/7] Running setup commands
    (0/1) pip install 'boto3>=1.4.8'
Requirement already satisfied: boto3>=1.4.8 in ./anaconda3/lib/python3.7/site-packages (1.4.8)
Requirement already satisfied: s3transfer<0.2.0,>=0.1.10 in ./anaconda3/lib/python3.7/site-packages (from boto3>=1.4.8) (0.1.13)
Requirement already satisfied: jmespath<1.0.0,>=0.7.1 in ./anaconda3/lib/python3.7/site-packages (from boto3>=1.4.8) (0.10.0)
Requirement already satisfied: botocore<1.9.0,>=1.8.0 in ./anaconda3/lib/python3.7/site-packages (from boto3>=1.4.8) (1.8.50)
Requirement already satisfied: python-dateutil<3.0.0,>=2.1 in ./anaconda3/lib/python3.7/site-packages (from botocore<1.9.0,>=1.8.0->boto3>=1.4.8) (2.8.1)
Requirement already satisfied: docutils>=0.10 in ./anaconda3/lib/python3.7/site-packages (from botocore<1.9.0,>=1.8.0->boto3>=1.4.8) (0.17.1)
Requirement already satisfied: six>=1.5 in ./anaconda3/lib/python3.7/site-packages (from python-dateutil<3.0.0,>=2.1->botocore<1.9.0,>=1.8.0->boto3>=1.4.8) (1.15.0)
WARNING: You are using pip version 21.1.3; however, version 21.2.4 is available.
You should consider upgrading via the '/home/ray/anaconda3/bin/python -m pip install --upgrade pip' command.
Shared connection to 107.23.178.9 closed.
  [7/7] Starting the Ray runtime
Did not find any active Ray processes.
Shared connection to 107.23.178.9 closed.
Local node IP: 172.31.21.89
2021-08-12 17:54:53,223 INFO services.py:1274 -- View the Ray dashboard at http://127.0.0.1:8265

richardliaw commented 3 years ago

ssh -tt -i /home/ubuntu/.ssh/ray-autoscaler_us-east-2.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_1d41c853af/4817a4d972/%C -o ControlPersist=10s -o ConnectTimeout=120s ubuntu@3.142.221.13 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (docker pull rayproject/ray-ml:latest

is the command that fails, right? Maybe you can run it with ssh -vv -tt -i /home/ubuntu/.ssh/ray-autoscaler_us-east-2.pem ...; this may help in diagnosing the output.
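
Concretely, re-running that step by hand with verbose SSH output might look like the sketch below (options, key path, and IP copied from the full command logged above; the control-socket options and the bash --login wrapper are dropped for readability):

ssh -vv -tt -i /home/ubuntu/.ssh/ray-autoscaler_us-east-2.pem \
    -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null \
    -o IdentitiesOnly=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 \
    -o ConnectTimeout=120s \
    ubuntu@3.142.221.13 'docker pull rayproject/ray-ml:latest'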

ijrsvt commented 3 years ago

@jaredvann were you able to try out @richardliaw 's suggestion?

mataney commented 2 years ago

This still happens in version 1.7.1 on GCP, using image: "rayproject/ray:latest-gpu" as well.

Any help?

smacpher commented 2 years ago

I'm also encountering this with Ray version 1.10.0 on GCP using image: "rayproject/ray:1.10.0-py38-cpu". Looks like I'm getting an unexpected EOF towards the end of the docker pull step, while it's extracting downloads:

... etc.
345decc673e1: Extracting  1.093GB/2.187GB
c033e59f74d1: Download complete
unexpected EOF
Shared connection to <ip> closed.
2022-03-09 08:55:44,955 INFO node.py:295 -- wait_for_compute_zone_operation: Waiting for operation oper
ation-1646844944679-5d9cbf7ed82ee-ecef8ae4-f6858e65 to finish...
2022-03-09 08:55:50,293 INFO node.py:307 -- wait_for_compute_zone_operation: Operation operation-164684
4944679-5d9cbf7ed82ee-ecef8ae4-f6858e65 finished.
  New status: update-failed
  !!!
  SSH command failed.
  !!!

Thanks in advance for your time!

AmeerHajAli commented 2 years ago

@wuisawesome, can you please take a look/triage when you have a chance?

wuisawesome commented 2 years ago

Upgrading to P1 since it sounds like it's affecting more users now.

@mataney @smacpher can either of you provide a full cluster YAML? Or at least the AMI/machine image you're using (or note if it's left unspecified), and the setup commands you specify, if any?

smacpher commented 2 years ago

Hi @wuisawesome, thank you for taking a look at this issue. Here's my cluster YAML, with some internal things like project ID and code paths redacted:

cluster_name: test
max_workers: 2

upscaling_speed: 1.0

docker:
  image: <custom image that inherits from rayproject/ray:1.10.0-py38-cpu>
  container_name: "ray_container"
  pull_before_run: True
  run_options:
    - --ulimit nofile=65536:65536

idle_timeout_minutes: 5

provider:
  type: gcp
  region: us-west1
  availability_zone: us-west1-a
  project_id: <project-id>

auth:
  ssh_user: ubuntu

available_node_types:
  ray_head_default:
    resources: { "CPU": 4 }
    node_config:
      machineType: n1-standard-4
      disks:
        - boot: true
          autoDelete: true
          type: PERSISTENT
          initializeParams:
            diskSizeGb: 50
            sourceImage: projects/deeplearning-platform-release/global/images/family/common-cpu

head_node_type: ray_head_default

file_mounts:
  { <path/to/my_first_part_code>: <path/to/my_first_part_code> }

rsync_exclude:
  - "**/.git"
  - "**/.git/**"

rsync_filter:
  - ".gitignore"

initialization_commands: []

setup_commands:
  - export PYTHONPATH=$PYTHONPATH:<path/to/my_first_part_code>

head_setup_commands:
  - pip install google-api-python-client==1.7.8

worker_setup_commands: []

head_start_ray_commands:
  - ray stop
  - >-
    ray start
    --head
    --port=6379
    --dashboard-port 8625
    --object-manager-port=8076
    --autoscaling-config=~/ray_bootstrap_config.yaml

worker_start_ray_commands:
  - ray stop
  - >-
    ray start
    --address=$RAY_HEAD_IP:6379
    --object-manager-port=8076

head_node: {}
worker_nodes: {}

I've noticed that this was happening more often for me on smaller machine types (e.g. n1-standard-2), and also more often when using my custom image. I suspect it may have to do with my image having a couple of large layers (~2 GB each). It cuts off during the extraction phase (345decc673e1: Extracting 1.418GB/2.187GB), then I see an unexpected EOF and things fail.

When I bumped up to n1-standard-4 and larger, there were no more failures during the docker pull step, but I started running into a new error, which I can consistently reproduce using the exact YAML I attached:

... etc.

  [6/7] Running setup commands
    (0/2) export PYTHONPATH=$PYTHONPATH:...
Shared connection to 34.82.212.231 closed.
    (1/2) pip install google-api-python-...
Requirement already satisfied: google-api-python-client==1.7.8 in ./anaconda3/lib/python3.8/site-packages (1.7.8)
Requirement already satisfied: uritemplate<4dev,>=3.0.0 in ./anaconda3/lib/python3.8/site-packages (from google-api-python-client==1.7.8) (3.0.1)
Requirement already satisfied: google-auth-httplib2>=0.0.3 in ./anaconda3/lib/python3.8/site-packages (from google-api-python-client==1.7.8) (0.1.0)
Requirement already satisfied: google-auth>=1.4.1 in ./anaconda3/lib/python3.8/site-packages (from google-api-python-client==1.7.8) (2.4.1)
Requirement already satisfied: httplib2<1dev,>=0.9.2 in ./anaconda3/lib/python3.8/site-packages (from google-api-python-client==1.7.8) (0.20.2)
Requirement already satisfied: six<2dev,>=1.6.1 in ./anaconda3/lib/python3.8/site-packages (from google-api-python-client==1.7.8) (1.13.0)
Requirement already satisfied: cachetools<6.0,>=2.0.0 in ./anaconda3/lib/python3.8/site-packages (from google-auth>=1.4.1->google-api-python-client==1.7.8) (5.0.0)
Requirement already satisfied: pyasn1-modules>=0.2.1 in ./anaconda3/lib/python3.8/site-packages (from google-auth>=1.4.1->google-api-python-client==1.7.8) (0.2.8)
Requirement already satisfied: rsa<5,>=3.1.4 in ./anaconda3/lib/python3.8/site-packages (from google-auth>=1.4.1->google-api-python-client==1.7.8) (4.8)
Requirement already satisfied: pyparsing!=3.0.0,!=3.0.1,!=3.0.2,!=3.0.3,<4,>=2.4.2 in ./anaconda3/lib/python3.8/site-packages (from httplib2<1dev,>=0.9.2->google-api-python-client==1.7.8) (3.0.7)
Requirement already satisfied: pyasn1<0.5.0,>=0.4.6 in ./anaconda3/lib/python3.8/site-packages (from pyasn1-modules>=0.2.1->google-auth>=1.4.1->google-api-python-client==1.7.8) (0.4.8)
Shared connection to 34.82.212.231 closed.
  [7/7] Starting the Ray runtime
Did not find any active Ray processes.
Shared connection to 34.82.212.231 closed.
Error: No such container: ray_container
Shared connection to 34.82.212.231 closed.
2022-03-11 09:24:29,027 INFO node.py:295 -- wait_for_compute_zone_operation: Waiting for operation operation-1647019468732-5d9f49a5f36f9-14e9d270-ed9bb0d9 to finish...
2022-03-11 09:24:34,310 INFO node.py:307 -- wait_for_compute_zone_operation: Operation operation-1647019468732-5d9f49a5f36f9-14e9d270-ed9bb0d9 finished.
  New status: update-failed
  !!!
  SSH command failed.
  !!!

Note the error message: Error: No such container: ray_container. Is it possible there's some sort of race condition where the setup commands start before the container has finished starting? (I may be way off here, I haven't dug into the autoscaler code yet.)

When I run ray up config.yaml -y again, it succeeds, and much faster since most steps were already run by my first attempt.

This also happens on the head node when trying to use the autoscaler to add more nodes. Sometimes after retrying it succeeds, but it's inconsistent.

Thanks again for your time! And let me know if I can help in any way.

melonipoika commented 2 years ago

I think I am having the same issue on GCP:

Cluster: gpu-docker-a100-jose

2022-03-16 10:38:15,706 INFO util.py:278 -- setting max workers for head node type to 0
Checking GCP environment settings
2022-03-16 10:38:16,374 INFO config.py:451 -- _configure_key_pair: Private key not specified in config, using /home/jupyter/.ssh/ray-autoscaler_gcp_us-central1_<project_id>_ubuntu_0.pem
No head node found. Launching a new cluster. Confirm [y/N]: y [automatic, due to --yes]

Acquiring an up-to-date head node
2022-03-16 10:38:18,497 INFO node.py:295 -- wait_for_compute_zone_operation: Waiting for operation operation-1647427096700-5da5382e4653d-ff584f4b-70b16514 to finish...
2022-03-16 10:38:29,097 INFO node.py:307 -- wait_for_compute_zone_operation: Operation operation-1647427096700-5da5382e4653d-ff584f4b-70b16514 finished.
  Launched a new head node
  Fetching the new head node

<1/1> Setting up head node
  Prepared bootstrap config
2022-03-16 10:38:29,934 INFO node.py:295 -- wait_for_compute_zone_operation: Waiting for operation operation-1647427109482-5da5383a77032-355f78e7-90f5d570 to finish...
2022-03-16 10:38:35,350 INFO node.py:307 -- wait_for_compute_zone_operation: Operation operation-1647427109482-5da5383a77032-355f78e7-90f5d570 finished.
  New status: waiting-for-ssh
  [1/7] Waiting for SSH to become available
    Running `uptime` as a test.
    Fetched IP: 104.154.73.96
ssh: connect to host 104.154.73.96 port 22: Connection refused
    SSH still not available (SSH command failed.), retrying in 5 seconds.
ssh: connect to host 104.154.73.96 port 22: Connection refused
    SSH still not available (SSH command failed.), retrying in 5 seconds.
ssh: connect to host 104.154.73.96 port 22: Connection refused
    SSH still not available (SSH command failed.), retrying in 5 seconds.
ssh: connect to host 104.154.73.96 port 22: Connection refused
    SSH still not available (SSH command failed.), retrying in 5 seconds.
ssh: connect to host 104.154.73.96 port 22: Connection refused
    SSH still not available (SSH command failed.), retrying in 5 seconds.
ssh: connect to host 104.154.73.96 port 22: Connection refused
    SSH still not available (SSH command failed.), retrying in 5 seconds.
Warning: Permanently added '104.154.73.96' (ECDSA) to the list of known hosts.
 10:39:08 up 0 min,  1 user,  load average: 0.79, 0.19, 0.06
Shared connection to 104.154.73.96 closed.
    Success.
  Updating cluster configuration. [hash=73d374e9aed5abf00300cb49f327feb32491e4ee]
2022-03-16 10:39:08,632 INFO node.py:295 -- wait_for_compute_zone_operation: Waiting for operation operation-1647427148195-5da5385f6259b-91093f21-84dd5f09 to finish...
2022-03-16 10:39:14,054 INFO node.py:307 -- wait_for_compute_zone_operation: Operation operation-1647427148195-5da5385f6259b-91093f21-84dd5f09 finished.
  New status: syncing-files
  [2/7] Processing file mounts
Shared connection to 104.154.73.96 closed.
Shared connection to 104.154.73.96 closed.
  [3/7] No worker file mounts to sync
2022-03-16 10:39:19,821 INFO node.py:295 -- wait_for_compute_zone_operation: Waiting for operation operation-1647427159340-5da5386a036d4-d4e8ba7b-470baea2 to finish...
2022-03-16 10:39:25,204 INFO node.py:307 -- wait_for_compute_zone_operation: Operation operation-1647427159340-5da5386a036d4-d4e8ba7b-470baea2 finished.
  New status: setting-up
  [4/7] Running initialization commands
Warning: Permanently added '104.154.73.96' (ECDSA) to the list of known hosts.
WARNING: Your config file at [/home/ubuntu/.docker/config.json] contains these credential helper entries:

{
  "credHelpers": {
    "gcr.io": "gcloud",
    "us.gcr.io": "gcloud",
    "eu.gcr.io": "gcloud",
    "asia.gcr.io": "gcloud",
    "staging-k8s.gcr.io": "gcloud",
    "marketplace.gcr.io": "gcloud"
  }
}
Adding credentials for: us-central1-docker.pkg.dev
Docker configuration file updated.
Connection to 104.154.73.96 closed.
  [5/7] Initalizing command runner
Shared connection to 104.154.73.96 closed.
latest-gpu: Pulling from rayproject/ray-ml

55322776: Pulling fs layer 
743589ce: Pulling fs layer 
3cedd2c6: Pulling fs layer 
39d99446: Pulling fs layer 
39450a51: Pulling fs layer 
0854cdb4: Pulling fs layer 
257910db: Pulling fs layer 
9b80ddf2: Pulling fs layer 
ec0a3755: Pulling fs layer 
51c3bfb6: Pulling fs layer 
91efdf9f: Pulling fs layer 
7197939e: Pulling fs layer 
b24f58c2: Pulling fs layer 
8cc981d5: Pulling fs layer 
72d152ed: Pulling fs layer 
9065bada: Pulling fs layer 
c8e1064c: Pulling fs layer 
f7a46693: Pulling fs layer 
04d31d20: Pulling fs layer 
10cc91f4: Pulling fs layer 
eac638d3: Pulling fs layer 
53b8f2f8: Pulling fs layer 
145a2924: Pulling fs layer 
122cddae: Pulling fs layer 
6a1c067d: Pulling fs layer 
9b80ddf2: Extracting  1.001GB/1.146GB
unexpected EOF
Shared connection to 104.154.73.96 closed.
2022-03-16 10:42:24,758 INFO node.py:295 -- wait_for_compute_zone_operation: Waiting for operation operation-1647427344128-5da5391a3d926-91dd9ad3-ea09ff8f to finish...
2022-03-16 10:42:30,129 INFO node.py:307 -- wait_for_compute_zone_operation: Operation operation-1647427344128-5da5391a3d926-91dd9ad3-ea09ff8f finished.
  New status: update-failed
  !!!
  SSH command failed.
  !!!

  Failed to setup head node.

Re-running ray up cluster.yml again works fine.

oscartackstrom commented 2 years ago

Re: melonipoika

Running on GCP, we used to see this, but it went away after we added bash -c $'ps -e | grep apt | awk \'{print $1}\' | xargs tail -f --pid || true' (a "wait for the auto upgrade that might run in the background" step) as the first entry in initialization_commands. It seems that GCP runs an auto-upgrade in the background on the node, which can interfere with things like docker pull.
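
In cluster-YAML terms, that amounts to something like the following as the first entry of initialization_commands (a rough sketch: the loop below is a simplified, quoting-friendly equivalent of the one-liner above rather than a verbatim copy, and the process names it waits for are an assumption):

initialization_commands:
  # Block until GCP's background apt / unattended-upgrade activity has finished,
  # so it cannot interfere with the autoscaler's later `docker pull`.
  - "while ps -e | grep -Eq 'apt|unattended-upgr'; do sleep 10; done"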

Allgoerithm commented 2 years ago

@oscartackstrom I've added the line to the initialization commands as you suggested, but when I then run ray up, it fails because of a syntax error caused by the curly braces in {print $1}. What am I doing wrong? Just removing the curly braces fixes the syntax error, but then starting the head node fails with an SSH problem just as before.

vaxherra commented 2 years ago

Upvote from me also.

In Ray 1.12.1, while running on GCP, I am also experiencing similar errors; even with the initialization command from @oscartackstrom, it doesn't work properly.

PaulFenton commented 1 year ago

I also experienced this error but got it working by re-running. Details:

I started with the instructions to launch ray on GCP.

On first run of ray up example-full.yaml I got:

.....
Shared connection to 35.223.182.31 closed.
latest-gpu: Pulling from rayproject/ray-ml
09db6f815738: Pull complete 
d79696845ef2: Pull complete 
9cace1db9258: Pull complete 
5093a1370488: Pull complete 
affabbc9735b: Pull complete 
839e92906efc: Pull complete 
36d15b49ae4c: Pull complete 
be6750df422d: Pull complete 
02a4c72adbe9: Pull complete 
cc3c4b345c51: Pull complete 
f87c4242b0c3: Pull complete 
fa3dc8316827: Pull complete 
56af5a0d16a1: Pull complete 
e6f237e89986: Pull complete 
53d40d13bcdf: Extracting [=>                                                 ]  8.913MB/245.2MB
ef63e11c9202: Download complete 
977a683aa960: Download complete 
e71a2f98112f: Download complete 
d1835b6f204b: Download complete 
c55db8918060: Download complete 
106e9528714d: Download complete 
2309e5741e2a: Download complete 
57cf067a6500: Download complete 
115e34d38940: Download complete 
f52b4c403e2d: Download complete 
18684af57161: Download complete 
ae14a5017bf5: Download complete 
8333c436436b: Download complete 
unexpected EOF
Shared connection to 35.223.182.31 closed.
2022-10-16 08:40:57,081 INFO node.py:311 -- wait_for_compute_zone_operation: Waiting for operation operation-1665924056563-5eb262b9d4053-fc3deff7-df7b529a to finish...
2022-10-16 08:41:02,484 INFO node.py:330 -- wait_for_compute_zone_operation: Operation operation-1665924056563-5eb262b9d4053-fc3deff7-df7b529a finished.
  New status: update-failed
  !!!
  SSH command failed.
  !!!

  Failed to setup head node.

Then, after re-running with verbose debug output (ray up example-full.yaml -vvvv), it seems to have worked. The initial prompt mentioned restarting:

$ ray up example-full.yaml -vvvv
Cluster: default

2022-10-16 08:48:16,736 INFO util.py:364 -- setting max workers for head node type to 0
Loaded cached provider configuration from /tmp/ray-config-0fd8db53e13c59f92d610a668ec5757f192b7c1d
If you experience issues with the cloud provider, try re-running the command with --no-config-cache.
Updating cluster configuration and running full setup.
Cluster Ray runtime will be restarted. Confirm [y/N]: y

Eventually it was able to download the image and start things up:

--------------------
Ray runtime started.
--------------------

I wonder if it is some kind of timeout issue, and the second run was able to pick up from where the first one left off?

idantene commented 1 year ago

My Ray head node fails even earlier; after running ray up config.yaml I get:

Acquiring an up-to-date head node
  Launched 1 nodes [network_interfaces=[{'DeviceIndex': 0, 'AssociatePublicIpAddress': False, 'SubnetId': '...', 'Groups': ['...']}]]
    Launched instance i-068b76f0a85130105 [state=pending, info=pending]
  Launched a new head node
  Fetching the new head node
    Head node fetch timed out. Failed to create head node.

Presumably this is related to the hard-coded timeout of 50s here: https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/_private/commands.py#L746. I then have to run ray up config.yaml again, and it works fine.

PaulFenton commented 1 year ago

Continuing to debug this: the call that is failing is the docker pull command that runs on the head node while the cluster is initializing:

https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/_private/command_runner.py#L905-L907

I suspect running sudo dockerd --max-download-attempts 10 before Docker is started could fix this issue; I'm trying to figure out how/where to do that.

richardliaw commented 1 year ago

Maybe you can try putting it into the initialization_commands? Those run before the docker pull.
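
One way to wire that in (a sketch, not verified on this issue: it assumes the AMI runs Docker under systemd, that the installed Docker version supports max-download-attempts in /etc/docker/daemon.json, and that there is no existing daemon.json worth preserving):

initialization_commands:
  # Raise Docker's per-layer download retry limit, then restart the daemon so
  # the setting takes effect before the autoscaler runs `docker pull`.
  # Note: this overwrites any existing daemon.json.
  - echo '{"max-download-attempts":10}' | sudo tee /etc/docker/daemon.json
  - sudo systemctl restart docker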

bveeramani commented 1 year ago

cc @wuisawesome

alemaizi commented 1 year ago

I've had the same issue. The thing is, even if you rerun and it works fine for the head, the same thing happens for the worker nodes, leaving them in a constant state of "pending".

alemaizi commented 1 year ago

Re: melonipoika

Running on GCP, we used to see this, but it went away after we added bash -c $'ps -e | grep apt | awk \'{print $1}\' | xargs tail -f --pid || true' (a "wait for the auto upgrade that might run in the background" step) as the first entry in initialization_commands. It seems that GCP runs an auto-upgrade in the background on the node, which can interfere with things like docker pull.

This works like a charm on GCP. Thank you, @oscartackstrom! The only downside is that it takes a bit more time, since it has to wait for the upgrade to finish.

iirekm commented 1 year ago

2023, version 2.6.3: still the same on AWS. (Plus many, many other issues; is anybody working on ray up on AWS???)

# use: ray up -y ray-cluster-config.yaml to start it

cluster_name: test
max_workers: 16

setup_commands:
  # unattended upgrades is a mess; also problems with APT, "uninstall" manually
  - "[ -f ~/.ok ] || (while ! sudo rm /usr/bin/unattended-upgrade; do sleep 1; done)"
  - "[ -f ~/.ok ] || (sudo killall -9 unattended-upgrade || true)"  # just in case
  - touch ~/.ok
  # installed wrong version!
  - pip install "ray[default]==2.6.3"

provider:
  type: aws
  region: us-east-1
  cache_stopped_nodes: False

available_node_types:
  ray.head.default:
    node_config:
      InstanceType: m5.large
  ray.worker.default:
    max_workers: 16
    node_config:
      InstanceType: m5.large
      InstanceMarketOptions:
        MarketType: spot

anyscalesam commented 3 months ago

Adding 'core' and 'triage' to catch this in next week's weekly GH triage at Anyscale; sorry for the delay, folks.

jjyao commented 3 months ago

Hi @iirekm, are you able to use the latest Ray instead of 2.6.3? I tried

# use: ray up -y ray-cluster-config.yaml to start it

cluster_name: test
max_workers: 16

setup_commands:
  # unattended upgrades is a mess; also problems with APT, "uninstall" manually
  - "[ -f ~/.ok ] || (while ! sudo rm /usr/bin/unattended-upgrade; do sleep 1; done)"
  - "[ -f ~/.ok ] || (sudo killall -9 unattended-upgrade || true)"  # just in case
  - touch ~/.ok
  # installed wrong version!
  - pip install "ray[default]"

provider:
  type: aws
  region: us-east-1
  cache_stopped_nodes: False

available_node_types:
  ray.head.default:
    node_config:
      InstanceType: m5.large
  ray.worker.default:
    max_workers: 16
    node_config:
      InstanceType: m5.large
      InstanceMarketOptions:
        MarketType: spot

and it works.