ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

Failed to execute cluster as described in the documentation: "No such file or directory: 'rsync'" #37823

Closed. d33tah closed this issue 1 year ago.

d33tah commented 1 year ago

What happened + What you expected to happen

I tried to set up a test Ray cluster as described in the documentation. To reproduce, create a file named Dockerfile with the following contents:

FROM python:3.10
ADD ./credentials /root/.aws/credentials
RUN pip install boto3 'ray[default]'
RUN wget https://raw.githubusercontent.com/ray-project/ray/master/python/ray/autoscaler/aws/example-full.yaml
RUN ray up example-full.yaml --yes
RUN ray attach example-full.yaml
RUN python -c 'import ray; ray.init()'
RUN ray down example-full.yaml

Then run sudo docker build . and observe the following:

Sending build context to Docker daemon   5.12kB
Step 1/8 : FROM python:3.10
 ---> 2b7ca628da40
Step 2/8 : ADD ./credentials /root/.aws/credentials
 ---> Using cache
 ---> c81d7ab743dc
Step 3/8 : RUN pip install boto3 'ray[default]'
 ---> Using cache
 ---> ac315a001393
Step 4/8 : RUN wget https://raw.githubusercontent.com/ray-project/ray/master/python/ray/autoscaler/aws/example-full.yaml
 ---> Using cache
 ---> c7eac09e5e9d
Step 5/8 : RUN ray up example-full.yaml --yes
 ---> Running in 90454572f103
2023-07-26 17:52:02,599 INFO util.py:375 -- setting max workers for head node type to 0
ssh: connect to host 52.25.176.78 port 22: Connection timed out
2023-07-26 17:52:02,582 INFO commands.py:287 -- Cluster: default
2023-07-26 17:52:02,697 INFO commands.py:364 -- Checking AWS environment settings
2023-07-26 17:52:02,701 VINFO utils.py:149 -- Creating AWS resource `ec2` in `us-west-2`
2023-07-26 17:52:18,487 VINFO utils.py:149 -- Creating AWS resource `iam` in `us-west-2`
2023-07-26 17:52:19,189 VINFO utils.py:149 -- Creating AWS resource `ec2` in `us-west-2`
2023-07-26 17:52:20,848 VINFO config.py:395 -- Creating new key pair ray-autoscaler_2_us-west-2 for use as the default.
2023-07-26 17:52:24,067 INFO config.py:123 -- AWS config
2023-07-26 17:52:24,068 INFO config.py:204 -- IAM Profile: ray-autoscaler-v1 [default]
2023-07-26 17:52:24,068 INFO config.py:159 -- EC2 Key pair (all available node types): ray-autoscaler_2_us-west-2 [default]
2023-07-26 17:52:24,068 INFO config.py:159 -- VPC Subnets (all available node types): subnet-40c4ae26, subnet-3fa42e77 [default]
2023-07-26 17:52:24,068 INFO config.py:159 -- EC2 Security groups (all available node types): sg-018a86c47e4024ee1 [default]
2023-07-26 17:52:24,068 INFO config.py:159 -- EC2 AMI (all available node types): ami-0387d929287ab193e
2023-07-26 17:52:24,068 VINFO utils.py:149 -- Creating AWS resource `ec2` in `us-west-2`
2023-07-26 17:52:24,398 INFO commands.py:694 -- Updating cluster configuration and running full setup.
2023-07-26 17:52:24,399 INFO commands.py:695 -- Cluster Ray runtime will be restarted. Confirm [y/N]: y [automatic, due to --yes]
2023-07-26 17:52:24,399 INFO usage_lib.py:407 -- Usage stats collection is enabled by default without user confirmation because this terminal is detected to be non-interactive. To disable this, add `--disable-usage-stats` to the command that starts the cluster, or run the following command: `ray disable-usage-stats` before starting the cluster. See https://docs.ray.io/en/master/cluster/usage-stats.html for more details.
2023-07-26 17:52:24,400 INFO commands.py:905 -- Currently running head node is out-of-date with cluster configuration
2023-07-26 17:52:24,400 INFO commands.py:910 -- Current hash is d392b2cd56e1b1d379c45d40726c1030c83f4914, expected c32421f77ca08d5142efeedb7ac81e3f96944cb8
2023-07-26 17:52:24,400 INFO commands.py:722 -- Acquiring an up-to-date head node
2023-07-26 17:52:24,400 INFO commands.py:727 -- Relaunching the head node. Confirm [y/N]: y [automatic, due to --yes]
2023-07-26 17:52:24,400 INFO node_provider.py:491 -- Stopping instance i-0b35b1ffc92511ae2 (to terminate instead, set `cache_stopped_nodes: False` under `provider` in the cluster configuration)
2023-07-26 17:52:24,965 INFO commands.py:730 -- Terminated head node i-0b35b1ffc92511ae2
2023-07-26 17:52:27,429 INFO node_provider.py:429 -- Launched 1 nodes [subnet_id=subnet-40c4ae26]
2023-07-26 17:52:27,429 INFO node_provider.py:443 -- Launched instance i-0b1e6c23e09444d0a [state=pending, info=pending]
2023-07-26 17:52:27,429 INFO commands.py:738 -- Launched a new head node
2023-07-26 17:52:27,430 INFO commands.py:742 -- Fetching the new head node
2023-07-26 17:52:27,781 INFO commands.py:757 -- <1/1> Setting up head node
2023-07-26 17:52:27,782 INFO commands.py:778 -- Prepared bootstrap config
2023-07-26 17:52:29,091 INFO updater.py:324 -- New status: waiting-for-ssh
2023-07-26 17:52:29,091 INFO updater.py:261 -- [1/7] Waiting for SSH to become available
2023-07-26 17:52:29,091 INFO updater.py:266 -- Running `uptime` as a test.
2023-07-26 17:52:29,092 INFO command_runner.py:204 -- Fetched IP: 52.25.176.78
2023-07-26 17:52:29,092 VINFO command_runner.py:371 -- Running `uptime`
2023-07-26 17:52:29,093 VVINFO command_runner.py:373 -- Full command is `ssh -tt -i /root/.ssh/ray-autoscaler_2_us-west-2.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_63a9f0ea7b/c21f969b5f/%C -o ControlPersist=10s -o ConnectTimeout=10s ubuntu@52.25.176.78 bash --login -c -i 'source ~/.bashrc; export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (uptime)'`
ssh: connect to host 52.25.176.78 port 22: Connection timed out
2023-07-26 17:52:39,161 INFO updater.py:312 -- SSH still not available (SSH command failed.), retrying in 5 seconds.
2023-07-26 17:52:44,167 VINFO command_runner.py:371 -- Running `uptime`
2023-07-26 17:52:44,167 VVINFO command_runner.py:373 -- Full command is `ssh -tt -i /root/.ssh/ray-autoscaler_2_us-west-2.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_63a9f0ea7b/c21f969b5f/%C -o ControlPersist=10s -o ConnectTimeout=10s ubuntu@52.25.176.78 bash --login -c -i 'source ~/.bashrc; export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (uptime)'`
ssh: connect to host 52.25.176.78 port 22: Connection refused
2023-07-26 17:52:54,205 INFO updater.py:312 -- SSH still not available (SSH command failed.), retrying in 5 seconds.
2023-07-26 17:52:59,211 VINFO command_runner.py:371 -- Running `uptime`
2023-07-26 17:52:59,211 VVINFO command_runner.py:373 -- Full command is `ssh -tt -i /root/.ssh/ray-autoscaler_2_us-west-2.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_63a9f0ea7b/c21f969b5f/%C -o ControlPersist=10s -o ConnectTimeout=10s ubuntu@52.25.176.78 bash --login -c -i 'source ~/.bashrc; export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (uptime)'`
Warning: Permanently added '52.25.176.78' (ECDSA) to the list of known hosts.
 17:53:14 up 0 min,  1 user,  load average: 0.94, 0.22, 0.08
Shared connection to 52.25.176.78 closed.
2023-07-26 17:52:59,455 INFO updater.py:312 -- SSH still not available (SSH command failed.), retrying in 5 seconds.
2023-07-26 17:53:04,460 VINFO command_runner.py:371 -- Running `uptime`
2023-07-26 17:53:04,461 VVINFO command_runner.py:373 -- Full command is `ssh -tt -i /root/.ssh/ray-autoscaler_2_us-west-2.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_63a9f0ea7b/c21f969b5f/%C -o ControlPersist=10s -o ConnectTimeout=10s ubuntu@52.25.176.78 bash --login -c -i 'source ~/.bashrc; export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (uptime)'`
Shared connection to 52.25.176.78 closed.
2023-07-26 17:53:14,234 SUCC updater.py:280 -- Success.
2023-07-26 17:53:14,235 VINFO utils.py:149 -- Creating AWS resource `ssm` in `us-west-2`
2023-07-26 17:53:14,280 VINFO utils.py:170 -- Creating AWS client `ssm` in `us-west-2`
2023-07-26 17:53:14,307 VINFO utils.py:149 -- Creating AWS resource `cloudwatch` in `us-west-2`
2023-07-26 17:53:14,316 INFO updater.py:374 -- Updating cluster configuration. [hash=8e0ed32642424ce79a6b9dd9f4485e7a3952a457]
2023-07-26 17:53:15,593 INFO updater.py:381 -- New status: syncing-files
2023-07-26 17:53:15,594 INFO updater.py:238 -- [2/7] Processing file mounts
2023-07-26 17:53:15,594 VINFO command_runner.py:371 -- Running `mkdir -p /tmp/ray_tmp_mount/default/~ && chown -R ubuntu /tmp/ray_tmp_mount/default/~`
2023-07-26 17:53:15,595 VVINFO command_runner.py:373 -- Full command is `ssh -tt -i /root/.ssh/ray-autoscaler_2_us-west-2.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_63a9f0ea7b/c21f969b5f/%C -o ControlPersist=10s -o ConnectTimeout=120s ubuntu@52.25.176.78 bash --login -c -i 'source ~/.bashrc; export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (mkdir -p /tmp/ray_tmp_mount/default/~ && chown -R ubuntu /tmp/ray_tmp_mount/default/~)'`
2023-07-26 17:53:16,659 VINFO command_runner.py:414 -- Running `rsync --rsh ssh -i /root/.ssh/ray-autoscaler_2_us-west-2.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_63a9f0ea7b/c21f969b5f/%C -o ControlPersist=10s -o ConnectTimeout=120s -avz --exclude **/.git --exclude **/.git/** --filter dir-merge,- .gitignore /tmp/ray-bootstrap-i8j3cqcw ubuntu@52.25.176.78:/tmp/ray_tmp_mount/default/~/ray_bootstrap_config.yaml`
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/threading.py", line 1009, in _bootstrap_inner
    self.run()
  File "/usr/local/lib/python3.10/site-packages/ray/autoscaler/_private/updater.py", line 153, in run
    self.do_update()
  File "/usr/local/lib/python3.10/site-packages/ray/autoscaler/_private/updater.py", line 382, in do_update
    self.sync_file_mounts(self.rsync_up, step_numbers=(1, NUM_SETUP_STEPS))
  File "/usr/local/lib/python3.10/site-packages/ray/autoscaler/_private/updater.py", line 242, in sync_file_mounts
    do_sync(remote_path, local_path)
  File "/usr/local/lib/python3.10/site-packages/ray/autoscaler/_private/updater.py", line 229, in do_sync
    sync_cmd(local_path, remote_path, docker_mount_if_possible=True)
  File "/usr/local/lib/python3.10/site-packages/ray/autoscaler/_private/updater.py", line 535, in rsync_up
    self.cmd_runner.run_rsync_up(source, target, options=options)
  File "/usr/local/lib/python3.10/site-packages/ray/autoscaler/_private/command_runner.py", line 516, in run_rsync_up
    self.ssh_command_runner.run_rsync_up(source, host_destination, options=options)
  File "/usr/local/lib/python3.10/site-packages/ray/autoscaler/_private/command_runner.py", line 415, in run_rsync_up
    self._run_helper(command, silent=is_rsync_silent())
  File "/usr/local/lib/python3.10/site-packages/ray/autoscaler/_private/command_runner.py", line 272, in _run_helper
    return run_cmd_redirected(
  File "/usr/local/lib/python3.10/site-packages/ray/autoscaler/_private/subprocess_output_util.py", line 341, in run_cmd_redirected
    return _run_and_process_output(
  File "/usr/local/lib/python3.10/site-packages/ray/autoscaler/_private/subprocess_output_util.py", line 243, in _run_and_process_output
    return process_runner.check_call(
  File "/usr/local/lib/python3.10/subprocess.py", line 364, in check_call
    retcode = call(*popenargs, **kwargs)
  File "/usr/local/lib/python3.10/subprocess.py", line 345, in call
    with Popen(*popenargs, **kwargs) as p:
  File "/usr/local/lib/python3.10/subprocess.py", line 966, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/usr/local/lib/python3.10/subprocess.py", line 1842, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'rsync'
2023-07-26 17:53:18,241 PANIC commands.py:830 -- Failed to setup head node.
Error: Failed to setup head node.
2023-07-26 17:53:17,949 ERR updater.py:158 -- New status: update-failed
2023-07-26 17:53:17,949 ERR updater.py:160 -- !!!
2023-07-26 17:53:17,949 VERR updater.py:168 -- {}
2023-07-26 17:53:17,949 ERR updater.py:170 -- [Errno 2] No such file or directory: 'rsync'
2023-07-26 17:53:17,950 ERR updater.py:172 -- !!!
The command '/bin/sh -c ray up example-full.yaml --yes' returned a non-zero code: 1
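
The root cause is the FileNotFoundError above: ray up shells out to rsync on the machine where the command runs (here, inside the build container) to sync files to the head node, and the stock python:3.10 image does not ship rsync. A quick way to confirm this, assuming Docker is available on the host, is:

sudo docker run --rm python:3.10 sh -c 'command -v rsync || echo rsync is not installed'

This prints the path to rsync if the binary exists and the message otherwise.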

Versions / Dependencies

  1. Docker 20.10.21, base image python:3.10 (image ID 2b7ca628da40)
  2. ray-2.6.1
  3. Dependencies: aiohttp-3.8.5 aiohttp-cors-0.7.0 aiosignal-1.3.1 async-timeout-4.0.2 attrs-23.1.0 blessed-1.20.0 boto3-1.28.11 botocore-1.31.11 cachetools-5.3.1 certifi-2023.7.22 charset-normalizer-3.2.0 click-8.1.6 colorful-0.5.5 distlib-0.3.7 filelock-3.12.2 frozenlist-1.4.0 google-api-core-2.11.1 google-auth-2.22.0 googleapis-common-protos-1.59.1 gpustat-1.1 grpcio-1.56.2 idna-3.4 jmespath-1.0.1 jsonschema-4.18.4 jsonschema-specifications-2023.7.1 msgpack-1.0.5 multidict-6.0.4 numpy-1.25.1 nvidia-ml-py-12.535.77 opencensus-0.11.2 opencensus-context-0.1.3 packaging-23.1 platformdirs-3.9.1 prometheus-client-0.17.1 protobuf-4.23.4 psutil-5.9.5 py-spy-0.3.14 pyasn1-0.5.0 pyasn1-modules-0.3.0 pydantic-1.10.12 python-dateutil-2.8.2 pyyaml-6.0.1 referencing-0.30.0 requests-2.31.0 rpds-py-0.9.2 rsa-4.9 s3transfer-0.6.1 six-1.16.0 smart-open-6.3.0 typing-extensions-4.7.1 urllib3-1.26.16 virtualenv-20.21.0 wcwidth-0.2.6 yarl-1.9.2

Reproduction script

To reproduce, create a file named Dockerfile with the following contents:

FROM python:3.10
ADD ./credentials /root/.aws/credentials
RUN pip install boto3 'ray[default]'
RUN wget https://raw.githubusercontent.com/ray-project/ray/master/python/ray/autoscaler/aws/example-full.yaml
RUN ray up example-full.yaml --yes
RUN ray attach example-full.yaml
RUN python -c 'import ray; ray.init()'
RUN ray down example-full.yaml

Then run sudo docker build .

Issue Severity

High: It blocks me from completing my task.

d33tah commented 1 year ago

The issue was not with the AMI, but with the Dockerfile: the base image does not include rsync, which ray up needs. What I needed was RUN apt-get update && apt-get install rsync -y.
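
For reference, a corrected version of the reproduction Dockerfile with that fix applied (the rsync install step is the only change; everything else is copied from the original reproduction):

FROM python:3.10
# rsync is not included in the python:3.10 base image, but `ray up` invokes it
# locally to sync files to the head node, so install it first.
RUN apt-get update && apt-get install rsync -y
ADD ./credentials /root/.aws/credentials
RUN pip install boto3 'ray[default]'
RUN wget https://raw.githubusercontent.com/ray-project/ray/master/python/ray/autoscaler/aws/example-full.yaml
RUN ray up example-full.yaml --yes
RUN ray attach example-full.yaml
RUN python -c 'import ray; ray.init()'
RUN ray down example-full.yaml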