jaredvann opened this issue 3 years ago
We observe a similar failure mode on GCP with large-ish Docker images. (This may be related to the 120-second timeout in the command-running component (SSHCommandRunner / DockerCommandRunner).)
The issue is still occurring with Ray 1.4.0.
cc @ijrsvt can you address this?
@yoavg the 120s timeout only applies to establishing the connection, not to how long the command itself runs.
@jaredvann Can you rerun this with the -vvvv flag? I am not able to reproduce this locally!
Unfortunately the flag doesn't appear to display any additional error information.
Something to note is I am trying to deploy this cluster from another AWS EC2 machine. Could this have any bearing on the issue?
ray up -y cfg.yml -vvvv
Cluster: jared-cluster3
Checking AWS environment settings
Created new security group ray-autoscaler-jared-cluster3 [id=sg-0246f6c4777c84307]
AWS config
IAM Profile: ray-autoscaler-v1 [default]
EC2 Key pair (all available node types): ray-autoscaler_us-east-2 [default]
VPC Subnets (all available node types): subnet-03d7d5d51e9ce760f, subnet-0b72f759287232616, subnet-099787f35f4a49cbf [default]
EC2 Security groups (all available node types): sg-0246f6c4777c84307 [default]
EC2 AMI (all available node types): ami-08bf49c7b3a0c761e [dlami]
No head node found. Launching a new cluster. Confirm [y/N]: y [automatic, due to --yes]
Acquiring an up-to-date head node
Launched 1 nodes [subnet_id=subnet-03d7d5d51e9ce760f]
Launched instance i-0d1985362e724658e [state=pending, info=pending]
Launched a new head node
Fetching the new head node
<1/1> Setting up head node
Prepared bootstrap config
New status: waiting-for-ssh
[1/7] Waiting for SSH to become available
Running `uptime` as a test.
Waiting for IP
Not yet available, retrying in 5 seconds
Received: 3.142.221.13
Running `uptime`
Full command is `ssh -tt -i /home/ubuntu/.ssh/ray-autoscaler_us-east-2.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_1d41c853af/4817a4d972/%C -o ControlPersist=10s -o ConnectTimeout=5s ubuntu@3.142.221.13 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (uptime)'`
ssh: connect to host 3.142.221.13 port 22: Connection timed out
SSH still not available (SSH command failed.), retrying in 5 seconds.
Running `uptime`
Full command is `ssh -tt -i /home/ubuntu/.ssh/ray-autoscaler_us-east-2.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_1d41c853af/4817a4d972/%C -o ControlPersist=10s -o ConnectTimeout=5s ubuntu@3.142.221.13 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (uptime)'`
ssh: connect to host 3.142.221.13 port 22: Connection timed out
SSH still not available (SSH command failed.), retrying in 5 seconds.
Running `uptime`
Full command is `ssh -tt -i /home/ubuntu/.ssh/ray-autoscaler_us-east-2.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_1d41c853af/4817a4d972/%C -o ControlPersist=10s -o ConnectTimeout=5s ubuntu@3.142.221.13 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (uptime)'`
ssh: connect to host 3.142.221.13 port 22: Connection timed out
SSH still not available (SSH command failed.), retrying in 5 seconds.
Running `uptime`
Full command is `ssh -tt -i /home/ubuntu/.ssh/ray-autoscaler_us-east-2.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_1d41c853af/4817a4d972/%C -o ControlPersist=10s -o ConnectTimeout=5s ubuntu@3.142.221.13 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (uptime)'`
ssh: connect to host 3.142.221.13 port 22: Connection timed out
SSH still not available (SSH command failed.), retrying in 5 seconds.
Running `uptime`
Full command is `ssh -tt -i /home/ubuntu/.ssh/ray-autoscaler_us-east-2.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_1d41c853af/4817a4d972/%C -o ControlPersist=10s -o ConnectTimeout=5s ubuntu@3.142.221.13 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (uptime)'`
ssh: connect to host 3.142.221.13 port 22: Connection refused
SSH still not available (SSH command failed.), retrying in 5 seconds.
Running `uptime`
Full command is `ssh -tt -i /home/ubuntu/.ssh/ray-autoscaler_us-east-2.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_1d41c853af/4817a4d972/%C -o ControlPersist=10s -o ConnectTimeout=5s ubuntu@3.142.221.13 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (uptime)'`
Warning: Permanently added '3.142.221.13' (ECDSA) to the list of known hosts.
10:57:52 up 0 min, 1 user, load average: 2.79, 0.72, 0.24
Shared connection to 3.142.221.13 closed.
Success.
Updating cluster configuration. [hash=2ce54bb3a89cdab81fd95a6b11f97c53cc1abe0f]
New status: syncing-files
[2/7] Processing file mounts
Running `mkdir -p /tmp/ray_tmp_mount/jared-cluster3/~ && chown -R ubuntu /tmp/ray_tmp_mount/jared-cluster3/~`
Full command is `ssh -tt -i /home/ubuntu/.ssh/ray-autoscaler_us-east-2.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_1d41c853af/4817a4d972/%C -o ControlPersist=10s -o ConnectTimeout=120s ubuntu@3.142.221.13 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (mkdir -p /tmp/ray_tmp_mount/jared-cluster3/~ && chown -R ubuntu /tmp/ray_tmp_mount/jared-cluster3/~)'`
Shared connection to 3.142.221.13 closed.
Running `rsync --rsh ssh -i /home/ubuntu/.ssh/ray-autoscaler_us-east-2.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_1d41c853af/4817a4d972/%C -o ControlPersist=10s -o ConnectTimeout=120s -avz /tmp/ray-bootstrap-oswznkvt ubuntu@3.142.221.13:/tmp/ray_tmp_mount/jared-cluster3/~/ray_bootstrap_config.yaml`
sending incremental file list
ray-bootstrap-oswznkvt
sent 894 bytes received 35 bytes 1,858.00 bytes/sec
total size is 1,860 speedup is 2.00
Running `docker inspect -f '{{.State.Running}}' ray_container || true`
Full command is `ssh -tt -i /home/ubuntu/.ssh/ray-autoscaler_us-east-2.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_1d41c853af/4817a4d972/%C -o ControlPersist=10s -o ConnectTimeout=120s ubuntu@3.142.221.13 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (docker inspect -f '"'"'{{.State.Running}}'"'"' ray_container || true)'`
Shared connection to 3.142.221.13 closed.
`rsync`ed /tmp/ray-bootstrap-oswznkvt (local) to ~/ray_bootstrap_config.yaml (remote)
~/ray_bootstrap_config.yaml from /tmp/ray-bootstrap-oswznkvt
Running `mkdir -p /tmp/ray_tmp_mount/jared-cluster3/~ && chown -R ubuntu /tmp/ray_tmp_mount/jared-cluster3/~`
Full command is `ssh -tt -i /home/ubuntu/.ssh/ray-autoscaler_us-east-2.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_1d41c853af/4817a4d972/%C -o ControlPersist=10s -o ConnectTimeout=120s ubuntu@3.142.221.13 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (mkdir -p /tmp/ray_tmp_mount/jared-cluster3/~ && chown -R ubuntu /tmp/ray_tmp_mount/jared-cluster3/~)'`
Shared connection to 3.142.221.13 closed.
Running `rsync --rsh ssh -i /home/ubuntu/.ssh/ray-autoscaler_us-east-2.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_1d41c853af/4817a4d972/%C -o ControlPersist=10s -o ConnectTimeout=120s -avz /home/ubuntu/.ssh/ray-autoscaler_us-east-2.pem ubuntu@3.142.221.13:/tmp/ray_tmp_mount/jared-cluster3/~/ray_bootstrap_key.pem`
sending incremental file list
ray-autoscaler_us-east-2.pem
sent 1,414 bytes received 35 bytes 2,898.00 bytes/sec
total size is 1,674 speedup is 1.16
Running `docker inspect -f '{{.State.Running}}' ray_container || true`
Full command is `ssh -tt -i /home/ubuntu/.ssh/ray-autoscaler_us-east-2.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_1d41c853af/4817a4d972/%C -o ControlPersist=10s -o ConnectTimeout=120s ubuntu@3.142.221.13 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (docker inspect -f '"'"'{{.State.Running}}'"'"' ray_container || true)'`
Shared connection to 3.142.221.13 closed.
`rsync`ed /home/ubuntu/.ssh/ray-autoscaler_us-east-2.pem (local) to ~/ray_bootstrap_key.pem (remote)
~/ray_bootstrap_key.pem from /home/ubuntu/.ssh/ray-autoscaler_us-east-2.pem
[3/7] No worker file mounts to sync
New status: setting-up
[4/7] No initialization commands to run.
[5/7] Initializing command runner
Running `command -v docker || echo 'NoExist'`
Full command is `ssh -tt -i /home/ubuntu/.ssh/ray-autoscaler_us-east-2.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_1d41c853af/4817a4d972/%C -o ControlPersist=10s -o ConnectTimeout=120s ubuntu@3.142.221.13 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (command -v docker || echo '"'"'NoExist'"'"')'`
Shared connection to 3.142.221.13 closed.
Running `docker pull rayproject/ray-ml:latest`
Full command is `ssh -tt -i /home/ubuntu/.ssh/ray-autoscaler_us-east-2.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_1d41c853af/4817a4d972/%C -o ControlPersist=10s -o ConnectTimeout=120s ubuntu@3.142.221.13 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (docker pull rayproject/ray-ml:latest)'`
latest: Pulling from rayproject/ray-ml
c549ccf8d472: Pull complete
4037933347b0: Pull complete
3893765d06c1: Pull complete
80dd2646b97c: Pull complete
2d7a8f4b8186: Pull complete
14991ca1c4f1: Pull complete
8ebf2c29d653: Pull complete
31aec75da759: Pull complete
24442a832953: Pull complete
d91380edd589: Pull complete
eeac7b3a6ee2: Pull complete
171c6f068f3d: Downloading [=========================================> ] 2.899GB/3.484GB
Shared connection to 3.142.221.13 closed.
New status: update-failed
!!!
{'message': 'SSH command failed.'}
SSH command failed.
!!!
Failed to setup head node.
@jaredvann I don't think so :/ (I tried repro'ing from an EC2 machine). Do you know if it stalls and then stops? That is, does it freeze at:
171c6f068f3d: Downloading [=========================================> ] 2.899GB/3.484GB
and then close, or does it just close abruptly?
Error still occurs with version 1.5.1.
@ijrsvt the download bar freezes for about 10-15 seconds before showing the error message and the process exiting.
@jaredvann Hmm, not really sure why this keeps breaking for you. Could you try adding the following initialization_commands?
initialization_commands:
    - docker pull rayproject/ray-ml:latest
@ijrsvt no change.
@jaredvann, I ran the same repro you provided and everything worked fine; maybe try a smaller image (rayproject/ray:latest)?
[5/7] Initializing command runner
Shared connection to 107.23.178.9 closed.
latest: Pulling from rayproject/ray-ml
c549ccf8d472: Pull complete
4037933347b0: Pull complete
3893765d06c1: Pull complete
80dd2646b97c: Pull complete
2d7a8f4b8186: Pull complete
14991ca1c4f1: Pull complete
8ebf2c29d653: Pull complete
31aec75da759: Pull complete
24442a832953: Pull complete
d91380edd589: Pull complete
eeac7b3a6ee2: Pull complete
171c6f068f3d: Pull complete
Digest: sha256:4cd2e4233ba1891a0e907b33e337b6818c70806b5e4e75916a2d2fd7532ba86d
Status: Downloaded newer image for rayproject/ray-ml:latest
docker.io/rayproject/ray-ml:latest
Shared connection to 107.23.178.9 closed.
Shared connection to 107.23.178.9 closed.
Shared connection to 107.23.178.9 closed.
Shared connection to 107.23.178.9 closed.
Shared connection to 107.23.178.9 closed.
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
Shared connection to 107.23.178.9 closed.
2021-08-13 03:54:21,377 WARNING command_runner.py:904 -- Nvidia Container Runtime is present, but no GPUs found.
dc9e6e8ce6ac404221e89d56d2a7d131cf5ceca5273705658876099f200ebf43
Shared connection to 107.23.178.9 closed.
Shared connection to 107.23.178.9 closed.
Shared connection to 107.23.178.9 closed.
Shared connection to 107.23.178.9 closed.
Shared connection to 107.23.178.9 closed.
Shared connection to 107.23.178.9 closed.
[6/7] Running setup commands
(0/1) pip install 'boto3>=1.4.8'
Requirement already satisfied: boto3>=1.4.8 in ./anaconda3/lib/python3.7/site-packages (1.4.8)
Requirement already satisfied: s3transfer<0.2.0,>=0.1.10 in ./anaconda3/lib/python3.7/site-packages (from boto3>=1.4.8) (0.1.13)
Requirement already satisfied: jmespath<1.0.0,>=0.7.1 in ./anaconda3/lib/python3.7/site-packages (from boto3>=1.4.8) (0.10.0)
Requirement already satisfied: botocore<1.9.0,>=1.8.0 in ./anaconda3/lib/python3.7/site-packages (from boto3>=1.4.8) (1.8.50)
Requirement already satisfied: python-dateutil<3.0.0,>=2.1 in ./anaconda3/lib/python3.7/site-packages (from botocore<1.9.0,>=1.8.0->boto3>=1.4.8) (2.8.1)
Requirement already satisfied: docutils>=0.10 in ./anaconda3/lib/python3.7/site-packages (from botocore<1.9.0,>=1.8.0->boto3>=1.4.8) (0.17.1)
Requirement already satisfied: six>=1.5 in ./anaconda3/lib/python3.7/site-packages (from python-dateutil<3.0.0,>=2.1->botocore<1.9.0,>=1.8.0->boto3>=1.4.8) (1.15.0)
WARNING: You are using pip version 21.1.3; however, version 21.2.4 is available.
You should consider upgrading via the '/home/ray/anaconda3/bin/python -m pip install --upgrade pip' command.
Shared connection to 107.23.178.9 closed.
[7/7] Starting the Ray runtime
Did not find any active Ray processes.
Shared connection to 107.23.178.9 closed.
Local node IP: 172.31.21.89
2021-08-12 17:54:53,223 INFO services.py:1274 -- View the Ray dashboard at http://127.0.0.1:8265
ssh -tt -i /home/ubuntu/.ssh/ray-autoscaler_us-east-2.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_1d41c853af/4817a4d972/%C -o ControlPersist=10s -o ConnectTimeout=120s ubuntu@3.142.221.13 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (docker pull rayproject/ray-ml:latest
is the command that fails, right? Maybe you can add -vv to it (ssh -vv -tt -i /home/ubuntu/.ssh/ray-autoscaler_us-east-2.pem ...); the verbose SSH output may help diagnose the failure.
@jaredvann were you able to try out @richardliaw 's suggestion?
This still happens in version 1.7.1 on GCP, using image: "rayproject/ray:latest-gpu" as well. Any help?
I'm also encountering this with Ray 1.10.0 on GCP using image: "rayproject/ray:1.10.0-py38-cpu". Looks like I'm getting an unexpected EOF towards the end of the docker pull step, when it's extracting downloads:
... etc.
345decc673e1: Extracting 1.093GB/2.187GB
c033e59f74d1: Download complete
unexpected EOF
Shared connection to <ip> closed.
2022-03-09 08:55:44,955 INFO node.py:295 -- wait_for_compute_zone_operation: Waiting for operation operation-1646844944679-5d9cbf7ed82ee-ecef8ae4-f6858e65 to finish...
2022-03-09 08:55:50,293 INFO node.py:307 -- wait_for_compute_zone_operation: Operation operation-1646844944679-5d9cbf7ed82ee-ecef8ae4-f6858e65 finished.
New status: update-failed
!!!
SSH command failed.
!!!
Thanks in advance for your time!
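Editor's note: the failing step in these reports is the docker pull itself, so one mitigation (purely a sketch based on this thread, not an official Ray fix) is to pre-pull the image with retries via initialization_commands, so that the later pull by the command runner hits a warm layer cache. The `retry` helper below is a hypothetical name, and the image tag is just the one mentioned above.

```shell
#!/usr/bin/env bash
# Hypothetical retry helper: run a command until it succeeds, up to N attempts.
# RETRY_DELAY (seconds between attempts) defaults to 5.
retry() {
    local attempts=$1
    shift
    local i
    for ((i = 1; i <= attempts; i++)); do
        "$@" && return 0
        echo "attempt $i/$attempts failed; retrying in ${RETRY_DELAY:-5}s" >&2
        sleep "${RETRY_DELAY:-5}"
    done
    return 1
}

# Example use from a cluster config's initialization_commands (assumption:
# pre-pulling the same image the docker section of the config references):
#   - retry 5 docker pull rayproject/ray:1.10.0-py38-cpu
```

If the pull fails transiently (network hiccup, background apt activity), a later attempt resumes from the already-downloaded layers.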
@wuisawesome , can you please take a look/triage when you have a chance?
Upgrading to P1 since it sounds like it's affecting more users now.
@mataney @smacpher can either of you provide a full cluster yaml? or at least the AMI/machine image you're using (or if it's left unspecified, and the setup command you specify if any?)
Hi @wuisawesome thank you for taking a look at this issue. Here's my cluster yaml with some internal things like project ID and code paths redacted:
cluster_name: test
max_workers: 2
upscaling_speed: 1.0
docker:
    image: <custom image that inherits from rayproject/ray:1.10.0-py38-cpu>
    container_name: "ray_container"
    pull_before_run: True
    run_options:
        - --ulimit nofile=65536:65536
idle_timeout_minutes: 5
provider:
    type: gcp
    region: us-west1
    availability_zone: us-west1-a
    project_id: <project-id>
auth:
    ssh_user: ubuntu
available_node_types:
    ray_head_default:
        resources: { "CPU": 4 }
        node_config:
            machineType: n1-standard-4
            disks:
                - boot: true
                  autoDelete: true
                  type: PERSISTENT
                  initializeParams:
                      diskSizeGb: 50
                      sourceImage: projects/deeplearning-platform-release/global/images/family/common-cpu
head_node_type: ray_head_default
file_mounts:
    { <path/to/my_first_part_code>: <path/to/my_first_part_code> }
rsync_exclude:
    - "**/.git"
    - "**/.git/**"
rsync_filter:
    - ".gitignore"
initialization_commands: []
setup_commands:
    - export PYTHONPATH=$PYTHONPATH:<path/to/my_first_part_code>
head_setup_commands:
    - pip install google-api-python-client==1.7.8
worker_setup_commands: []
head_start_ray_commands:
    - ray stop
    - >-
        ray start
        --head
        --port=6379
        --dashboard-port 8625
        --object-manager-port=8076
        --autoscaling-config=~/ray_bootstrap_config.yaml
worker_start_ray_commands:
    - ray stop
    - >-
        ray start
        --address=$RAY_HEAD_IP:6379
        --object-manager-port=8076
head_node: {}
worker_nodes: {}
I've noticed that this was happening more often for me on smaller machine types (e.g. n1-standard-2) and also more often when using my custom image. I suspect it may have to do with my image having a couple of large layers (~2 GB). It cuts off during the extraction phase (345decc673e1: Extracting 1.418GB/2.187GB), then I see an unexpected EOF and things fail.
When I bumped up to n1-standard-4 and larger, there were no more failures during the docker pull step, but I started running into a new error, which I can consistently reproduce using the exact YAML I attached:
... etc.
[6/7] Running setup commands
(0/2) export PYTHONPATH=$PYTHONPATH:...
Shared connection to 34.82.212.231 closed.
(1/2) pip install google-api-python-...
Requirement already satisfied: google-api-python-client==1.7.8 in ./anaconda3/lib/python3.8/site-packages (1.7.8)
Requirement already satisfied: uritemplate<4dev,>=3.0.0 in ./anaconda3/lib/python3.8/site-packages (from google-api-python-client==1.7.8) (3.0.1)
Requirement already satisfied: google-auth-httplib2>=0.0.3 in ./anaconda3/lib/python3.8/site-packages (from google-api-python-client==1.7.8) (0.1.0)
Requirement already satisfied: google-auth>=1.4.1 in ./anaconda3/lib/python3.8/site-packages (from google-api-python-client==1.7.8) (2.4.1)
Requirement already satisfied: httplib2<1dev,>=0.9.2 in ./anaconda3/lib/python3.8/site-packages (from google-api-python-client==1.7.8) (0.20.2)
Requirement already satisfied: six<2dev,>=1.6.1 in ./anaconda3/lib/python3.8/site-packages (from google-api-python-client==1.7.8) (1.13.0)
Requirement already satisfied: cachetools<6.0,>=2.0.0 in ./anaconda3/lib/python3.8/site-packages (from google-auth>=1.4.1->google-api-python-client==1.7.8) (5.0.0)
Requirement already satisfied: pyasn1-modules>=0.2.1 in ./anaconda3/lib/python3.8/site-packages (from google-auth>=1.4.1->google-api-python-client==1.7.8) (0.2.8)
Requirement already satisfied: rsa<5,>=3.1.4 in ./anaconda3/lib/python3.8/site-packages (from google-auth>=1.4.1->google-api-python-client==1.7.8) (4.8)
Requirement already satisfied: pyparsing!=3.0.0,!=3.0.1,!=3.0.2,!=3.0.3,<4,>=2.4.2 in ./anaconda3/lib/python3.8/site-packages (from httplib2<1dev,>=0.9.2->google-api-python-client==1.7.8) (3.0.7)
Requirement already satisfied: pyasn1<0.5.0,>=0.4.6 in ./anaconda3/lib/python3.8/site-packages (from pyasn1-modules>=0.2.1->google-auth>=1.4.1->google-api-python-client==1.7.8) (0.4.8)
Shared connection to 34.82.212.231 closed.
[7/7] Starting the Ray runtime
Did not find any active Ray processes.
Shared connection to 34.82.212.231 closed.
Error: No such container: ray_container
Shared connection to 34.82.212.231 closed.
2022-03-11 09:24:29,027 INFO node.py:295 -- wait_for_compute_zone_operation: Waiting for operation operation-1647019468732-5d9f49a5f36f9-14e9d270-ed9bb0d9 to finish...
2022-03-11 09:24:34,310 INFO node.py:307 -- wait_for_compute_zone_operation: Operation operation-1647019468732-5d9f49a5f36f9-14e9d270-ed9bb0d9 finished.
New status: update-failed
!!!
SSH command failed.
!!!
Note the error message: Error: No such container: ray_container. Is it possible there's some sort of race condition where the setup commands start before the container has finished starting? (I may be way off here; I haven't dug into the autoscaler code yet.)
When I run ray up config.yaml -y again, it succeeds, and much faster, since most steps were already completed by my first attempt.
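Editor's note: if this is such a race, a generic guard like the sketch below could poll the container state before anything that assumes the container exists runs. `wait_until` and its use of docker inspect are illustrative naming only, not autoscaler code.

```shell
#!/usr/bin/env bash
# Hypothetical guard: poll a predicate command until it succeeds or a deadline
# (WAIT_TIMEOUT seconds, default 60) passes. POLL_DELAY paces the polling.
wait_until() {
    local deadline=$((SECONDS + ${WAIT_TIMEOUT:-60}))
    until "$@"; do
        if [ "$SECONDS" -ge "$deadline" ]; then
            return 1
        fi
        sleep "${POLL_DELAY:-2}"
    done
}

# Sketch of the intended use (container name taken from the YAML above):
#   wait_until sh -c \
#       '[ "$(docker inspect -f "{{.State.Running}}" ray_container 2>/dev/null)" = true ]'
```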
This also happens on the head node when trying to use the autoscaler to add more nodes. Sometimes after retrying it succeeds, but it's inconsistent.
Thanks again for your time! And let me know if I can help in any way.
I think I am having the same issue on GCP:
Cluster: gpu-docker-a100-jose
2022-03-16 10:38:15,706 INFO util.py:278 -- setting max workers for head node type to 0
Checking GCP environment settings
2022-03-16 10:38:16,374 INFO config.py:451 -- _configure_key_pair: Private key not specified in config, using /home/jupyter/.ssh/ray-autoscaler_gcp_us-central1_<project_id>_ubuntu_0.pem
No head node found. Launching a new cluster. Confirm [y/N]: y [automatic, due to --yes]
Acquiring an up-to-date head node
2022-03-16 10:38:18,497 INFO node.py:295 -- wait_for_compute_zone_operation: Waiting for operation operation-1647427096700-5da5382e4653d-ff584f4b-70b16514 to finish...
2022-03-16 10:38:29,097 INFO node.py:307 -- wait_for_compute_zone_operation: Operation operation-1647427096700-5da5382e4653d-ff584f4b-70b16514 finished.
Launched a new head node
Fetching the new head node
<1/1> Setting up head node
Prepared bootstrap config
2022-03-16 10:38:29,934 INFO node.py:295 -- wait_for_compute_zone_operation: Waiting for operation operation-1647427109482-5da5383a77032-355f78e7-90f5d570 to finish...
2022-03-16 10:38:35,350 INFO node.py:307 -- wait_for_compute_zone_operation: Operation operation-1647427109482-5da5383a77032-355f78e7-90f5d570 finished.
New status: waiting-for-ssh
[1/7] Waiting for SSH to become available
Running `uptime` as a test.
Fetched IP: 104.154.73.96
ssh: connect to host 104.154.73.96 port 22: Connection refused
SSH still not available (SSH command failed.), retrying in 5 seconds.
ssh: connect to host 104.154.73.96 port 22: Connection refused
SSH still not available (SSH command failed.), retrying in 5 seconds.
ssh: connect to host 104.154.73.96 port 22: Connection refused
SSH still not available (SSH command failed.), retrying in 5 seconds.
ssh: connect to host 104.154.73.96 port 22: Connection refused
SSH still not available (SSH command failed.), retrying in 5 seconds.
ssh: connect to host 104.154.73.96 port 22: Connection refused
SSH still not available (SSH command failed.), retrying in 5 seconds.
ssh: connect to host 104.154.73.96 port 22: Connection refused
SSH still not available (SSH command failed.), retrying in 5 seconds.
Warning: Permanently added '104.154.73.96' (ECDSA) to the list of known hosts.
10:39:08 up 0 min, 1 user, load average: 0.79, 0.19, 0.06
Shared connection to 104.154.73.96 closed.
Success.
Updating cluster configuration. [hash=73d374e9aed5abf00300cb49f327feb32491e4ee]
2022-03-16 10:39:08,632 INFO node.py:295 -- wait_for_compute_zone_operation: Waiting for operation operation-1647427148195-5da5385f6259b-91093f21-84dd5f09 to finish...
2022-03-16 10:39:14,054 INFO node.py:307 -- wait_for_compute_zone_operation: Operation operation-1647427148195-5da5385f6259b-91093f21-84dd5f09 finished.
New status: syncing-files
[2/7] Processing file mounts
Shared connection to 104.154.73.96 closed.
Shared connection to 104.154.73.96 closed.
[3/7] No worker file mounts to sync
2022-03-16 10:39:19,821 INFO node.py:295 -- wait_for_compute_zone_operation: Waiting for operation operation-1647427159340-5da5386a036d4-d4e8ba7b-470baea2 to finish...
2022-03-16 10:39:25,204 INFO node.py:307 -- wait_for_compute_zone_operation: Operation operation-1647427159340-5da5386a036d4-d4e8ba7b-470baea2 finished.
New status: setting-up
[4/7] Running initialization commands
Warning: Permanently added '104.154.73.96' (ECDSA) to the list of known hosts.
WARNING: Your config file at [/home/ubuntu/.docker/config.json] contains these credential helper entries:
{
"credHelpers": {
"gcr.io": "gcloud",
"us.gcr.io": "gcloud",
"eu.gcr.io": "gcloud",
"asia.gcr.io": "gcloud",
"staging-k8s.gcr.io": "gcloud",
"marketplace.gcr.io": "gcloud"
}
}
Adding credentials for: us-central1-docker.pkg.dev
Docker configuration file updated.
Connection to 104.154.73.96 closed.
[5/7] Initializing command runner
Shared connection to 104.154.73.96 closed.
latest-gpu: Pulling from rayproject/ray-ml
55322776: Pulling fs layer
743589ce: Pulling fs layer
3cedd2c6: Pulling fs layer
39d99446: Pulling fs layer
39450a51: Pulling fs layer
0854cdb4: Pulling fs layer
257910db: Pulling fs layer
9b80ddf2: Pulling fs layer
ec0a3755: Pulling fs layer
51c3bfb6: Pulling fs layer
91efdf9f: Pulling fs layer
7197939e: Pulling fs layer
b24f58c2: Pulling fs layer
8cc981d5: Pulling fs layer
72d152ed: Pulling fs layer
9065bada: Pulling fs layer
c8e1064c: Pulling fs layer
f7a46693: Pulling fs layer
04d31d20: Pulling fs layer
10cc91f4: Pulling fs layer
eac638d3: Pulling fs layer
53b8f2f8: Pulling fs layer
145a2924: Pulling fs layer
122cddae: Pulling fs layer
6a1c067d: Pulling fs layer
9b80ddf2: Extracting 1.001GB/1.146GB
unexpected EOF
Shared connection to 104.154.73.96 closed.
2022-03-16 10:42:24,758 INFO node.py:295 -- wait_for_compute_zone_operation: Waiting for operation operation-1647427344128-5da5391a3d926-91dd9ad3-ea09ff8f to finish...
2022-03-16 10:42:30,129 INFO node.py:307 -- wait_for_compute_zone_operation: Operation operation-1647427344128-5da5391a3d926-91dd9ad3-ea09ff8f finished.
New status: update-failed
!!!
SSH command failed.
!!!
Failed to setup head node.
Re-running ray up cluster.yml again works fine.
Re: melonipoika
Running on GCP, we used to see this, but after adding
bash -c $'ps -e | grep apt | awk \'{print $1}\' | xargs tail -f --pid || true' # Wait for auto upgrade that might run in the background.
first in initialization_commands, this went away. It seems that GCP runs an auto-upgrade in the background on the node that can interfere with things like docker pull.
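Editor's note: if the background upgrader is indeed the culprit, an alternative to waiting for it (a sketch only; the service name assumes a stock Ubuntu/Debian base image and may differ elsewhere) is to stop it outright at the top of initialization_commands:

```yaml
initialization_commands:
    # Assumption: Ubuntu-style unattended-upgrades service. Stop and disable
    # it before anything else so it cannot race the docker pull.
    - sudo systemctl stop unattended-upgrades.service || true
    - sudo systemctl disable unattended-upgrades.service || true
```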
@oscartackstrom I've added the line to the initialization commands as you suggested, but when I then run ray up, it fails because of a syntax error caused by the curly braces in {print $1}. What am I doing wrong? Just removing the curly braces fixes the syntax error, but then starting the head node fails with an SSH problem just as before.
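Editor's note: one guess at the syntax error (unverified; it would apply if the launcher passes commands through a Python str.format-style template, which treats single braces as placeholders) is that the braces need doubling. Alternatively, the wait can be expressed without braces at all by polling the dpkg lock (path assumes Debian/Ubuntu):

```yaml
initialization_commands:
    # Guess: double the braces so a format-style pass emits them literally.
    - bash -c $'ps -e | grep apt | awk \'{{print $1}}\' | xargs tail -f --pid || true'
    # Brace-free alternative: wait until no process holds the apt/dpkg lock.
    - while sudo fuser /var/lib/dpkg/lock-frontend >/dev/null 2>&1; do sleep 5; done
```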
Upvote from me also.
In Ray 1.12.1, running on GCP, I am also experiencing similar errors; even with the initialization command from @oscartackstrom it doesn't work properly.
I also experienced this error but got it working by re-running. Details:
I started with the instructions to launch ray on GCP.
On first run of ray up example-full.yaml I got:
.....
Shared connection to 35.223.182.31 closed.
latest-gpu: Pulling from rayproject/ray-ml
09db6f815738: Pull complete
d79696845ef2: Pull complete
9cace1db9258: Pull complete
5093a1370488: Pull complete
affabbc9735b: Pull complete
839e92906efc: Pull complete
36d15b49ae4c: Pull complete
be6750df422d: Pull complete
02a4c72adbe9: Pull complete
cc3c4b345c51: Pull complete
f87c4242b0c3: Pull complete
fa3dc8316827: Pull complete
56af5a0d16a1: Pull complete
e6f237e89986: Pull complete
53d40d13bcdf: Extracting [=> ] 8.913MB/245.2MB
ef63e11c9202: Download complete
977a683aa960: Download complete
e71a2f98112f: Download complete
d1835b6f204b: Download complete
c55db8918060: Download complete
106e9528714d: Download complete
2309e5741e2a: Download complete
57cf067a6500: Download complete
115e34d38940: Download complete
f52b4c403e2d: Download complete
18684af57161: Download complete
ae14a5017bf5: Download complete
8333c436436b: Download complete
unexpected EOF
Shared connection to 35.223.182.31 closed.
2022-10-16 08:40:57,081 INFO node.py:311 -- wait_for_compute_zone_operation: Waiting for operation operation-1665924056563-5eb262b9d4053-fc3deff7-df7b529a to finish...
2022-10-16 08:41:02,484 INFO node.py:330 -- wait_for_compute_zone_operation: Operation operation-1665924056563-5eb262b9d4053-fc3deff7-df7b529a finished.
New status: update-failed
!!!
SSH command failed.
!!!
Failed to setup head node.
Then after re-running with verbose debug statements (ray up example-full.yaml -vvvv), it seems to have worked. The initial prompt mentioned restarting:
$ ray up example-full.yaml -vvvv
Cluster: default
2022-10-16 08:48:16,736 INFO util.py:364 -- setting max workers for head node type to 0
Loaded cached provider configuration from /tmp/ray-config-0fd8db53e13c59f92d610a668ec5757f192b7c1d
If you experience issues with the cloud provider, try re-running the command with --no-config-cache.
Updating cluster configuration and running full setup.
Cluster Ray runtime will be restarted. Confirm [y/N]: y
Eventually it was able to download the image and start things up:
--------------------
Ray runtime started.
--------------------
I wonder if it is some kind of timeout issue, and the retry was able to pick up from where it left off?
My Ray head node fails even earlier; after running ray up config.yaml I get:
Acquiring an up-to-date head node
Launched 1 nodes [network_interfaces=[{'DeviceIndex': 0, 'AssociatePublicIpAddress': False, 'SubnetId': '...', 'Groups': ['...']}]]
Launched instance i-068b76f0a85130105 [state=pending, info=pending]
Launched a new head node
Fetching the new head node
Head node fetch timed out. Failed to create head node.
Presumably related to the hard-coded timeout of 50s here:
https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/_private/commands.py#L746
I then have to run ray up config.yaml again and it works fine.
Continuing to debug this: the call that is failing is the docker pull command that takes place on the head node when the cluster is initializing. I suspect running sudo dockerd --max-download-attempts 10 before Docker starts could fix this issue; I'm trying to figure out how/where to do that.
Maybe you can try putting it into the initialization_commands? Those run before the docker pull.
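Editor's note: recent Docker daemons expose this retry count as a daemon.json key ("max-download-attempts", the config-file equivalent of the dockerd flag above), so rather than changing how dockerd is launched, a sketch of the same idea is to drop a config file and restart the daemon from initialization_commands (the exact restart command depends on the image's init system):

```json
{
    "max-download-attempts": 10
}
```

written to /etc/docker/daemon.json before any pull, followed by something like sudo systemctl restart docker.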
cc @wuisawesome
I've had the same issue. The thing is, even if you rerun and it works fine for the head node, the same happens for the worker nodes, leaving them in a constant "pending" state.
Re: melonipoika
Running on GCP, we used to see this, but after adding
bash -c $'ps -e | grep apt | awk \'{print $1}\' | xargs tail -f --pid || true' # Wait for auto upgrade that might run in the background.
first in initialization_commands, this went away. It seems that GCP runs an auto-upgrade in the background on the node that can interfere with things like docker pull.
This works like a charm for GCP, thank you @oscartackstrom! The only downside is that it takes a bit more time, since it has to wait.
2023, version 2.6.3: still the same on AWS. (Plus many, many other issues. Is anybody working on ray up on AWS???)
# use: ray up -y ray-cluster-config.yaml to start it
cluster_name: test
max_workers: 16
setup_commands:
    # unattended upgrades is a mess; also problems with APT, "uninstall" manually
    - "[ -f ~/.ok ] || (while ! sudo rm /usr/bin/unattended-upgrade; do sleep 1; done)"
    - "[ -f ~/.ok ] || (sudo killall -9 unattended-upgrade || true)" # just in case
    - touch ~/.ok
    # installed wrong version!
    - pip install "ray[default]==2.6.3"
provider:
    type: aws
    region: us-east-1
    cache_stopped_nodes: False
available_node_types:
    ray.head.default:
        node_config:
            InstanceType: m5.large
    ray.worker.default:
        max_workers: 16
        node_config:
            InstanceType: m5.large
            InstanceMarketOptions:
                MarketType: spot
Adding 'core' and 'triage' to catch in next week's weekly GH triage at Anyscale; sorry for the delay folks.
Hi @iirekm, are you able to use the latest Ray instead of 2.6.3? I tried
# use: ray up -y ray-cluster-config.yaml to start it
cluster_name: test
max_workers: 16
setup_commands:
    # unattended upgrades is a mess; also problems with APT, "uninstall" manually
    - "[ -f ~/.ok ] || (while ! sudo rm /usr/bin/unattended-upgrade; do sleep 1; done)"
    - "[ -f ~/.ok ] || (sudo killall -9 unattended-upgrade || true)" # just in case
    - touch ~/.ok
    # installed wrong version!
    - pip install "ray[default]"
provider:
    type: aws
    region: us-east-1
    cache_stopped_nodes: False
available_node_types:
    ray.head.default:
        node_config:
            InstanceType: m5.large
    ray.worker.default:
        max_workers: 16
        node_config:
            InstanceType: m5.large
            InstanceMarketOptions:
                MarketType: spot
and it works.
What is the problem?
Ray 1.3.0, Python 3.8.5, Ubuntu 20.04
Running ray up to deploy a cluster to AWS starts a head node but disconnects during pulling of Docker images. Repeating multiple times with new clusters leads to the same result.
Reproduction
Config file: