sudharshankakumanu opened 1 year ago
Ray 1.1 seems pretty old. Can you use a newer version on your local machine?
Updated to 2.4.0, but no luck.
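(For reference, assuming a pip-based install, the upgrade on the local machine would be something like:)

    pip install -U "ray[default]==2.4.0"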
I observe the same problem, but on an AWS setup with Ray 2.3.1. @sudharshankakumanu did you find a way to fix it? Here is my config file (redacted in a few places for privacy):
cluster_name: cluster-dev
max_workers: 2
upscaling_speed: 1.0

provider:
    type: aws
    region: us-east-1
    cache_stopped_nodes: True # Cache stopped nodes to speed up future starts.

docker:
    head_image: <head-image-in-ECR-repo-tag>
    worker_image: <worker-image-in-ECR-repo-tag>
    container_name: ray-ml
    pull_before_run: True
    run_options:
        - --ulimit nofile=65536:65536
        - --env DATABASE_NAME=dev
    disable_automatic_runtime_detection: True
    disable_shm_size_detection: False

auth:
    ssh_user: ubuntu
    ssh_private_key: ~/Downloads/aws-cluster-key.pem

available_node_types:
    ray.head.default:
        node_config:
            LaunchTemplate:
                LaunchTemplateId: lt-09581174b5746050a
                Version: $Latest
            InstanceType: t2.small
        # We don't want any tasks to be scheduled on the head node.
        resources: { "CPU": 0, "GPU": 0 }
        min_workers: 1
    ray.worker.default:
        node_config:
            LaunchTemplate:
                LaunchTemplateId: lt-0c02f4a411183ef71
                Version: $Latest
            InstanceType: t2.xlarge
        min_workers: 0
        max_workers: 2

# Define the name of the head node from above here.
head_node_type: ray.head.default

initialization_commands:
    # Wait for apt-get to be available. (recommendation from https://github.com/ray-project/ray/issues/15893)
    - bash -c $'ps -e | grep apt | awk \'{print $1}\' | xargs tail -f --pid || true'
    # Log in to ECR so we can pull images.
    - aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin <private-ECR-repo>;

setup_commands: []
head_setup_commands: []
worker_setup_commands: []

# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
    - ray stop
    - >-
      ray start
      --head
      --port=6379
      --object-manager-port=8076
      --autoscaling-config=~/ray_bootstrap_config.yaml
      --dashboard-host=0.0.0.0
      --dashboard-agent-listen-port=52365
      --disable-usage-stats

worker_start_ray_commands:
    - ray stop
    - >-
      ray start
      --address=$RAY_HEAD_IP:6379
      --object-manager-port=8076
      --disable-usage-stats
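When workers stay stuck with a config like this, the autoscaler's monitor log usually says why (IAM errors, capacity/subnet errors, SSH failures, etc.). A quick way to check, using the standard Ray CLI and assuming the config above is saved as cluster-dev.yaml:

    # Tail the autoscaler monitor output for this cluster (run from the launching machine)
    ray monitor cluster-dev.yaml

    # Or inspect the monitor logs directly on the head node
    tail -f /tmp/ray/session_latest/logs/monitor.*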
@zakajd In your case it looks like you are auto-launching the nodes, whereas I am trying to use the cluster launcher to start Ray on existing nodes. You can check whether your head node has the right instance profile to launch workers.
ray.head.default:
    resources: {"CPU": 1, "GPU": 1, "custom": 5}
    node_config:
        InstanceType: g5.12xlarge
        ImageId: ami-07135...
        IamInstanceProfile:
            Arn: arn:aws:iam::1234567890:instance-profile/head-node-instance-profile
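If the head node is the one launching workers, its instance profile needs EC2 permissions such as ec2:RunInstances. A quick sanity check from the head node, assuming the AWS CLI is installed and the instance metadata endpoint is reachable:

    # Which IAM role/profile is this instance actually running under?
    curl -s http://169.254.169.254/latest/meta-data/iam/info
    aws sts get-caller-identity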
Hi, I'm a bot from the Ray team :)
To help human contributors focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months.
If there is no further activity in the next 14 days, the issue will be closed!
You can always ask for help on our discussion forum or Ray's public Slack channel.
I am experiencing the same problem trying to launch a Ray cluster on existing Ubuntu nodes.
When I run ray up local_cluster_config.yaml --no-config-cache, the head node's Docker container is instantiated with all of the necessary files (.pem, .config), but the worker nodes are stuck in the "launching" state.
I verified that I can ssh from the head node into the worker nodes without entering a password or answering a host-key prompt:
(base) ray@HEAD_IP:~$ ssh -i ray_bootstrap_key.pem USERNAME@WORKER2_IP
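A stricter version of that check is to force non-interactive mode, so the connection fails outright instead of falling back to a password prompt (standard OpenSSH options):

    ssh -i ray_bootstrap_key.pem -o BatchMode=yes -o ConnectTimeout=5 USERNAME@WORKER2_IP 'echo ok'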
Here's my cluster launcher config:
cluster_name: my_cluster

docker:
    image: rayproject/ray:latest-cpu # use this one if you don't need ML dependencies, it's faster to pull
    container_name: "hqc_ray"
    pull_before_run: False
    run_options: [] # Extra options to pass into "docker run"

provider:
    type: local
    head_ip: HEAD_IP
    worker_ips: [WORKER1_IP, WORKER2_IP]

# How Ray will authenticate with newly launched nodes.
auth:
    ssh_user: USERNAME
    ssh_private_key: ~/.ssh/id_rsa

min_workers: 2
max_workers: 2
upscaling_speed: 1.0
idle_timeout_minutes: 30

file_mounts: {
#    "/path1/on/remote/machine": "/path1/on/local/machine",
#    "/path2/on/remote/machine": "/path2/on/local/machine",
}

cluster_synced_files: []
file_mounts_sync_continuously: False

# Patterns for files to exclude when running rsync up or rsync down
rsync_exclude:
    - "**/.git"
    - "**/.git/**"

# Pattern files to use for filtering out files when running rsync up or rsync down. The file is searched for
# in the source directory and recursively through all subdirectories. For example, if .gitignore is provided
# as a value, the behavior will match git's behavior for finding and using .gitignore files.
rsync_filter:
    - ".gitignore"

# List of commands that will be run before `setup_commands`. If docker is
# enabled, these commands will run outside the container and before docker
# is set up.
initialization_commands: []

# List of shell commands to run to set up each node.
setup_commands: ['mkdir -p ~/.ssh', 'echo -e "Host *\n\tStrictHostKeyChecking no\n\tUserKnownHostsFile /dev/null" >> ~/.ssh/config']

# Custom commands that will be run on the head node after common setup.
head_setup_commands: []

# Custom commands that will be run on worker nodes after common setup.
worker_setup_commands: []

# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
    # If we have e.g. conda dependencies, we could create on each node a conda environment (see `setup_commands` section).
    # In that case we'd have to activate that env on each node before running `ray`:
    # - conda activate my_venv && ray stop
    # - conda activate my_venv && ulimit -c unlimited && ray start --head --port=6379 --autoscaling-config=~/ray_bootstrap_config.yaml
    - ray stop
    - ulimit -c unlimited && ray start --head --port=6379 --dashboard-port=8265 --autoscaling-config=~/ray_bootstrap_config.yaml

# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
    # If we have e.g. conda dependencies, we could create on each node a conda environment (see `setup_commands` section).
    # In that case we'd have to activate that env on each node before running `ray`:
    # - conda activate my_venv && ray stop
    # - ray start --address=$RAY_HEAD_IP:6379
    - ray stop
    - ray start --address=$RAY_HEAD_IP:6379
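Worth noting: as observed later in this thread, the head node appears to use ray_bootstrap_key.pem (a copy of the configured key) to SSH into the workers, which suggests the matching public key has to be in authorized_keys on the head node and on every worker. A minimal sketch of that prerequisite, assuming ~/.ssh/id_rsa from the auth section above is the key pair:

    # Run once from the machine that executes `ray up`
    ssh-copy-id -i ~/.ssh/id_rsa.pub USERNAME@HEAD_IP
    ssh-copy-id -i ~/.ssh/id_rsa.pub USERNAME@WORKER1_IP
    ssh-copy-id -i ~/.ssh/id_rsa.pub USERNAME@WORKER2_IP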
I am using a cluster launcher YAML to launch a Ray cluster on existing GCP VMs.
Setup:
Running
ray up config.yaml
from my local machine only sets up the head node; the worker nodes are untouched. I have a private/public key pair generated on my local machine that is authorized to SSH into all the VMs, and I used the same key in config.yaml.
I verified that I am able to SSH into all the VMs from my local machine using
ssh -i ~/.ssh/gcloud_instance kakumanu@<ext-ip-address>
config.yml
I tried removing
ssh_private_key: ~/.ssh/gcloud_instance
from the config, but then I am unable to SSH into the head node (which makes sense, since my local machine needs the private key to SSH into the head node). From the docs and this issue, it is not clear whether the private key in config.yaml must be a key generated on the local machine or on the head node.
Monitoring the logs, it looks like
ray_bootstrap_key.pem
is used as the private key by the head node to SSH into the workers, but the corresponding public key is not authorized on the worker nodes. I tried adding
~/.ssh/gcloud_instance
to the head node and was able to manually SSH into the worker nodes from the head node. I would like to leverage the cluster launcher so I can avoid manual setup on a 25-node cluster with a large codebase to rsync. Please advise on any possible solutions.
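If authorizing the public key on the workers is what unblocks the head node, that step can at least be scripted rather than done by hand on all 25 nodes. A minimal sketch, assuming a hypothetical workers.txt listing the workers' external IPs and the same key pair as above:

    # Authorize the local public key on every worker (run from the local machine)
    while read -r ip; do
        ssh-copy-id -i ~/.ssh/gcloud_instance.pub "kakumanu@$ip"
    done < workers.txt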