ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

Ray cluster launcher on existing GCP VMs, head node unable to SSH to worker nodes #34838

Open sudharshankakumanu opened 1 year ago

sudharshankakumanu commented 1 year ago

I am using cluster launcher yaml to launch a ray cluster on existing GCP VMs.

setup:

ray version: 1.1.0 on local machine
using ray docker on the VMs: ray:2.1.0-py38-cu116
python: 3.8

Running ray up config.yaml from my local machine only sets up the head node; the worker nodes are untouched.

I have a private/public key pair generated on my local machine that is authorized to SSH into all the VMs, and I used the same key in config.yaml.

I verified that I am able to SSH into all the VMs from my local machine using ssh -i ~/.ssh/gcloud_instance kakumanu@<ext-ip-address>.
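
For completeness, the launcher flow from the local machine (standard Ray CLI commands; the status check is only one way to see how many nodes actually joined):

ray up config.yaml                    # sets up the head node (workers stay untouched, as described above)
ray attach config.yaml                # SSH into the head node through the launcher
ray exec config.yaml 'ray status'     # shows how many nodes have joined the cluster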

config.yaml

# An unique identifier for the head node and workers of this cluster.
cluster_name: text-2-img-manual
docker:
    image: "rayproject/ray:2.1.0-py38-cu116" # You can change this to latest-cpu if you don't need GPU support and want a faster startup
    # image: rayproject/ray:latest-gpu   # use this one if you don't need ML dependencies, it's faster to pull
    container_name: "ray_container"
    # If true, pulls latest version of image. Otherwise, `docker run` will only pull the image
    # if no cached version is present.
    pull_before_run: True
    run_options:   # Extra options to pass into "docker run"
        - --ulimit nofile=65536:65536

provider:
    type: local
    head_ip: 34.145.253.198
    # You may need to supply a public ip for the head node if you need
    # to run `ray up` from outside of the Ray cluster's network
    # (e.g. the cluster is in an AWS VPC and you're starting ray from your laptop)
    # This is useful when debugging the local node provider with cloud VMs.
    # external_head_ip: YOUR_HEAD_PUBLIC_IP
    worker_ips: [35.245.121.222,35.221.57.214]
    # Optional when running automatic cluster management on prem. If you use a coordinator server,
    # then you can launch multiple autoscaling clusters on the same set of machines, and the coordinator
    # will assign individual nodes to clusters as needed.
    #    coordinator_address: "<host>:<port>"

# How Ray will authenticate with newly launched nodes.
auth:
    ssh_user: kakumanu
    # You can comment out `ssh_private_key` if the following machines don't need a private key for SSH access to the Ray
    # cluster:
    #   (1) The machine on which `ray up` is executed.
    #   (2) The head node of the Ray cluster.
    #
    # The machine that runs ray up executes SSH commands to set up the Ray head node. The Ray head node subsequently
    # executes SSH commands to set up the Ray worker nodes. When you run ray up, ssh credentials sitting on the ray up
    # machine are copied to the head node -- internally, the ssh key is added to the list of file mounts to rsync to head node.
    ssh_private_key: ~/.ssh/gcloud_instance

# The minimum number of worker nodes to launch in addition to the head
# node. This number should be >= 0.
# Typically, min_workers == max_workers == len(worker_ips).
# This field is optional.
#min_workers: 1

# The maximum number of worker nodes to launch in addition to the head node.
# This takes precedence over min_workers.
# Typically, min_workers == max_workers == len(worker_ips).
# This field is optional.
#max_workers: 1
# The default behavior for manually managed clusters is
# min_workers == max_workers == len(worker_ips),
# meaning that Ray is started on all available nodes of the cluster.
# For automatically managed clusters, max_workers is required and min_workers defaults to 0.

# The autoscaler will scale up the cluster faster with higher upscaling speed.
# E.g., if the task requires adding more nodes then autoscaler will gradually
# scale up the cluster in chunks of upscaling_speed*currently_running_nodes.
# This number should be > 0.
upscaling_speed: 1.0

idle_timeout_minutes: 5

# Files or directories to copy to the head and worker nodes. The format is a
# dictionary from REMOTE_PATH: LOCAL_PATH, e.g.
file_mounts: {
    "/mining": "..",
#    "/path2/on/remote/machine": "/path2/on/local/machine",
}

# Files or directories to copy from the head node to the worker nodes. The format is a
# list of paths. The same path on the head node will be copied to the worker node.
# This behavior is a subset of the file_mounts behavior. In the vast majority of cases
# you should just use file_mounts. Only use this if you know what you're doing!
cluster_synced_files: []

# Whether changes to directories in file_mounts or cluster_synced_files in the head node
# should sync to the worker node continuously
file_mounts_sync_continuously: False

# Patterns for files to exclude when running rsync up or rsync down
rsync_exclude:
    - "**/.git"
    - "**/.git/**"

# Pattern files to use for filtering out files when running rsync up or rsync down. The file is searched for
# in the source directory and recursively through all subdirectories. For example, if .gitignore is provided
# as a value, the behavior will match git's behavior for finding and using .gitignore files.
rsync_filter:
    - ".gitignore"

# List of commands that will be run before `setup_commands`. If docker is
# enabled, these commands will run outside the container and before docker
# is setup.
initialization_commands: []

# List of shell commands to run to set up nodes.
setup_commands:
    # Note: if you're developing Ray, you probably want to create a Docker image that
    # has your Ray repo pre-cloned. Then, you can replace the pip installs
    # below with a git checkout <your_sha> (and possibly a recompile).
    # To run the nightly version of ray (as opposed to the latest), either use a rayproject docker image
    # that has the "nightly" (e.g. "rayproject/ray-ml:nightly-gpu") or uncomment the following line:
    # - pip install -U "ray[default] @ https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-3.0.0.dev0-cp37-cp37m-manylinux2014_x86_64.whl"
    - sudo apt-get update
    - sudo apt-get install ffmpeg libsm6 libxext6  -y.....

# Custom commands that will be run on the head node after common setup.
head_setup_commands: []

# Custom commands that will be run on worker nodes after common setup.
worker_setup_commands: []

# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
    - ray stop
    - >-
      ray start
      --head
      --port=6379
      --object-manager-port=8076
      --autoscaling-config=~/ray_bootstrap_config.yaml

# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
    - ray stop
    - >-
      ray start
      --address=$RAY_HEAD_IP:6379
      --object-manager-port=8076

I tried removing ssh_private_key: ~/.ssh/gcloud_instance from the config, but then I was unable to SSH into the head node (which makes sense, since my local machine needs the private key to SSH into the head node).

From the docs and the issue here, it is not clear whether the private key in config.yaml must be a key generated on the local machine or on the head node.

Monitoring the logs, it looks like ray_bootstrap_key.pem is used as the private key by the head node to SSH into workers, but the corresponding public key is not authorized on the worker nodes.

(python385) kakumanu-macOS:ray kakumanu$ ray exec /Users/kakumanu/ray-batch-jobs/ray/text-2-img-manual-config.yaml 'tail -n 100 -f /tmp/ray/session_latest/logs/monitor*'
Loaded cached provider configuration
If you experience issues with the cloud provider, try re-running the command with --no-config-cache.
2023-04-27 12:42:40,738 INFO node_provider.py:53 -- ClusterState: Loaded cluster state: ['34.145.253.198', '35.245.121.222', '35.221.57.214']
Fetched IP: 34.145.253.198
Warning: Permanently added '34.145.253.198' (ED25519) to the list of known hosts.
==> /tmp/ray/session_latest/logs/monitor.err <==
Warning: Permanently added '35.221.57.214' (ECDSA) to the list of known hosts.
Warning: Permanently added '35.245.121.222' (ECDSA) to the list of known hosts.
kakumanu@35.221.57.214: Permission denied (publickey).
kakumanu@35.245.121.222: Permission denied (publickey).
Warning: Permanently added '35.221.57.214' (ECDSA) to the list of known hosts.

I tried adding ~/.ssh/gcloud_instance to the head node, and with that key I was able to manually SSH into the worker nodes from the head node.
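
Roughly, that manual workaround looks like this (just a sketch using the IPs and paths from the config above; the launcher does not do this on its own):

# from the local machine: copy the already-authorized private key to the head node
scp -i ~/.ssh/gcloud_instance ~/.ssh/gcloud_instance kakumanu@34.145.253.198:~/.ssh/
# on the head node: lock down permissions and test SSH to a worker
chmod 600 ~/.ssh/gcloud_instance
ssh -i ~/.ssh/gcloud_instance kakumanu@35.245.121.222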

I would like to leverage the cluster launcher so I can avoid manual setup on a 25-node cluster with a large codebase to rsync. Please advise on any possible solutions.

rkooo567 commented 1 year ago

Ray 1.1 seems pretty old. Can you use a newer version on your local machine?
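
For example, something like:

pip install -U "ray[default]"   # or pin a release, e.g. "ray[default]==2.4.0"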

sudharshankakumanu commented 1 year ago

updated to 2.4.0, but no luck

zakajd commented 1 year ago

I observe the same problem, but on an AWS setup with Ray 2.3.1. @sudharshankakumanu did you find a way to fix it? Here is my config file (redacted in a few places for privacy):

cluster_name: cluster-dev

max_workers: 2
upscaling_speed: 1.0

provider:
  type: aws
  region: us-east-1
  cache_stopped_nodes: True # Cache stopped nodes to speed up future starts.

docker:
  head_image: <head-image-in-ECR-repo-tag>
  worker_image: <worker-image-in-ECR-repo-tag>
  container_name: ray-ml
  pull_before_run: True
  run_options:
    - --ulimit nofile=65536:65536
    - --env DATABASE_NAME=dev
  disable_automatic_runtime_detection: True
  disable_shm_size_detection: False

auth:
  ssh_user: ubuntu
  ssh_private_key: ~/Downloads/aws-cluster-key.pem

available_node_types:
  ray.head.default:
    node_config:
      LaunchTemplate:
        LaunchTemplateId: lt-09581174b5746050a
        Version: $Latest
      InstanceType: t2.small
    # We don't want any tasks to be scheduled on the head node.
    resources: { "CPU": 0, "GPU": 0 }
    min_workers: 1

  ray.worker.default:
    node_config:
      LaunchTemplate:
        LaunchTemplateId: lt-0c02f4a411183ef71
        Version: $Latest
      InstanceType: t2.xlarge
    min_workers: 0
    max_workers: 2

# Define the name of the head node from above here.
head_node_type: ray.head.default

initialization_commands:
  # Wait for apt-get to be available. (recommendation from https://github.com/ray-project/ray/issues/15893)
  - bash -c $'ps -e | grep apt | awk \'{print $1}\' | xargs tail -f --pid || true'
  # Login to ECR so we can pull images.
  - aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin <private-ECR-repo>;

setup_commands: []
head_setup_commands: []
worker_setup_commands: []

# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
  - ray stop
  - >-
    ray start
    --head
    --port=6379
    --object-manager-port=8076
    --autoscaling-config=~/ray_bootstrap_config.yaml
    --dashboard-host=0.0.0.0 
    --dashboard-agent-listen-port=52365
    --disable-usage-stats 

worker_start_ray_commands:
  - ray stop
  - >-
    ray start
    --address=$RAY_HEAD_IP:6379
    --object-manager-port=8076
    --disable-usage-stats

sudharshankakumanu commented 1 year ago

@zakajd in your case it looks like you are auto-launching the nodes, whereas I am trying to use the cluster launcher to launch Ray on existing nodes. You can check whether your head node has the right INSTANCE PROFILE to launch workers (see the snippet and the CLI check below).

    ray.head.default:
        resources: {"CPU": 1, "GPU": 1, "custom": 5}
        node_config:
            InstanceType: g5.12xlarge
            ImageId: ami-07135...
            IamInstanceProfile:
                Arn: arn:aws:iam::1234567890:instance-profile/head-node-instnace-profile
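
One way to confirm which instance profile the head node actually got (placeholder instance ID; plain AWS CLI, not a Ray command):

aws ec2 describe-instances --instance-ids i-0123456789abcdef0 \
    --query 'Reservations[].Instances[].IamInstanceProfile.Arn'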
stale[bot] commented 11 months ago

Hi, I'm a bot from the Ray team :)

To help human contributors focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months.

If there is no further activity within 14 days, the issue will be closed!

You can always ask for help on our discussion forum or Ray's public Slack channel.

jacksonjacobs1 commented 5 months ago

I am experiencing the same problem trying to launch a Ray cluster on existing Ubuntu nodes.

When I run ray up local_cluster_config.yaml --no-config-cache, the head node docker container is instantiated with all of the necessary files (.pem, .config), but the worker nodes are stuck in the "launching" state.

I verified that, from within the head node, I can SSH into the worker nodes without entering a password or confirming host keys.

(base) ray@HEAD_IP:~$ ssh -i ray_bootstrap_key.pem USERNAME@WORKER2_IP
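
For what it's worth, the autoscaler monitor logs on the head node (the same command used earlier in this thread) should show why the workers never leave the "launching" state:

ray exec local_cluster_config.yaml 'tail -n 100 /tmp/ray/session_latest/logs/monitor*'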

Here's my cluster launcher config:


cluster_name: my_cluster

docker:
    image: rayproject/ray:latest-cpu   # use this one if you don't need ML dependencies, it's faster to pull
    container_name: "hqc_ray"
    pull_before_run: False
    run_options: [] # Extra options to pass into "docker run"

provider:
    type: local
    head_ip: HEAD_IP

    worker_ips: [WORKER1_IP, WORKER2_IP]

# How Ray will authenticate with newly launched nodes.
auth:
    ssh_user: USERNAME
    ssh_private_key: ~/.ssh/id_rsa

min_workers: 2

max_workers: 2

upscaling_speed: 1.0

idle_timeout_minutes: 30

file_mounts: {
#    "/path1/on/remote/machine": "/path1/on/local/machine",
#    "/path2/on/remote/machine": "/path2/on/local/machine",
}

cluster_synced_files: []

file_mounts_sync_continuously: False

# Patterns for files to exclude when running rsync up or rsync down
rsync_exclude:
    - "**/.git"
    - "**/.git/**"

# Pattern files to use for filtering out files when running rsync up or rsync down. The file is searched for
# in the source directory and recursively through all subdirectories. For example, if .gitignore is provided
# as a value, the behavior will match git's behavior for finding and using .gitignore files.
rsync_filter:
    - ".gitignore"

# List of commands that will be run before `setup_commands`. If docker is
# enabled, these commands will run outside the container and before docker
# is setup.
initialization_commands: []

# List of shell commands to run to set up each node.
setup_commands: ['mkdir ~/.ssh', 'echo -e "Host *\n\tStrictHostKeyChecking no\n\tUserKnownHostsFile /dev/null" >> ~/.ssh/config']

# Custom commands that will be run on the head node after common setup.
head_setup_commands: []

# Custom commands that will be run on worker nodes after common setup.
worker_setup_commands: []

# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
  # If we have e.g. conda dependencies, we could create on each node a conda environment (see `setup_commands` section).
  # In that case we'd have to activate that env on each node before running `ray`:
  # - conda activate my_venv && ray stop
  # - conda activate my_venv && ulimit -c unlimited && ray start --head --port=6379 --autoscaling-config=~/ray_bootstrap_config.yaml
    - ray stop
    - ulimit -c unlimited && ray start --head --port=6379 --dashboard-port=8265 --autoscaling-config=~/ray_bootstrap_config.yaml

# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
  # If we have e.g. conda dependencies, we could create on each node a conda environment (see `setup_commands` section).
  # In that case we'd have to activate that env on each node before running `ray`:
  # - conda activate my_venv && ray stop
  # - ray start --address=$RAY_HEAD_IP:6379
    - ray stop
    - ray start --address=$RAY_HEAD_IP:6379