ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

Cannot start a simple local cluster using the config.yaml - workers are not found #42128

Open jav-ed opened 11 months ago

jav-ed commented 11 months ago

What happened + What you expected to happen

I have multiple PCs that are connected and can be accessed easily through ssh. Manually logging into a PC, i.e. a node, and defining it to be the head or a worker works fine. The issue arises when I try to do the very same thing using the config.yaml.

First, the manual procedure:

  1. ssh into a node that shall be the head
  2. activate the virtual environment
  3. ray start --head --port=6379

Now ssh into all the other machines that shall be the workers and run ray start --address=

Using ray status or viewing the dashboard, it can be observed that all the desired nodes are online.
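
Spelled out, the manual sequence that works reliably looks roughly like this (hostnames, ssh user and venv path are the same ones used in the config below; <head_ip> is whatever address ray start --head prints):

# on the head node
ssh DOM+jabu413e@ilrpoollin04
source ~/Progs/Virtual_Env/py_P_Bert/bin/activate
ray start --head --port=6379

# on each worker node (ilrpoollin08, ilrpoollin09)
ssh DOM+jabu413e@ilrpoollin08
source ~/Progs/Virtual_Env/py_P_Bert/bin/activate
ray start --address=<head_ip>:6379

# on any node: verify that the head and all workers are online
ray status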

Now this shall be replicated with the config.yaml below. However, it only finds the workers occasionally, when I am lucky; most of the time it does not find them.

cluster_name: default
provider:
    type: local
    head_ip: ilrpoollin04
    # You may need to supply a public ip for the head node if you need
    # to run `ray up` from outside of the Ray cluster's network
    # (e.g. the cluster is in an AWS VPC and you're starting ray from your laptop)
    # This is useful when debugging the local node provider with cloud VMs.
    # external_head_ip: YOUR_HEAD_PUBLIC_IP
    worker_ips: [ilrpoollin08, ilrpoollin09]
    # Optional when running automatic cluster management on prem. If you use a coordinator server,
    # then you can launch multiple autoscaling clusters on the same set of machines, and the coordinator
    # will assign individual nodes to clusters as needed.
    #    coordinator_address: "<host>:<port>"

# How Ray will authenticate with newly launched nodes.
auth:
    ssh_user: DOM+jabu413e

# The autoscaler will scale up the cluster faster with higher upscaling speed.
# E.g., if the task requires adding more nodes then autoscaler will gradually
# scale up the cluster in chunks of upscaling_speed*currently_running_nodes.
# This number should be > 0.
upscaling_speed: 1.0

idle_timeout_minutes: 5
cluster_synced_files: []

file_mounts_sync_continuously: False

# Patterns for files to exclude when running rsync up or rsync down
rsync_exclude:
    - "**/.git"
    - "**/.git/**"

rsync_filter:
    - ".gitignore"

# List of commands that will be run before `setup_commands`. If docker is
# enabled, these commands will run outside the container and before docker
# is setup.
initialization_commands: []
    # - source ~/Progs/Virtual_Env/py_P_Bert/bin/activate  

# List of shell commands to run to set up each node.
setup_commands: []

# Custom commands that will be run on the head node after common setup.
head_setup_commands: 
    - source ~/Progs/Virtual_Env/py_P_Bert/bin/activate

# Custom commands that will be run on worker nodes after common setup.
worker_setup_commands: 
    - source ~/Progs/Virtual_Env/py_P_Bert/bin/activate

# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
  # If we have e.g. conda dependencies, we could create on each node a conda environment (see `setup_commands` section).
  # In that case we'd have to activate that env on each node before running `ray`:
  # - conda activate my_venv && ray stop
  # - conda activate my_venv && ulimit -c unlimited && ray start --head --port=6379 --autoscaling-config=~/ray_bootstrap_config.yaml
    - source ~/Progs/Virtual_Env/py_P_Bert/bin/activate  && ray stop
    - source ~/Progs/Virtual_Env/py_P_Bert/bin/activate &&  ulimit -c unlimited && ray start --head --port=6379 --autoscaling-config=~/ray_bootstrap_config.yaml

# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
  # If we have e.g. conda dependencies, we could create on each node a conda environment (see `setup_commands` section).
  # In that case we'd have to activate that env on each node before running `ray`:
  # - conda activate my_venv && ray stop
  # - ray start --address=$RAY_HEAD_IP:6379
    - source ~/Progs/Virtual_Env/py_P_Bert/bin/activate  && ray stop
    # - source ~/Progs/Virtual_Env/py_P_Bert/bin/activate  && ray start --address=$RAY_HEAD_IP:6379
    - source ~/Progs/Virtual_Env/py_P_Bert/bin/activate  && ray start --address=141.30.159.38:6379
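
For reference, the cluster is driven with the standard cluster-launcher commands (assuming the file above is saved as config.yaml):

ray up -y config.yaml                 # start the head and let the autoscaler bring up the listed worker_ips
ray exec config.yaml 'ray status'     # check which nodes actually joined
ray down -y config.yaml               # tear the cluster down again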

Versions / Dependencies

(py_P_Bert) ➜  0_Yamls git:(main) ✗ python --version
Python 3.9.18
(py_P_Bert) ➜  0_Yamls git:(main) ✗ ray --version
ray, version 2.9.0
(py_P_Bert) ➜  0_Yamls git:(main) ✗ lsb_release -a
LSB Version:    core-2.0-noarch:core-3.2-noarch:core-4.0-noarch:core-2.0-x86_64:core-3.2-x86_64:core-4.0-x86_64:desktop-4.0.fake-amd64:desktop-4.0.fake-noarch:graphics-2.0-amd64:graphics-2.0-noarch:graphics-3.2-amd64:graphics-3.2-noarch:graphics-4.0.fake-amd64:graphics-4.0.fake-noarch
Distributor ID: openSUSE
Description:    openSUSE Leap 15.5
Release:    15.5
Codename:   n/a

Reproduction script

Please see the description above; the config.yaml shown there is the reproduction script.

Issue Severity

Medium: It is a significant difficulty but I can work around it.

anyscalesam commented 10 months ago

@architkulkarni can you review and triage?

millefalcon commented 5 months ago

@anyscalesam Hello folks, we're facing the same issue. Any updates or suggestions for working around this? Thanks

millefalcon commented 4 months ago

Hello folks, I have found that if we use ray stop --force in both the head and worker start ray commands, it seems to work. Also, I had to follow https://github.com/ray-project/ray/issues/39565#issuecomment-1846595876 for the worker to start the next time if I had previously shut down the cluster (the worker has to be brought down manually for it to stop).
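
With that change, the start sections of the yaml from this issue would look roughly like this (just a sketch, reusing the venv path from above; the worker line can equally use $RAY_HEAD_IP or the hard-coded head address):

head_start_ray_commands:
    - source ~/Progs/Virtual_Env/py_P_Bert/bin/activate && ray stop --force
    - source ~/Progs/Virtual_Env/py_P_Bert/bin/activate && ulimit -c unlimited && ray start --head --port=6379 --autoscaling-config=~/ray_bootstrap_config.yaml

worker_start_ray_commands:
    - source ~/Progs/Virtual_Env/py_P_Bert/bin/activate && ray stop --force
    - source ~/Progs/Virtual_Env/py_P_Bert/bin/activate && ray start --address=$RAY_HEAD_IP:6379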

https://github.com/ray-project/ray/issues/46204 and https://github.com/ray-project/ray/issues/45571 seem related.

pratos commented 3 months ago

@millefalcon I tried using ray stop --force before the head and worker start commands as you suggested, but haven't been able to set up the worker node. I have a yaml file similar to the one presented in this issue. Can you share your yaml file and perhaps the steps that you performed?

millefalcon commented 3 months ago

@pratos I don't have the exact yaml at the moment, but it is mostly similar to example-full.yaml (local). The main difference was that I tried ray stop --force instead of just ray stop for both the head node and the workers.

I followed the exact steps as mentioned here https://github.com/ray-project/ray/issues/39565#issuecomment-1846595876.

Note: In hindsight, it only worked intermittently. I had to write a wrapper script that sshes into the worker nodes and runs the ray stop; ... ray start ... sequence to make it work every time.
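
It was roughly along these lines (not the exact script, just the idea; hostnames, ssh user, venv path and head address are the ones from the yaml in this issue):

#!/usr/bin/env bash
# Restart Ray on every worker over ssh so it re-joins the head.
HEAD_ADDR="141.30.159.38:6379"
WORKERS="ilrpoollin08 ilrpoollin09"
VENV_ACTIVATE="~/Progs/Virtual_Env/py_P_Bert/bin/activate"

for host in $WORKERS; do
    ssh "DOM+jabu413e@${host}" \
        "source ${VENV_ACTIVATE} && ray stop --force && ray start --address=${HEAD_ADDR}"
done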

So I guess it didn't fully fix my issue, sorry.