ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[ray local cluster] nodes marked as uninitialized #39565

Open jmakov opened 1 year ago

jmakov commented 1 year ago

What happened + What you expected to happen

Running ray up ray.yaml, I'd expect all 4 nodes to be set up and join the cluster, since I've set min_workers: 4. However, ray monitor ray.yaml shows the nodes as uninitialized.
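
For reference, the commands being run (nothing beyond the config below):

```
# launch the on-prem cluster described by ray.yaml (head + 4 workers)
ray up ray.yaml

# watch the autoscaler; this is where the workers show up as "uninitialized"
ray monitor ray.yaml
```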

Versions / Dependencies

ray 2.6.4, Python 3.9.18, Manjaro

Reproduction script

ray.yaml

# A unique identifier for the head node and workers of this cluster.
cluster_name: test

# Running Ray in Docker images is optional (this docker section can be commented out).
# This executes all commands on all nodes in the docker container,
# and opens all the necessary ports to support the Ray cluster.
# Empty string means disabled. Assumes Docker is installed.
#docker:
#    image: "rayproject/ray-ml:latest-gpu" # You can change this to latest-cpu if you don't need GPU support and want a faster startup
#    # image: rayproject/ray:latest-gpu   # use this one if you don't need ML dependencies, it's faster to pull
#    container_name: "ray_container"
#    # If true, pulls latest version of image. Otherwise, `docker run` will only pull the image
#    # if no cached version is present.
#    pull_before_run: True
#    run_options:   # Extra options to pass into "docker run"
#        - --ulimit nofile=65536:65536

provider:
    type: local
    head_ip: 192.168.0.101
    # You may need to supply a public ip for the head node if you need
    # to run `ray up` from outside of the Ray cluster's network
    # (e.g. the cluster is in an AWS VPC and you're starting ray from your laptop)
    # This is useful when debugging the local node provider with cloud VMs.
    # external_head_ip: YOUR_HEAD_PUBLIC_IP
    worker_ips:
      - 192.168.0.106
      - 192.168.0.107
      - 192.168.0.108
      - 192.168.0.110
    # Optional when running automatic cluster management on prem. If you use a coordinator server,
    # then you can launch multiple autoscaling clusters on the same set of machines, and the coordinator
    # will assign individual nodes to clusters as needed.
    #    coordinator_address: "<host>:<port>"

# How Ray will authenticate with newly launched nodes.
auth:
    ssh_user: myuser
    # You can comment out `ssh_private_key` if the following machines don't need a private key for SSH access to the Ray
    # cluster:
    #   (1) The machine on which `ray up` is executed.
    #   (2) The head node of the Ray cluster.
    #
    # The machine that runs ray up executes SSH commands to set up the Ray head node. The Ray head node subsequently
    # executes SSH commands to set up the Ray worker nodes. When you run ray up, ssh credentials sitting on the ray up
    # machine are copied to the head node -- internally, the ssh key is added to the list of file mounts to rsync to head node.
    # ssh_private_key: ~/.ssh/id_rsa

# The minimum number of worker nodes to launch in addition to the head
# node. This number should be >= 0.
# Typically, min_workers == max_workers == len(worker_ips).
# This field is optional.
min_workers: 4

# The maximum number of worker nodes to launch in addition to the head node.
# This takes precedence over min_workers.
# Typically, min_workers == max_workers == len(worker_ips).
# This field is optional.
#max_workers: 4
# The default behavior for manually managed clusters is
# min_workers == max_workers == len(worker_ips),
# meaning that Ray is started on all available nodes of the cluster.
# For automatically managed clusters, max_workers is required and min_workers defaults to 0.

# The autoscaler will scale up the cluster faster with higher upscaling speed.
# E.g., if the task requires adding more nodes, then the autoscaler will gradually
# scale up the cluster in chunks of upscaling_speed*currently_running_nodes.
# This number should be > 0.
upscaling_speed: 1.0

idle_timeout_minutes: 5

# Files or directories to copy to the head and worker nodes. The format is a
# dictionary from REMOTE_PATH: LOCAL_PATH. E.g. you could save your conda env to an environment.yaml file, mount
# that directory to all nodes and call `conda env update -n my_env -f /path1/on/remote/machine/environment.yaml`. In this
# example paths on all nodes must be the same (so that conda can be called always with the same argument)
file_mounts: {
    "/mnt/ray": ".",
}

# Files or directories to copy from the head node to the worker nodes. The format is a
# list of paths. The same path on the head node will be copied to the worker node.
# This behavior is a subset of the file_mounts behavior. In the vast majority of cases
# you should just use file_mounts. Only use this if you know what you're doing!
cluster_synced_files: []

# Whether changes to directories in file_mounts or cluster_synced_files in the head node
# should sync to the worker node continuously
file_mounts_sync_continuously: False

# Patterns for files to exclude when running rsync up or rsync down
rsync_exclude:
    - "**/.git"
    - "**/.git/**"

# Pattern files to use for filtering out files when running rsync up or rsync down. The file is searched for
# in the source directory and recursively through all subdirectories. For example, if .gitignore is provided
# as a value, the behavior will match git's behavior for finding and using .gitignore files.
rsync_filter:
    - ".gitignore"

# List of commands that will be run before `setup_commands`. If docker is
# enabled, these commands will run outside the container and before docker
# is setup.
initialization_commands: []

# List of shell commands to run to set up each node.
setup_commands:
    # If we have e.g. conda dependencies stored in "/path1/on/local/machine/environment.yaml", we can prepare the
    # work environment on each worker by:
    #   1. making sure each worker has access to this file i.e. see the `file_mounts` section
    #   2. adding a command here that creates a new conda environment on each node or if the environment already exists,
    #     it updates it:
    #      conda env create -q -n my_venv -f /path1/on/local/machine/environment.yaml || conda env update -q -n my_venv -f /path1/on/local/machine/environment.yaml
    #
    # Ray developers:
    # you probably want to create a Docker image that
    # has your Ray repo pre-cloned. Then, you can replace the pip installs
    # below with a git checkout <your_sha> (and possibly a recompile).
    # To run the nightly version of ray (as opposed to the latest), either use a rayproject docker image
    # that has the "nightly" (e.g. "rayproject/ray-ml:nightly-gpu") or uncomment the following line:
    # - pip install -U "ray[default] @ https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-3.0.0.dev0-cp37-cp37m-manylinux2014_x86_64.whl"
    - source ~/mambaforge-pypy3/etc/profile.d/conda.sh && mamba env update -f /mnt/ray/env.yaml --prune

# Custom commands that will be run on the head node after common setup.
head_setup_commands: []

# Custom commands that will be run on worker nodes after common setup.
worker_setup_commands: []

# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
  # If we have e.g. conda dependencies, we could create on each node a conda environment (see `setup_commands` section).
  # In that case we'd have to activate that env on each node before running `ray`:
  # - conda activate my_venv && ray stop
  # - conda activate my_venv && ulimit -c unlimited && ray start --head --port=6379 --autoscaling-config=~/ray_bootstrap_config.yaml
    - source ~/mambaforge-pypy3/etc/profile.d/conda.sh && conda activate test && ray stop
    - source ~/mambaforge-pypy3/etc/profile.d/conda.sh && conda activate test && ulimit -c unlimited && ray start --head --disable-usage-stats --port=6379 --autoscaling-config=~/ray_bootstrap_config.yaml --system-config='{"automatic_object_spilling_enabled":true,"max_io_workers":8,"min_spilling_size":104857600,"object_spilling_config":"{\"type\":\"filesystem\",\"params\":{\"directory_path\":\"/mnt/ray/object_spilling\"}}"}'

# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
  # If we have e.g. conda dependencies, we could create on each node a conda environment (see `setup_commands` section).
  # In that case we'd have to activate that env on each node before running `ray`:
  # - conda activate my_venv && ray stop
  # - ray start --address=$RAY_HEAD_IP:6379
    - source ~/mambaforge-pypy3/etc/profile.d/conda.sh && conda activate test && ray stop
    - source ~/mambaforge-pypy3/etc/profile.d/conda.sh && conda activate test && ulimit -c unlimited && ray start --address=$RAY_HEAD_IP:6379 --disable-usage-stats

Issue Severity

High: It blocks me from completing my task.

rkooo567 commented 1 year ago

cc @rickyyx can you follow up with the investigation?

    type: local
    head_ip: 192.168.0.101
    # You may need to supply a public ip for the head node if you need
    # to run `ray up` from outside of the Ray cluster's network
    # (e.g. the cluster is in an AWS VPC and you're starting ray from your laptop)
    # This is useful when debugging the local node provider with cloud VMs.
    # external_head_ip: YOUR_HEAD_PUBLIC_IP
    worker_ips:
      - 192.168.0.106
      - 192.168.0.107
      - 192.168.0.108
      - 192.168.0.110

Can you tell us what exactly this is for?

rickyyx commented 1 year ago

Hey @jmakov - would you be able to share any monitor.* logs that were generated? That would be helpful for debugging.

jmakov commented 1 year ago

Didn't see anything exciting happening there, only monitor.log has some entries:

```
2023-09-22 21:37:07,546 INFO monitor.py:699 -- Starting monitor using ray installation: /home/jernej_m/mambaforge-pypy3/envs/test_ray/lib/python3.10/site-packages/ray/__init__.py
2023-09-22 21:37:07,546 INFO monitor.py:700 -- Ray version: 2.6.3
2023-09-22 21:37:07,546 INFO monitor.py:701 -- Ray commit: {{RAY_COMMIT_SHA}}
2023-09-22 21:37:07,546 INFO monitor.py:702 -- Monitor started with command: ['/home/jernej_m/mambaforge-pypy3/envs/test_ray/lib/python3.10/site-packages/ray/autoscaler/_private/monitor.py', '--logs-dir=/tmp/ray/session_2023-09-22_21-37-05_827384_110848/logs', '--logging-rotate-bytes=536870912', '--logging-rotate-backup-count=5', '--gcs-address=192.168.0.101:6379', '--autoscaling-config=~/ray_bootstrap_config.yaml', '--monitor-ip=192.168.0.101']
2023-09-22 21:37:07,552 INFO monitor.py:167 -- session_name: session_2023-09-22_21-37-05_827384_110848
2023-09-22 21:37:07,554 INFO monitor.py:199 -- Starting autoscaler metrics server on port 44217
2023-09-22 21:37:07,556 INFO monitor.py:224 -- Monitor: Started
2023-09-22 21:37:07,571 INFO node_provider.py:53 -- ClusterState: Loaded cluster state: []
2023-09-22 21:37:07,572 INFO node_provider.py:114 -- ClusterState: Writing cluster state: ['192.168.0.106', '192.168.0.107', '192.168.0.108', '192.168.0.110', '192.168.0.101']
2023-09-22 21:37:07,572 INFO autoscaler.py:274 -- disable_node_updaters:False
2023-09-22 21:37:07,572 INFO autoscaler.py:282 -- disable_launch_config_check:False
2023-09-22 21:37:07,572 INFO autoscaler.py:294 -- foreground_node_launch:False
2023-09-22 21:37:07,572 INFO autoscaler.py:304 -- worker_liveness_check:True
2023-09-22 21:37:07,572 INFO autoscaler.py:312 -- worker_rpc_drain:True
2023-09-22 21:37:07,573 INFO autoscaler.py:362 -- StandardAutoscaler: {'cluster_name': 'test', 'auth': {'ssh_user': 'jernej_m', 'ssh_private_key': '~/ray_bootstrap_key.pem'}, 'upscaling_speed': 1.0, 'idle_timeout_minutes': 5, 'docker': {}, 'initialization_commands': [], 'setup_commands': ['source ~/mambaforge-pypy3/etc/profile.d/conda.sh && mamba env update -f /mnt/ray/mount/env.yaml -n test_ray --prune'], 'head_setup_commands': ['source ~/mambaforge-pypy3/etc/profile.d/conda.sh && mamba env update -f /mnt/ray/mount/env.yaml -n test_ray --prune'], 'worker_setup_commands': ['source ~/mambaforge-pypy3/etc/profile.d/c>
2023-09-22 21:37:07,574 INFO monitor.py:394 -- Autoscaler has not yet received load metrics. Waiting.
2023-09-22 21:37:12,588 INFO autoscaler.py:141 -- The autoscaler took 0.0 seconds to fetch the list of non-terminated nodes.
2023-09-22 21:37:12,588 INFO load_metrics.py:161 -- LoadMetrics: Removed ip: 192.168.0.108.
2023-09-22 21:37:12,588 INFO load_metrics.py:164 -- LoadMetrics: Removed 1 stale ip mappings: {'192.168.0.108'} not in {'192.168.0.101'}
2023-09-22 21:37:12,589 INFO autoscaler.py:421 -- ======== Autoscaler status: 2023-09-22 21:37:12.589294 ========
Node status
---------------------------------------------------------------
Healthy:
 1 local.cluster.node
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 0.0/32.0 CPU
 0.0/2.0 GPU
 0B/77.60GiB memory
 0B/37.25GiB object_store_memory

Demands:
 (no resource demands)
2023-09-22 21:37:12,590 INFO autoscaler.py:1368 -- StandardAutoscaler: Queue 4 new nodes for launch
2023-09-22 21:37:12,590 INFO autoscaler.py:464 -- The autoscaler took 0.003 seconds to complete the update iteration.
2023-09-22 21:37:12,591 INFO node_launcher.py:174 -- NodeLauncher0: Got 4 nodes to launch.
2023-09-22 21:37:12,592 INFO monitor.py:424 -- :event_summary:Resized to 56 CPUs, 4 GPUs.
2023-09-22 21:37:12,594 INFO node_provider.py:114 -- ClusterState: Writing cluster state: ['192.168.0.106', '192.168.0.107', '192.168.0.108', '192.168.0.110', '192.168.0.101']
2023-09-22 21:37:12,594 INFO node_provider.py:114 -- ClusterState: Writing cluster state: ['192.168.0.106', '192.168.0.107', '192.168.0.108', '192.168.0.110', '192.168.0.101']
2023-09-22 21:37:12,595 INFO node_provider.py:114 -- ClusterState: Writing cluster state: ['192.168.0.106', '192.168.0.107', '192.168.0.108', '192.168.0.110', '192.168.0.101']
2023-09-22 21:37:12,596 INFO node_provider.py:114 -- ClusterState: Writing cluster state: ['192.168.0.106', '192.168.0.107', '192.168.0.108', '192.168.0.110', '192.168.0.101']
2023-09-22 21:37:12,596 INFO node_launcher.py:174 -- NodeLauncher0: Launching 4 nodes, type local.cluster.node.
2023-09-22 21:37:17,608 INFO autoscaler.py:141 -- The autoscaler took 0.001 seconds to fetch the list of non-terminated nodes.
2023-09-22 21:37:17,609 INFO autoscaler.py:421 -- ======== Autoscaler status: 2023-09-22 21:37:17.609649 ========
Node status
---------------------------------------------------------------
Healthy:
 2 local.cluster.node
Pending:
 192.168.0.106: local.cluster.node, uninitialized
 192.168.0.107: local.cluster.node, uninitialized
 192.168.0.110: local.cluster.node, uninitialized
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 0.0/56.0 CPU
 0.0/4.0 GPU
 0B/98.01GiB memory
 0B/46.00GiB object_store_memory

Demands:
 (no resource demands)
2023-09-22 21:37:17,619 INFO autoscaler.py:1316 -- Creating new (spawn_updater) updater thread for node 192.168.0.106.
2023-09-22 21:37:17,620 INFO autoscaler.py:1316 -- Creating new (spawn_updater) updater thread for node 192.168.0.107.
2023-09-22 21:37:17,620 INFO autoscaler.py:1316 -- Creating new (spawn_updater) updater thread for node 192.168.0.108.
2023-09-22 21:37:17,620 INFO autoscaler.py:1316 -- Creating new (spawn_updater) updater thread for node 192.168.0.110.
```

Running everything manually works. It would be nice to have a working cluster launcher for on-prem clusters.

ajaichemmanam commented 1 year ago

+1, same issue for me, even with systems on a third-party cloud (not AWS/GCS/Azure). I opened all ports; sometimes it gets connected, sometimes it shows uninitialized.

rickyyx commented 1 year ago

cc @gvspraveen could someone from the cluster team help take a look? I believe this is more relevant to the cluster launcher right now than to the actual autoscaling logic, since "running everything manually works".

jmakov commented 1 year ago

@rickyyx not to mention manually starting ray not working and the cluster launcher not working. I wonder how ray works at all for anybody. As someone who has used ray for more than a year, every other release breaks a core part.

rkooo567 commented 1 year ago

cc @anyscalesam can you triage this issue with @gvspraveen?

architkulkarni commented 1 year ago

I'm able to reproduce this on AWS using pip install "ray[default]"==2.7.0 in the setup commands and using the latest ray master on the client side for the cluster launcher. [See below: it was just a port issue on my end.]

@jmakov do you happen to remember if this was working for you on a previous version of Ray, and if so which one?

jmakov commented 1 year ago

The cluster launcher worked for me for the last 2+ years using a local cluster (without Docker, just a conda env). I think it was 2.6.0 before I made the mistake of upgrading, if I remember correctly. I think I'll just start writing my own tests and run them before every upgrade.
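
For example, a minimal pre-upgrade smoke test could look like this (just a sketch; cluster.yaml stands in for whatever launcher config is in use):

```
#!/usr/bin/env bash
# Minimal cluster-launcher smoke test (sketch).
set -euo pipefail
CONFIG=cluster.yaml   # placeholder for the actual launcher config

ray up -y "$CONFIG"
# Print the autoscaler's view from the head node; workers stuck in
# "launching"/"uninitialized" are visible here.
ray exec "$CONFIG" 'ray status'
ray down -y "$CONFIG"
```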

ajaichemmanam commented 1 year ago
```
2023-10-09 11:46:28,208 INFO node_provider.py:53 -- ClusterState: Loaded cluster state: ['216.48.179.215', '164.52.201.70']
Fetched IP: 164.52.201.70
Warning: Permanently added '164.52.201.70' (ED25519) to the list of known hosts.
==> /tmp/ray/session_latest/logs/monitor.err <==

==> /tmp/ray/session_latest/logs/monitor.log <==
2023-10-08 23:13:33,485 INFO monitor.py:690 -- Starting monitor using ray installation: /home/ray/anaconda3/lib/python3.11/site-packages/ray/__init__.py
2023-10-08 23:13:33,485 INFO monitor.py:691 -- Ray version: 2.7.1
2023-10-08 23:13:33,485 INFO monitor.py:692 -- Ray commit: 9f07c12615958c3af3760604f6dcacc4b3758a47
2023-10-08 23:13:33,486 INFO monitor.py:693 -- Monitor started with command: ['/home/ray/anaconda3/lib/python3.11/site-packages/ray/autoscaler/_private/monitor.py', '--logs-dir=/tmp/ray/session_2023-10-08_23-13-32_012785_2484/logs', '--logging-rotate-bytes=536870912', '--logging-rotate-backup-count=5', '--gcs-address=164.52.201.70:6379', '--autoscaling-config=/home/ray/ray_bootstrap_config.yaml', '--monitor-ip=164.52.201.70']
2023-10-08 23:13:33,489 INFO monitor.py:159 -- session_name: session_2023-10-08_23-13-32_012785_2484
2023-10-08 23:13:33,490 INFO monitor.py:191 -- Starting autoscaler metrics server on port 44217
2023-10-08 23:13:33,491 INFO monitor.py:216 -- Monitor: Started
2023-10-08 23:13:33,506 INFO node_provider.py:53 -- ClusterState: Loaded cluster state: []
2023-10-08 23:13:33,507 INFO node_provider.py:114 -- ClusterState: Writing cluster state: ['216.48.179.215', '164.52.201.70']
2023-10-08 23:13:33,507 INFO autoscaler.py:274 -- disable_node_updaters:False
2023-10-08 23:13:33,507 INFO autoscaler.py:282 -- disable_launch_config_check:False
2023-10-08 23:13:33,507 INFO autoscaler.py:294 -- foreground_node_launch:False
2023-10-08 23:13:33,507 INFO autoscaler.py:304 -- worker_liveness_check:True
2023-10-08 23:13:33,507 INFO autoscaler.py:312 -- worker_rpc_drain:True
2023-10-08 23:13:33,508 INFO autoscaler.py:362 -- StandardAutoscaler: {'cluster_name': 'default', 'auth': {'ssh_user': 'user', 'ssh_private_key': '~/ray_bootstrap_key.pem'}, 'upscaling_speed': 1.0, 'idle_timeout_minutes': 30, 'docker': {'image': 'rayproject/ray:2.7.1.9f07c1-py311-gpu', 'worker_image': 'rayproject/ray:2.7.1.9f07c1-py311-gpu', 'container_name': 'ray_container', 'pull_before_run': True, 'run_options': ['--ulimit nofile=65536:65536']}, 'initialization_commands': [], 'setup_commands': ['sudo apt-get update', 'sudo apt-get install gcc ffmpeg libsm6 libxext6  -y', 'pip install -r "/app/requirements-gpu.txt"'], 'head_setup_commands': ['sudo apt-get update', 'sudo apt-get install gcc ffmpeg libsm6 libxext6  -y', 'pip install -r "/app/requirements-gpu.txt"'], 'worker_setup_commands': ['sudo apt-get update', 'sudo apt-get install gcc ffmpeg libsm6 libxext6  -y', 'pip install -r "/app/requirements-gpu.txt"'], 'head_start_ray_commands': ['ray stop', 'ulimit -c unlimited && export RAY_health_check_timeout_ms=30000 && ray start --head --node-ip-address=164.52.201.70 --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml --dashboard-host=0.0.0.0 --disable-usage-stats --log-color=auto -v'], 'worker_start_ray_commands': ['ray stop', 'ray start --address=164.52.201.70:6379 --object-manager-port=8076'], 'file_mounts': {'~/.ssh/id_rsa': '/home/ray/.ssh/id_rsa', '/app/requirements-gpu.txt': '/app/requirements-gpu.txt'}, 'cluster_synced_files': [], 'file_mounts_sync_continuously': False, 'rsync_exclude': ['**/.git', '**/.git/**'], 'rsync_filter': ['.gitignore'], 'provider': {'type': 'local', 'head_ip': '164.52.201.70', 'worker_ips': ['216.48.179.215']}, 'available_node_types': {'local.cluster.node': {'node_config': {}, 'resources': {}, 'min_workers': 1, 'max_workers': 1}}, 'head_node_type': 'local.cluster.node', 'max_workers': 1, 'no_restart': False}
2023-10-08 23:13:33,509 INFO monitor.py:385 -- Autoscaler has not yet received load metrics. Waiting.
2023-10-08 23:13:38,522 INFO autoscaler.py:141 -- The autoscaler took 0.0 seconds to fetch the list of non-terminated nodes.
2023-10-08 23:13:38,522 INFO autoscaler.py:421 -- 
======== Autoscaler status: 2023-10-08 23:13:38.522726 ========
Node status
---------------------------------------------------------------
Healthy:
 1 local.cluster.node
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 0.0/12.0 CPU
 0.0/1.0 GPU
 0B/28.57GiB memory
 0B/14.29GiB object_store_memory

Demands:
 (no resource demands)
2023-10-08 23:13:38,524 INFO autoscaler.py:1379 -- StandardAutoscaler: Queue 1 new nodes for launch
2023-10-08 23:13:38,524 INFO autoscaler.py:464 -- The autoscaler took 0.002 seconds to complete the update iteration.
2023-10-08 23:13:38,524 INFO node_launcher.py:177 -- NodeLauncher0: Got 1 nodes to launch.
2023-10-08 23:13:38,525 INFO monitor.py:415 -- :event_summary:Resized to 12 CPUs, 1 GPUs.
2023-10-08 23:13:38,526 INFO node_provider.py:114 -- ClusterState: Writing cluster state: ['216.48.179.215', '164.52.201.70']
2023-10-08 23:13:38,526 INFO node_launcher.py:177 -- NodeLauncher0: Launching 1 nodes, type local.cluster.node.
2023-10-08 23:13:43,534 INFO autoscaler.py:141 -- The autoscaler took 0.0 seconds to fetch the list of non-terminated nodes.
2023-10-08 23:13:43,534 INFO autoscaler.py:421 -- 
======== Autoscaler status: 2023-10-08 23:13:43.534774 ========
Node status
---------------------------------------------------------------
Healthy:
 1 local.cluster.node
Pending:
 216.48.179.215: local.cluster.node, uninitialized
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 0.0/12.0 CPU
 0.0/1.0 GPU
 0B/28.57GiB memory
 0B/14.29GiB object_store_memory

Demands:
 (no resource demands)
2023-10-08 23:13:43,537 INFO autoscaler.py:1326 -- Creating new (spawn_updater) updater thread for node 216.48.179.215.
```

ajaichemmanam commented 1 year ago

The above log is from Ray version 2.7.1, commit 9f07c12615958c3af3760604f6dcacc4b3758a47.

jmakov commented 1 year ago

This issue is still present in ray 2.7.1

ajaichemmanam commented 1 year ago

Let us know if any other details are required

architkulkarni commented 1 year ago

Actually, when I reproduced the issue earlier, I had forgotten to open all the ports. After opening all ports, I wasn't able to reproduce the issue.

@jmakov or @ajaichemmanam if you're able to reproduce the issue and you have time, it would potentially be very helpful if you could amend your YAML file as follows:

worker_start_ray_commands:
    - ray stop
    - "echo \"Executing: ray start --address=$RAY_HEAD_IP:6379\" >> ray_worker_output.txt"
    - ray start --address=$RAY_HEAD_IP:6379 >> ray_worker_output.txt 2>&1

And share the ray_worker_output.txt from the failing worker nodes. (Or modify the commands in any way you see fit, as long as we can see the output of ray start --address=...)

jmakov commented 1 year ago

@architkulkarni I've added ulimit -c unlimited && ray start --address=$RAY_HEAD_IP:6379 --disable-usage-stats >> /tmp/ray_worker_output.txt 2>&1 and get

ls /tmp/ray_worker_output.txt
ls: cannot access '/tmp/ray_worker_output.txt': No such file or directory

architkulkarni commented 1 year ago

@jmakov Thanks! I think this means the command was never run. I don't want to take up too much of your time with the back-and-forth here, but one thing that might help confirm this and narrow things down is if we add something like "echo setup_command was run >> /tmp/ray_worker_output.txt" as the first item in setup_commands.
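
For example (sketch only, reusing the conda-based setup command from the ray.yaml above):

```
setup_commands:
    # probe: if this file never appears on a worker, setup_commands never ran there
    - echo "setup_command was run" >> /tmp/ray_worker_output.txt
    - source ~/mambaforge-pypy3/etc/profile.d/conda.sh && mamba env update -f /mnt/ray/env.yaml --prune
```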

Another mystery is why the worker node 192.168.0.108 was able to join in your monitor.log above, but not the other worker nodes.

ajaichemmanam commented 1 year ago

I tried it too. However, as I said before, the Docker container on the worker never even gets started/running. So even putting "echo setup_command was run >> /tmp/ray_worker_output.txt" as the first item in the worker setup_commands doesn't work.

2023-10-20 11:56:17,855 INFO monitor.py:385 -- Autoscaler has not yet received load metrics. Waiting.
2023-10-20 11:56:22,867 INFO autoscaler.py:141 -- The autoscaler took 0.0 seconds to fetch the list of non-terminated nodes.
2023-10-20 11:56:22,868 INFO autoscaler.py:421 -- ======== Autoscaler status: 2023-10-20 11:56:22.868239 ========

Node status

Healthy: 1 local.cluster.node
Pending: (no pending nodes)
Recent failures: (no failures)

Resources

Usage: 0.0/12.0 CPU 0.0/1.0 GPU 0B/28.58GiB memory 0B/14.29GiB object_store_memory

Demands: (no resource demands)

2023-10-20 11:56:22,869 INFO autoscaler.py:1379 -- StandardAutoscaler: Queue 1 new nodes for launch
2023-10-20 11:56:22,869 INFO autoscaler.py:464 -- The autoscaler took 0.002 seconds to complete the update iteration.
2023-10-20 11:56:22,869 INFO node_launcher.py:177 -- NodeLauncher0: Got 1 nodes to launch.

==> /tmp/ray/session_latest/logs/monitor.out <==

ajaichemmanam commented 1 year ago

https://github.com/ray-project/ray/issues/38718

This might be a related issue

jmakov commented 1 year ago

@architkulkarni "Another mystery is why the worker node 192.168.0.108 was able to join in your monitor.log" If it hels, I start ray up cluster.yaml from 192.168.0.110. 192.168.0.101 is the head node (which can SSH into all other nodes). And I don't use any firewalls.

architkulkarni commented 1 year ago

@architkulkarni "Another mystery is why the worker node 192.168.0.108 was able to join in your monitor.log" If it hels, I start ray up cluster.yaml from 192.168.0.110. 192.168.0.108 is the head node (which can SSH into all other nodes). And I don't use any firewalls.

Oh interesting, but in your monitor.log it says two nodes have successfully joined (I think one head and one worker) and the yaml has 108 as a worker node:

    head_ip: 192.168.0.101
    worker_ips:
      - 192.168.0.106
      - 192.168.0.107
      - 192.168.0.108
      - 192.168.0.110

But maybe it's a different run.

@ajaichemmanam thanks for the additional details, it should be helpful for trying to reproduce the issue on our end. How were you able to determine that the docker container didn't start?

jmakov commented 1 year ago

Yes, my mistake, 101 is the head node (have updated my prev comment).

ajaichemmanam commented 1 year ago

I logged into the worker system via SSH and checked whether any containers were running with docker ps -a. I couldn't find any related containers running on the worker node.
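
Roughly, the check looks like this (user, IP and container name are the values from the configs above; adjust to yours):

```
# SSH to the worker and look for the Ray container
ssh user@216.48.179.215 'docker ps -a'

# if a container named "ray_container" did exist, its logs would show why ray start failed
ssh user@216.48.179.215 'docker logs ray_container'
```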

jmakov commented 1 year ago

Perhaps a note: I'm not running containers; ray runs directly on the nodes in a conda venv.

ajaichemmanam commented 12 months ago

Any update on this?

architkulkarni commented 11 months ago

@ajaichemmanam we haven't been able to reproduce this issue unfortunately. Let us know if there is a minimal configuration that works for you, and we can try to narrow down what's causing the issue.

ajaichemmanam commented 11 months ago

```
cluster_name: default

docker:
    image: rayproject/ray:2.8.0-py311-gpu
    head_image: str
    worker_image: rayproject/ray:2.8.0-py311
    container_name: "ray_container"
    pull_before_run: True
    run_options:   # Extra options to pass into "docker run"
        - --ulimit nofile=65536:65536

provider:
    type: local
    head_ip: 164.52.204.242
    worker_ips: [216.48.179.215]

auth:
    ssh_user: user
    ssh_private_key: ~/.ssh/id_rsa

upscaling_speed: 1.0
idle_timeout_minutes: 30

file_mounts: {
    "/app/requirements.txt": "/Users/ajaichemmanam/Downloads/ray/requirements.txt",
    "~/.ssh/id_rsa": "/Users/ajaichemmanam/.ssh/id_rsa",
    # some more file mounts
}

cluster_synced_files: []

file_mounts_sync_continuously: False

rsync_exclude:

rsync_filter:

initialization_commands: []

setup_commands:

head_setup_commands: []

worker_setup_commands:

head_start_ray_commands:

worker_start_ray_commands:
```

ajaichemmanam commented 11 months ago

The 2 systems are cloud instances with all ports open (since no information was available on which ports need to be exposed for Ray to communicate). The head starts and works as expected. The workers sometimes get connected, sometimes they don't.

ajaichemmanam commented 11 months ago

@architkulkarni I was observing the behaviour further for the past week. It seems that ray is not properly stopped on the worker and the container is not exited when we do 'ray down config.yaml'.

In that case, when we do ray up the second time, the node updater gets stuck (showing the uninitialized state in the ray monitor logs and the launching state in ray status and the dashboard).

  1. Do ray up config.yaml.
  2. Worker nodes connect to the head node.
  3. Before doing ray down config.yaml, log in to the worker container and do ray stop manually. Then exit the container.
  4. Do ray down config.yaml. The head node shuts down.
  5. Do ray up again; the workers will connect to the head node properly.

If we don't do step 3, the next time the workers won't get connected and get stuck in the launching/uninitialized state.
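
In shell terms, the workaround is roughly (worker IP, SSH user and container name are the values from my config above):

```
# before tearing the cluster down, stop ray inside the worker's container by hand
ssh user@216.48.179.215 'docker exec ray_container ray stop'

# now the usual teardown/relaunch works again
ray down config.yaml
ray up config.yaml
```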

I think we need a feature like 'worker_stop_commands', similar to 'worker_start_ray_commands' in config.yaml, which would help properly shut down / clean up the nodes before they are shut down. Let me know your thoughts on this.
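
Just to illustrate the idea, something like this in the cluster YAML (note that worker_stop_commands does not exist in Ray today; it is the proposed addition):

```
# hypothetical field -- not an existing Ray option, this is the proposal
worker_stop_commands:
    - ray stop
    - docker stop ray_container   # clean up the container so the next `ray up` starts fresh
```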

jmakov commented 11 months ago

I think I had a similar debate with ray devs before - it was about not cleaning up state after shutdown. The answer was that they primarily support containers, and in that context, once you tell ray to shut down, the containers are simply destroyed (so nothing needs to be cleaned up). Which of course causes problems when you're running on-prem without containers.

ajaichemmanam commented 11 months ago

In my case, I'm using containers. But even then, ray seems to copy the required files to the worker containers by mounting them from some /tmp/ path; any redundant files there might also cause issues. There is also a chance that the ray process doesn't exit in time (for which a --force and a --grace-period flag are provided by ray in its stop command).

If the worker's ray process fails to exit, the container may not get destroyed.

But as @jmakov said, there should be a proper cleanup mechanism, whether it's inside containers or not.

architkulkarni commented 11 months ago

Thanks for the additional info and the discussion. We'll see if we can reproduce this using your latest steps and try to find the root cause. We should be able to fix it internally without needing to expose a new API worker_stop_commands, but if there are enough use cases for it we can consider adding the new API as well.

rajeshitshoulders commented 11 months ago

Here are my findings. I've been working with the Ray cluster for over 2 months on-prem with bare-metal servers.

First I had the issue with version 2.7.0; after a lot of trial & error, I found that upgrading to 2.8.0 resolved the uninitialized issue for a Ray cluster with 1 head and 1 worker node. We tested the autoscaler, hyperparameter tuning and Ray Serve, and everything was fine.

But now we wanted to test multi-node (2 or more worker nodes). When I start a cluster with 2 or more worker nodes, it hardly ever spins up with all nodes on the first try. Once I was able to start the cluster with 3 worker nodes, but the raylet died on one of the worker nodes. So my point is that a ray local-provider cluster with 2 or more worker nodes is not reliable, and most of the time the worker nodes are "uninitialized".

Env: local-provider cluster; nodes: 1 head, 3 workers; Docker image: rayproject/ray-ml:2.8.1; network: all 4 servers on the same switch and subnet.

Cluster launcher env: conda; Ray 2.8.1; Python 3.9.18.

I also tested the steps suggested by @ajaichemmanam; they don't work for me. I tweaked the cluster state file in /tmp/ray, changing the worker node state from terminated to "up-to-down", then tore down the cluster and relaunched; that works some of the time, but not with the autoscaler.

Please let me know if you need any additional information.

rajeshitshoulders commented 10 months ago

@architkulkarni - would you be able to provide a fix or insight into the issue? We are still unable to do multi-node training using Ray to conclude our POC.

flyingfalling commented 9 months ago

I have this issue as well: it hangs while initializing the workers. Even the very first setup commands never get run on the workers. I have tried versions from ray 2.6.0 (didn't try earlier) up to the nightly, including the current pip release.

I have not changed a single thing, I simply downloaded the example-full.yaml file from https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/local/example-full.yaml

I simply added the head and worker IPs. There is no firewall running. Ubuntu 22.04.

The same thing happens if I delete the docker section.

Starting the cluster manually works (i.e. running ray start --head with a port on the head, and ray start with the head's address on the workers).
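
Concretely, the manual sequence that works is roughly this (sketch, using the default port from example-full.yaml; HEAD_IP is a placeholder):

```
# on the head node
ray stop
ray start --head --port=6379

# on each worker node
ray stop
ray start --address=$HEAD_IP:6379
```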

flyingfalling commented 9 months ago

The only weird thing I noticed is an option in the head start command (where it says YOU DO NOT NEED TO CHANGE THIS):

--autoscaling-config=~/ray_bootstrap_config.yaml

It looks like it assumes there is a file named "ray_bootstrap_config.yaml" in my home directory? This is super weird: why would that file exist and what would it contain? Is this supposed to be changed to point to the autostart yaml like example-full.yaml?

ajaichemmanam commented 9 months ago

Any updates @architkulkarni

ajaichemmanam commented 8 months ago

any update?

jacksonjacobs1 commented 7 months ago

@flyingfalling ray_bootstrap_config.yaml is the configuration file you specify at ray up. When you run ray up, the cluster launcher rsyncs that file to the head and worker nodes under that new name. The same goes for ray_bootstrap_key.pem.
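
So on a launched head node both files should be present; a quick way to check (sketch, run against whatever config was passed to ray up):

```
# from the machine where `ray up` was run: list the bootstrap files on the head node
ray exec example-full.yaml 'ls -l ~/ray_bootstrap_config.yaml ~/ray_bootstrap_key.pem'
```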

jacksonjacobs1 commented 7 months ago

Would love to see an update on this soon!

@ajaichemmanam 's workaround also worked for me.

nilsmelchert commented 7 months ago

Same issue here. Any updates on this?

Would love to see this fixed. This makes it almost impossible to use ray on-premise without k8s.

MatteoCorvi commented 5 months ago

Worker nodes are almost always stuck as launching/uninitialized, or there's no cluster status at all. The only way a recent version (2.22) seems to work for me is to use a conda env with an old version of ray (2.3) and then pip install -U ray==2.22. 100% success creating a working cluster on-prem so far. New dashboard and logging, plus the cluster seems more stable, so I assume the improvements from the new versions went through.

jacksonjacobs1 commented 5 months ago

Hi @MatteoCorvi,

Glad to hear you were able to get this working, but I'm a little confused about your solution. How is this different from simply installing ray version 2.22?

MatteoCorvi commented 5 months ago

Hi @jacksonjacobs1, not sure, but aside from ray not much else was changed, if I recall, so just updating might have kept old versions of the dependencies that don't cause issues.

jacksonjacobs1 commented 5 months ago

Interesting, thanks.

It would be fantastic if a Ray dev from the cluster team could comment on why newer versions of ray seem to break on-prem cluster launching & cleanup.

@anyscalesam What would be your recommendation for resolving this issue?

Tipmethewink commented 3 months ago

I'm running ray on AWS EC2 instances with the same issue. ray up ... launches the head node, but there are no further logs (nothing about setting up nodes) and the head node sits in uninitialized status; eventually ray up times out and everything shuts down. If I commented out file_mounts, the cluster came up fine. Which led me to realise ray doesn't use rsync over ssh (my assumption); it's using the default 873 port, which I hadn't opened (it's not documented here). As soon as I opened 873, it all sprang to life.
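
If anyone wants to check the same thing on their own cluster, here's a quick connectivity test (sketch, assuming netcat is installed; HEAD_IP/WORKER_IP are placeholders, 6379 is the GCS port from the configs above and 873 is the standard rsync daemon port):

```
# from a worker node: can the head's GCS port be reached?
nc -vz HEAD_IP 6379

# between nodes: is the rsync daemon port reachable? (the port I had to open)
nc -vz WORKER_IP 873
```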

jacksonjacobs1 commented 3 months ago

Hi @Tipmethewink, are you using existing EC2 instances (equivalent to an on-prem cluster) or using ray cluster launcher to provision new EC2 instances?

Tipmethewink commented 3 months ago

I'm using the cluster launcher: ray up cluster.yaml.


jyakaranda commented 3 months ago

I ran into this same painful issue today; after digging through the code and the logs from the ray dashboard, I finally got my worker node started. I'm not sure whether this will solve your problem, but I still want to share my debugging process.

  1. If you can start the head node via ray up cluster.yaml, check monitor.log and monitor.out in the dashboard at http://127.0.0.1:8265/#/logs (forwarded by ray dashboard cluster.yaml); sometimes these logs tell you whether the worker node is starting or hanging. In my case, the head node was hanging on a simple SSH issue.


  2. The SSH hanging issue is tricky. In my case it was because ray uses the same auth for the head node and all worker nodes, but I hadn't created the same user on the worker node as on the head node. After creating the same user on the worker node and uncommenting ssh_private_key, the worker node could finally be SSHed into and started from the head node.

  3. As a former comment mentions, if the worker node didn't stop the container properly, the head node cannot start the worker node properly either, so you might need to docker stop RAY_CONTAINER_NAME manually before ray up.

Hope these findings help; a rough sketch of both checks is below.
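
Something like this, with myuser/WORKER_IP/ray_container standing in for your own ssh_user, worker address and container_name:

```
# 1. the same SSH user must exist on the worker as on the head, and key-based,
#    non-interactive login must work (a password prompt here = the hang)
ssh -o BatchMode=yes -i ~/.ssh/id_rsa myuser@WORKER_IP 'echo ssh ok'

# 2. if an old Ray container is still around from a previous run, stop it
#    before the next `ray up`
ssh myuser@WORKER_IP 'docker stop ray_container'
```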

olly-writes-code commented 3 weeks ago

Hey folks, I ran into a similar issue when trying to set up an "On Prem" 1 click cluster via Lambda Labs.

I could start the cluster successfully when not using a docker image. But as soon as I switched to the docker image, I ran into the uninitialized issue.

I would get something like

poetry run ray status
======== Autoscaler status: 2024-10-16 21:28:29.027359 ========
Node status
---------------------------------------------------------------
Active:
 1 local.cluster.node
Pending:
 scrubbed_ip: local.cluster.node, uninitialized
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 0.0/8.0 CPU
 0B/18.61GiB memory
 0B/9.31GiB object_store_memory

Demands:
 (no resource demands)

Here's the config.yaml I was using.

cluster_name: test-cluster

upscaling_speed: 1.0

docker:
  container_name: basic-ray-ml-image
  image: rayproject/ray-ml:latest-gpu
  pull_before_run: true

provider:
 type: local
 head_ip: scrubbed_ip
 worker_ips:
  - scrubbed_ip

auth:
 ssh_user: ubuntu
 ssh_private_key: ~/.ssh/keypair

min_workers: 1
max_workers: 1

setup_commands:
 - pip install ray[default]

head_start_ray_commands:
 - ray stop
 - ray start --head --port=6379 --autoscaling-config=~/ray_bootstrap_config.yaml --dashboard-host=0.0.0.0

worker_start_ray_commands:
 - ray stop
 - ray start --address=$RAY_HEAD_IP:6379

I managed to fix this by

  1. Manually rebooting the node that wouldn't initialize.
  2. I noticed when SSHing into that node that docker ps would return "permission denied while trying to connect to the Docker daemon socket at unix:///var/run/docker.sock". I fixed this by running sudo usermod -aG docker $USER, exiting the machine and then SSHing in again (see the sketch after this list). This might be a Lambda Labs thing.
  3. Re-running ray up from the head node
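
In other words (sketch; run on the stuck worker after SSHing in):

```
sudo usermod -aG docker $USER   # allow this user to talk to the Docker daemon
exit                            # log out so the new group membership takes effect

# SSH back in and confirm Docker is reachable without sudo
docker ps

# then re-run the launcher
ray up config.yaml
```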

Maybe this helps some people!

I feel like this stems from poor logging / error reporting from the other nodes.

olly-writes-code commented 3 weeks ago

Additionally, I don't see any logging or log file.

Even though the instruction from poetry run ray monitor my_cluster.yaml says to find the logs at

==> /tmp/ray/session_latest/logs/monitor.out <==

I don't see such a file on any of the nodes:

cat /tmp/ray/session_latest/logs/monitor.out
cat: /tmp/ray/session_latest/logs/monitor.out: No such file or directory