ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Ray Cluster] Ray cluster stuck after using ray up #38718

Open · JacksonCakes opened this issue 10 months ago

JacksonCakes commented 10 months ago

What happened + What you expected to happen

When I run ray up -y cluster_config.yml --no-config-cache, it seems like the cluster has been set up successfully, based on the output below.

Cluster: default

Checking Local environment settings
2023-08-22 17:25:21,128 INFO node_provider.py:55 -- ClusterState: Loaded cluster state: ['192.168.1.76', '192.168.1.77']
Updating cluster configuration and running full setup.
Cluster Ray runtime will be restarted. Confirm [y/N]: y

Usage stats collection is disabled.

<1/1> Setting up head node
  Prepared bootstrap config
2023-08-22 17:25:23,391 INFO node_provider.py:116 -- ClusterState: Writing cluster state: ['192.168.1.76', '192.168.1.77']
  New status: waiting-for-ssh
  [1/7] Waiting for SSH to become available
    Running `uptime` as a test.
    Fetched IP: 192.168.1.77
Warning: Permanently added '192.168.1.77' (ED25519) to the list of known hosts.
 17:25:24 up 23:57,  3 users,  load average: 0.26, 0.48, 0.77
Shared connection to 192.168.1.77 closed.
    Success.
  [2-6/7] Configuration already up to date, skipping file mounts, initalization and setup commands.
  [7/7] Starting the Ray runtime
Did not find any active Ray processes.
Usage stats collection is disabled.

Local node IP: 192.168.1.77

--------------------
Ray runtime started.
--------------------

Next steps
  To add another node to this Ray cluster, run
    ray start --address='192.168.1.77:4896'

  To connect to this Ray cluster:
    import ray
    ray.init()

  To submit a Ray job using the Ray Jobs CLI:
    RAY_ADDRESS='http://192.168.1.77:8265' ray job submit --working-dir . -- python my_script.py

  See https://docs.ray.io/en/latest/cluster/running-applications/job-submission/index.html 
  for more information on submitting Ray jobs to the Ray cluster.

  To terminate the Ray runtime, run
    ray stop

  To view the status of the cluster, use
    ray status

  To monitor and debug Ray, view the dashboard at 
    192.168.1.77:8265

  If connection to the dashboard fails, check your firewall settings and network configuration.
Shared connection to 192.168.1.77 closed.
2023-08-22 17:25:32,144 INFO node_provider.py:116 -- ClusterState: Writing cluster state: ['192.168.1.76', '192.168.1.77']
  New status: up-to-date

Useful commands
  Monitor autoscaling with
    ray exec /home/jackson/MASS_SEARCH_PROD/cluster_conf.yaml 'tail -n 100 -f /tmp/ray/session_latest/logs/monitor*'
  Connect to a terminal on the cluster head:
    ray attach /home/jackson/MASS_SEARCH_PROD/cluster_conf.yaml
  Get a remote shell to the cluster manually:
    ssh -o IdentitiesOnly=yes -i ~/.ssh/id_rsa jackson@192.168.1.77

But after waiting for more than 10 minutes, it still did not manage to start. I checked with ray status but got the following:

No cluster status. It may take a few seconds for the Ray internal services to start up.

Running ray monitor cluster_config.yaml gets stuck too.

I am not sure if there is a problem with my config file (below), but the same config runs successfully if I delete /tmp/ray/cluster-default.state and re-run ray up.
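
For reference, the workaround currently looks like this (a sketch assuming the default cluster_name: default, which is why the cached state file is /tmp/ray/cluster-default.state):

    # clear the cached local-provider cluster state, then relaunch
    rm /tmp/ray/cluster-default.state
    ray up -y cluster_config.yml --no-config-cache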

Versions / Dependencies

python==3.8.15 ray==2.6.1

Reproduction script

# A unique identifier for the head node and workers of this cluster.
cluster_name: default

provider:
    type: local
    head_ip: 192.168.1.77
    # You may need to supply a public ip for the head node if you need
    # to run `ray up` from outside of the Ray cluster's network
    # (e.g. the cluster is in an AWS VPC and you're starting ray from your laptop)
    # This is useful when debugging the local node provider with cloud VMs.
    # external_head_ip: YOUR_HEAD_PUBLIC_IP
    worker_ips: [192.168.1.76]
    # Optional when running automatic cluster management on prem. If you use a coordinator server,
    # then you can launch multiple autoscaling clusters on the same set of machines, and the coordinator
    # will assign individual nodes to clusters as needed.
    #    coordinator_address: "<host>:<port>"

# How Ray will authenticate with newly launched nodes.
auth:
    ssh_user: jackson
    # You can comment out `ssh_private_key` if the following machines don't need a private key for SSH access to the Ray
    # cluster:
    #   (1) The machine on which `ray up` is executed.
    #   (2) The head node of the Ray cluster.
    #
    # The machine that runs ray up executes SSH commands to set up the Ray head node. The Ray head node subsequently
    # executes SSH commands to set up the Ray worker nodes. When you run ray up, ssh credentials sitting on the ray up
    # machine are copied to the head node -- internally, the ssh key is added to the list of file mounts to rsync to head node.
    ssh_private_key: ~/.ssh/id_rsa

# The minimum number of worker nodes to launch in addition to the head
# node. This number should be >= 0.
# Typically, min_workers == max_workers == len(worker_ips).
# This field is optional.
min_workers: 1

# The maximum number of worker nodes to launch in addition to the head node.
# This takes precedence over min_workers.
# Typically, min_workers == max_workers == len(worker_ips).
# This field is optional.
max_workers: 1
# The default behavior for manually managed clusters is
# min_workers == max_workers == len(worker_ips),
# meaning that Ray is started on all available nodes of the cluster.
# For automatically managed clusters, max_workers is required and min_workers defaults to 0.

# The autoscaler will scale up the cluster faster with higher upscaling speed.
# E.g., if the task requires adding more nodes then autoscaler will gradually
# scale up the cluster in chunks of upscaling_speed*currently_running_nodes.
# This number should be > 0.
upscaling_speed: 1.0

idle_timeout_minutes: 5

# Files or directories to copy to the head and worker nodes. The format is a
# dictionary from REMOTE_PATH: LOCAL_PATH. E.g. you could save your conda env to an environment.yaml file, mount
# that directory to all nodes and call `conda -n my_env -f /path1/on/remote/machine/environment.yaml`. In this
# example paths on all nodes must be the same (so that conda can be called always with the same argument)
file_mounts: {
#    "/path1/on/remote/machine": "/path1/on/local/machine",
#    "/path2/on/remote/machine": "/path2/on/local/machine",
}

# Files or directories to copy from the head node to the worker nodes. The format is a
# list of paths. The same path on the head node will be copied to the worker node.
# This behavior is a subset of the file_mounts behavior. In the vast majority of cases
# you should just use file_mounts. Only use this if you know what you're doing!
cluster_synced_files: []

# Whether changes to directories in file_mounts or cluster_synced_files in the head node
# should sync to the worker node continuously
file_mounts_sync_continuously: True

# Patterns for files to exclude when running rsync up or rsync down
rsync_exclude:
    - "**/.git"
    - "**/.git/**"

# Pattern files to use for filtering out files when running rsync up or rsync down. The file is searched for
# in the source directory and recursively through all subdirectories. For example, if .gitignore is provided
# as a value, the behavior will match git's behavior for finding and using .gitignore files.
rsync_filter:
    - ".gitignore"

# List of commands that will be run before `setup_commands`. If docker is
# enabled, these commands will run outside the container and before docker
# is setup.
initialization_commands: []

# List of shell commands to run to set up each node.
setup_commands: 
    - cd Next_Word_Autocomplete && source autocomplete_ray/bin/activate && pip install -r requirements.txt
    # If we have e.g. conda dependencies stored in "/path1/on/local/machine/environment.yaml", we can prepare the
    # work environment on each worker by:
    #   1. making sure each worker has access to this file i.e. see the `file_mounts` section
    #   2. adding a command here that creates a new conda environment on each node or if the environment already exists,
    #     it updates it:
    #      conda env create -q -n my_venv -f /path1/on/local/machine/environment.yaml || conda env update -q -n my_venv -f /path1/on/local/machine/environment.yaml
    #
    # Ray developers:
    # you probably want to create a Docker image that
    # has your Ray repo pre-cloned. Then, you can replace the pip installs
    # below with a git checkout <your_sha> (and possibly a recompile).
    # To run the nightly version of ray (as opposed to the latest), either use a rayproject docker image
    # that has the "nightly" (e.g. "rayproject/ray-ml:nightly-gpu") or uncomment the following line:
    # - pip install -U "ray[default] @ https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-3.0.0.dev0-cp37-cp37m-manylinux2014_x86_64.whl"

# Custom commands that will be run on the head node after common setup.
head_setup_commands: []

# Custom commands that will be run on worker nodes after common setup.
worker_setup_commands: [] 

# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
  # If we have e.g. conda dependencies, we could create on each node a conda environment (see `setup_commands` section).
  # In that case we'd have to activate that env on each node before running `ray`:
  # - conda activate my_venv && ray stop
  # - conda activate my_venv && ulimit -c unlimited && ray start --head --port=6379 --autoscaling-config=~/ray_bootstrap_config.yaml
    - cd Next_Word_Autocomplete && source autocomplete_ray/bin/activate && ulimit -c unlimited && ray stop && ray start --head --port=4896 --dashboard-host 192.168.1.77 --autoscaling-config=~/ray_bootstrap_config.yaml

# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
  # If we have e.g. conda dependencies, we could create on each node a conda environment (see `setup_commands` section).
  # In that case we'd have to activate that env on each node before running `ray`:
  # - conda activate my_venv && ray stop
  # - ray start --address=$RAY_HEAD_IP:6379
    - cd Next_Word_Autocomplete && source autocomplete_ray/bin/activate && ray stop && ray start --address=$RAY_HEAD_IP:4896

Issue Severity

High: It blocks me from completing my task.

jmakov commented 9 months ago

My issue might be related: https://github.com/ray-project/ray/issues/39565

architkulkarni commented 8 months ago

@JacksonCakes does the issue happen consistently, and does it happen with Ray 2.7.1? If it happens again, can you send a zip of all the logs from the head node (/tmp/ray/session_latest/logs)?
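
If it helps, here is one way to collect them (a sketch; it assumes tar is available on the head node and reuses the SSH user, key, and head IP from the config above):

    # on the machine that ran `ray up`: open a shell on the head node
    ray attach cluster_config.yml

    # inside the head node: archive the current session logs, then exit
    tar -czf /tmp/head-node-logs.tar.gz -C /tmp/ray/session_latest logs
    exit

    # back on the launcher machine: copy the archive down
    scp -i ~/.ssh/id_rsa jackson@192.168.1.77:/tmp/head-node-logs.tar.gz .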

jaanphare commented 4 months ago

Also having this issue.

ajaichemmanam commented 3 months ago

any update?

jaanphare commented 3 months ago

I debugged this issue by SSH'ing into the worker nodes and learned that the default volume size in the cluster launcher is too low for some common Docker container images (like nvcr.io/nvidia/pytorch:24.02-py3 with a few additional dependencies).

Similarly, the default timeout may be too low, so worker nodes can get stuck in an endless loop of being initialized and then killed once they hit the timeout, while also running out of disk space due to the default volume size.
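
For anyone hitting the disk-size variant of this on a cloud provider, one mitigation is to raise the volume size on the node type. A rough sketch, assuming an AWS-style node_config (field names follow Ray's AWS example config; the instance type and sizes are placeholders, adjust for your setup):

    available_node_types:
        ray.worker.default:
            node_config:
                InstanceType: g4dn.xlarge       # hypothetical example instance type
                BlockDeviceMappings:
                    - DeviceName: /dev/sda1
                      Ebs:
                          VolumeSize: 300       # GiB; large enough for big GPU container images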

shuiqingliu commented 1 month ago

same issue

anyscalesam commented 6 days ago

@JacksonCakes @shuiqingliu @ajaichemmanam did upping the volume size as well as extending the default timeout mitigate the problem?

@jaanphare do you think we should up the defaults a bit? If so what size/timeout secs would you recommend?

shuiqingliu commented 6 days ago

> @JacksonCakes @shuiqingliu @ajaichemmanam did upping the volume size as well as extending the default timeout mitigate the problem?
>
> @jaanphare do you think we should up the defaults a bit? If so what size/timeout secs would you recommend?

I will give it a try.