shuiqingliu opened 1 month ago
Could any maintainers help investigate this issue? If everyone is too busy to look into it deeply, could you please share how to investigate issues like this? I will try to debug it myself.
Hi @shuiqingliu,
Have you tried to just run ray up once, wait for a while, and then run ray status to see if worker nodes can be created? As the log suggests, it might take a few seconds for the Ray internal services to start.
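In shell form, the suggested sequence would be roughly the following (a sketch only; the config filename is the one from the reproduction section below, and the wait time is arbitrary):

# start (or update) the cluster and give the head node's internal services time to come up
ray up --no-config-cache -y -v ./example-full.yaml
sleep 60

# check cluster membership from the head node
ray exec ./example-full.yaml 'ray status'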
Yes, I waited for a few minutes, but it was still in the "no cluster" state.
The same problem has occurred again. I've been waiting for 10 minutes.
I have been experiencing a similar issue here, also when I use ray up with a config that has multiple worker nodes. I can tell ray up (or the "cluster launcher") is not properly preparing the worker nodes somehow...
What happened + What you expected to happen
Bug: When starting a Ray cluster with one head node and three worker nodes, ray status keeps indicating "no cluster". The startup steps have to be repeated several times until the cluster comes up, and it is unclear how many restarts are needed to succeed.
As illustrated in the image below, I kept executing ray up and it eventually started, even though nothing was modified between attempts.
Expectation: Successfully deploy the cluster on the first attempt.
Useful info / observations:
1. The Ray head node Docker container always starts properly. The ray start command inside the container also executes correctly, since ray status can be run.
2. ray status shows that the worker nodes remain in the "launching" state, or reports that there is no Ray cluster.
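To dig further, a sketch of how the autoscaler / cluster launcher output can be inspected (assuming the default Ray log locations on the head node and the config file from the reproduction section below):

# tail the autoscaler output via the cluster config
ray monitor ./example-full.yaml

# or inspect the monitor logs directly on the head node
ray exec ./example-full.yaml 'tail -n 200 /tmp/ray/session_latest/logs/monitor*'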
Versions / Dependencies
ray 2.23.0
Reproduction script
Docker image: rayproject/ray-ml
This is my config for ray up, run as: ray up --no-config-cache -y -v ./example-full.yaml
The IP addresses are hidden, but the original IPs are all public and can access each other via SSH.

# Running Ray in Docker images is optional (this docker section can be commented out).
# This executes all commands on all nodes in the docker container,
# and opens all the necessary ports to support the Ray cluster.
# Empty string means disabled. Assumes Docker is installed.
docker:
    image: "rayproject/ray-ml:latest-gpu" # You can change this to latest-cpu if you don't need GPU support and want a faster startup
    # image: "rayproject/ray:latest-py39-gpu" # You can change this to latest-cpu if you don't need GPU support and want a faster startup

provider:
    type: local
    head_ip: x.x.x.1
    # You may need to supply a public ip for the head node if you need
    # to run `ray up` from outside of the Ray cluster's network.

# How Ray will authenticate with newly launched nodes.
auth:
    ssh_user: xxxx
    # You can comment out `ssh_private_key` if the following machines don't need a private key for SSH access to the Ray cluster.

# The minimum number of worker nodes to launch in addition to the head
# node. This number should be >= 0.
# Typically, min_workers == max_workers == len(worker_ips).
# This field is optional.
min_workers: 3
# The maximum number of worker nodes to launch in addition to the head node.
# This takes precedence over min_workers.
# Typically, min_workers == max_workers == len(worker_ips).
# This field is optional.
max_workers: 3
# The default behavior for manually managed clusters is
# min_workers == max_workers == len(worker_ips),
# meaning that Ray is started on all available nodes of the cluster.
# For automatically managed clusters, max_workers is required and min_workers defaults to 0.

# The autoscaler will scale up the cluster faster with higher upscaling speed.
# E.g., if the task requires adding more nodes then autoscaler will gradually
# scale up the cluster in chunks of upscaling_speed*currently_running_nodes.
# This number should be > 0.
upscaling_speed: 10.0
idle_timeout_minutes: 5
# Files or directories to copy to the head and worker nodes. The format is a
# dictionary from REMOTE_PATH: LOCAL_PATH. E.g. you could save your conda env to an environment.yaml file, mount
# that directory to all nodes and call
# `conda -n my_env -f /path1/on/remote/machine/environment.yaml`. In this
# example paths on all nodes must be the same (so that conda can be called always with the same argument)
file_mounts: {
    "~/.ssh/": "/Users/qingliu/PycharmProjects/pdftrans/src/surya/ssh",
#    "/path2/on/remote/machine": "/path2/on/local/machine",
}
# Files or directories to copy from the head node to the worker nodes. The format is a
# list of paths. The same path on the head node will be copied to the worker node.
# This behavior is a subset of the file_mounts behavior. In the vast majority of cases
# you should just use file_mounts. Only use this if you know what you're doing!
cluster_synced_files: []

# Whether changes to directories in file_mounts or cluster_synced_files in the head node
# should sync to the worker node continuously
file_mounts_sync_continuously: False

# Patterns for files to exclude when running rsync up or rsync down
rsync_exclude:

# Pattern files to use for filtering out files when running rsync up or rsync down. The file is searched for
# in the source directory and recursively through all subdirectories. For example, if .gitignore is provided
# as a value, the behavior will match git's behavior for finding and using .gitignore files.
rsync_filter:
# List of commands that will be run before `setup_commands`. If docker is
# enabled, these commands will run outside the container and before docker
# is setup.
initialization_commands: []

# List of shell commands to run to set up each node.
setup_commands: []
    # If we have e.g. conda dependencies stored in "/path1/on/local/machine/environment.yaml", we can prepare the
    # work environment on each node by adding a command here that creates or updates a conda environment from that file.

# Custom commands that will be run on the head node after common setup.
head_setup_commands: []

# Custom commands that will be run on worker nodes after common setup.
worker_setup_commands: []
# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
    # If we have e.g. conda dependencies, we could create on each node a conda environment (see the
    # `setup_commands` section). In that case we'd have to activate that env on each node before running `ray`:
    - conda activate my_venv && ray stop
    - conda activate my_venv && ulimit -c unlimited && ray start --head --port=6379 --autoscaling-config=~/ray_bootstrap_config.yaml
# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
    # If we have e.g. conda dependencies, we could create on each node a conda environment (see the
    # `setup_commands` section). In that case we'd have to activate that env on each node before running `ray`:
    - conda activate my_venv && ray stop
    - ray start --address=$RAY_HEAD_IP:6379
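One manual check I can try next (a sketch only; x.x.x.2 is a placeholder for one of the hidden worker IPs, and the user/image come from the config above) is to run the worker-side steps by hand and see where they fail:

# on the local machine: SSH to one of the worker nodes (placeholder IP)
ssh xxxx@x.x.x.2

# on the worker: confirm the image can be pulled and a container can start
docker pull rayproject/ray-ml:latest-gpu
docker run --rm -it rayproject/ray-ml:latest-gpu ray --version

# inside a container on the worker: try joining the head manually (x.x.x.1 is the head_ip above)
ray stop
ray start --address=x.x.x.1:6379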