jmakov opened this issue 1 year ago
cc @rickyyx can you follow up with the investigation?
type: local
head_ip: 192.168.0.101
# You may need to supply a public ip for the head node if you need
# to run `ray up` from outside of the Ray cluster's network
# (e.g. the cluster is in an AWS VPC and you're starting ray from your laptop)
# This is useful when debugging the local node provider with cloud VMs.
# external_head_ip: YOUR_HEAD_PUBLIC_IP
worker_ips:
- 192.168.0.106
- 192.168.0.107
- 192.168.0.108
- 192.168.0.110
Can you tell us what exactly this is for?
Hey @jmakov - will you be able to get any monitor.*
logs generated? That would be helpful to debug.
Didn't see anything exciting happening there, only monitor.log
has some entries:
Running everything manually works. Would be nice to have a working cluster launcher for on prem clusters.
+1, same issue for me, even with systems on a 3rd-party cloud (not AWS/GCS/Azure). I opened all ports; sometimes it gets connected, sometimes it shows uninitialized.
cc @gvspraveen could someone from the cluster team help take a look? I believe this is more relevant to the cluster launcher as of now rather than the actual autoscaling logic, since "running everything manually works".
@rickyyx not to mention manually starting ray not working and the cluster launcher not working. I wonder how ray works at all for anybody. As someone who has used ray for more than a year, every other release breaks a core part.
cc @anyscalesam can you triage this issue with @gvspraveen?
I'm able to reproduce this on AWS [see below, it was just a port issue on my end] using pip install "ray[default]"==2.7.0 in the setup commands and the latest ray master on the client side for the cluster launcher.
@jmakov do you happen to remember if this was working for you on a previous version of Ray, and if so which one?
The cluster launcher worked for me for the last 2+ years using a local cluster (without Docker, just a conda env). I think it was 2.6.0 before I made the mistake of upgrading, if I remember correctly. I think I'll just start writing my own tests and run them before every upgrade.
2023-10-09 11:46:28,208 INFO node_provider.py:53 -- ClusterState: Loaded cluster state: ['216.48.179.215', '164.52.201.70']
Fetched IP: 164.52.201.70
Warning: Permanently added '164.52.201.70' (ED25519) to the list of known hosts.
==> /tmp/ray/session_latest/logs/monitor.err <==
==> /tmp/ray/session_latest/logs/monitor.log <==
2023-10-08 23:13:33,485 INFO monitor.py:690 -- Starting monitor using ray installation: /home/ray/anaconda3/lib/python3.11/site-packages/ray/__init__.py
2023-10-08 23:13:33,485 INFO monitor.py:691 -- Ray version: 2.7.1
2023-10-08 23:13:33,485 INFO monitor.py:692 -- Ray commit: 9f07c12615958c3af3760604f6dcacc4b3758a47
2023-10-08 23:13:33,486 INFO monitor.py:693 -- Monitor started with command: ['/home/ray/anaconda3/lib/python3.11/site-packages/ray/autoscaler/_private/monitor.py', '--logs-dir=/tmp/ray/session_2023-10-08_23-13-32_012785_2484/logs', '--logging-rotate-bytes=536870912', '--logging-rotate-backup-count=5', '--gcs-address=164.52.201.70:6379', '--autoscaling-config=/home/ray/ray_bootstrap_config.yaml', '--monitor-ip=164.52.201.70']
2023-10-08 23:13:33,489 INFO monitor.py:159 -- session_name: session_2023-10-08_23-13-32_012785_2484
2023-10-08 23:13:33,490 INFO monitor.py:191 -- Starting autoscaler metrics server on port 44217
2023-10-08 23:13:33,491 INFO monitor.py:216 -- Monitor: Started
2023-10-08 23:13:33,506 INFO node_provider.py:53 -- ClusterState: Loaded cluster state: []
2023-10-08 23:13:33,507 INFO node_provider.py:114 -- ClusterState: Writing cluster state: ['216.48.179.215', '164.52.201.70']
2023-10-08 23:13:33,507 INFO autoscaler.py:274 -- disable_node_updaters:False
2023-10-08 23:13:33,507 INFO autoscaler.py:282 -- disable_launch_config_check:False
2023-10-08 23:13:33,507 INFO autoscaler.py:294 -- foreground_node_launch:False
2023-10-08 23:13:33,507 INFO autoscaler.py:304 -- worker_liveness_check:True
2023-10-08 23:13:33,507 INFO autoscaler.py:312 -- worker_rpc_drain:True
2023-10-08 23:13:33,508 INFO autoscaler.py:362 -- StandardAutoscaler: {'cluster_name': 'default', 'auth': {'ssh_user': 'user', 'ssh_private_key': '~/ray_bootstrap_key.pem'}, 'upscaling_speed': 1.0, 'idle_timeout_minutes': 30, 'docker': {'image': 'rayproject/ray:2.7.1.9f07c1-py311-gpu', 'worker_image': 'rayproject/ray:2.7.1.9f07c1-py311-gpu', 'container_name': 'ray_container', 'pull_before_run': True, 'run_options': ['--ulimit nofile=65536:65536']}, 'initialization_commands': [], 'setup_commands': ['sudo apt-get update', 'sudo apt-get install gcc ffmpeg libsm6 libxext6 -y', 'pip install -r "/app/requirements-gpu.txt"'], 'head_setup_commands': ['sudo apt-get update', 'sudo apt-get install gcc ffmpeg libsm6 libxext6 -y', 'pip install -r "/app/requirements-gpu.txt"'], 'worker_setup_commands': ['sudo apt-get update', 'sudo apt-get install gcc ffmpeg libsm6 libxext6 -y', 'pip install -r "/app/requirements-gpu.txt"'], 'head_start_ray_commands': ['ray stop', 'ulimit -c unlimited && export RAY_health_check_timeout_ms=30000 && ray start --head --node-ip-address=164.52.201.70 --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml --dashboard-host=0.0.0.0 --disable-usage-stats --log-color=auto -v'], 'worker_start_ray_commands': ['ray stop', 'ray start --address=164.52.201.70:6379 --object-manager-port=8076'], 'file_mounts': {'~/.ssh/id_rsa': '/home/ray/.ssh/id_rsa', '/app/requirements-gpu.txt': '/app/requirements-gpu.txt'}, 'cluster_synced_files': [], 'file_mounts_sync_continuously': False, 'rsync_exclude': ['**/.git', '**/.git/**'], 'rsync_filter': ['.gitignore'], 'provider': {'type': 'local', 'head_ip': '164.52.201.70', 'worker_ips': ['216.48.179.215']}, 'available_node_types': {'local.cluster.node': {'node_config': {}, 'resources': {}, 'min_workers': 1, 'max_workers': 1}}, 'head_node_type': 'local.cluster.node', 'max_workers': 1, 'no_restart': False}
2023-10-08 23:13:33,509 INFO monitor.py:385 -- Autoscaler has not yet received load metrics. Waiting.
2023-10-08 23:13:38,522 INFO autoscaler.py:141 -- The autoscaler took 0.0 seconds to fetch the list of non-terminated nodes.
2023-10-08 23:13:38,522 INFO autoscaler.py:421 --
======== Autoscaler status: 2023-10-08 23:13:38.522726 ========
Node status
---------------------------------------------------------------
Healthy:
1 local.cluster.node
Pending:
(no pending nodes)
Recent failures:
(no failures)
Resources
---------------------------------------------------------------
Usage:
0.0/12.0 CPU
0.0/1.0 GPU
0B/28.57GiB memory
0B/14.29GiB object_store_memory
Demands:
(no resource demands)
2023-10-08 23:13:38,524 INFO autoscaler.py:1379 -- StandardAutoscaler: Queue 1 new nodes for launch
2023-10-08 23:13:38,524 INFO autoscaler.py:464 -- The autoscaler took 0.002 seconds to complete the update iteration.
2023-10-08 23:13:38,524 INFO node_launcher.py:177 -- NodeLauncher0: Got 1 nodes to launch.
2023-10-08 23:13:38,525 INFO monitor.py:415 -- :event_summary:Resized to 12 CPUs, 1 GPUs.
2023-10-08 23:13:38,526 INFO node_provider.py:114 -- ClusterState: Writing cluster state: ['216.48.179.215', '164.52.201.70']
2023-10-08 23:13:38,526 INFO node_launcher.py:177 -- NodeLauncher0: Launching 1 nodes, type local.cluster.node.
2023-10-08 23:13:43,534 INFO autoscaler.py:141 -- The autoscaler took 0.0 seconds to fetch the list of non-terminated nodes.
2023-10-08 23:13:43,534 INFO autoscaler.py:421 --
======== Autoscaler status: 2023-10-08 23:13:43.534774 ========
Node status
---------------------------------------------------------------
Healthy:
1 local.cluster.node
Pending:
216.48.179.215: local.cluster.node, uninitialized
Recent failures:
(no failures)
Resources
---------------------------------------------------------------
Usage:
0.0/12.0 CPU
0.0/1.0 GPU
0B/28.57GiB memory
0B/14.29GiB object_store_memory
Demands:
(no resource demands)
2023-10-08 23:13:43,537 INFO autoscaler.py:1326 -- Creating new (spawn_updater) updater thread for node 216.48.179.215.
The above log is from Ray version 2.7.1 (commit 9f07c12615958c3af3760604f6dcacc4b3758a47).
This issue is still present in ray 2.7.1
Let us know if any other details are required
Actually, when I reproduced the issue earlier, I had forgotten to open all the ports. After opening all ports, I wasn't able to reproduce the issue.
@jmakov or @ajaichemmanam if you're able to reproduce the issue and you have time, it would potentially be very helpful if you could amend your YAML file as follows:
worker_start_ray_commands:
- ray stop
- "echo \"Executing: ray start --address=$RAY_HEAD_IP:6379\" >> ray_worker_output.txt"
- ray start --address=$RAY_HEAD_IP:6379 >> ray_worker_output.txt 2>&1
And share the ray_worker_output.txt from the failing worker nodes. (Or modify the commands in any way you see fit, as long as we can see the output of ray start --address=...)
@architkulkarni I've added ulimit -c unlimited && ray start --address=$RAY_HEAD_IP:6379 --disable-usage-stats >> /tmp/ray_worker_output.txt 2>&1 and get:
ls /tmp/ray_worker_output.txt
ls: cannot access '/tmp/ray_worker_output.txt': No such file or directory
@jmakov Thanks! I think this means the command was never run. I don't want to take up too much of your time with the back-and-forth here, but one thing that might help confirm this and narrow things down is if we add something like "echo setup_command was run >> /tmp/ray_worker_output.txt" as the first item in setup_commands.
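For reference, a sketch of what that might look like in the cluster YAML (the other commands shown are just the ones from the config logged above; yours may differ):

setup_commands:
  - echo "setup_command was run" >> /tmp/ray_worker_output.txt
  - sudo apt-get update
  - pip install -r "/app/requirements-gpu.txt"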
Another mystery is why the worker node 192.168.0.108 was able to join in your monitor.log above, but not the other worker nodes.
I tried it too. However, as I said before, the Docker container on the worker is not even getting started/running. So even putting "echo setup_command was run >> /tmp/ray_worker_output.txt" as the first item in the workers' setup_commands doesn't work.
Healthy:
 1 local.cluster.node
Pending:
 (no pending nodes)
Recent failures:
 (no failures)
Usage:
 0.0/12.0 CPU
 0.0/1.0 GPU
 0B/28.58GiB memory
 0B/14.29GiB object_store_memory
Demands:
 (no resource demands)
2023-10-20 11:56:22,869 INFO autoscaler.py:1379 -- StandardAutoscaler: Queue 1 new nodes for launch
2023-10-20 11:56:22,869 INFO autoscaler.py:464 -- The autoscaler took 0.002 seconds to complete the update iteration.
2023-10-20 11:56:22,869 INFO node_launcher.py:177 -- NodeLauncher0: Got 1 nodes to launch.
==> /tmp/ray/session_latest/logs/monitor.out <==
This might be a related issue: https://github.com/ray-project/ray/issues/38718
@architkulkarni "Another mystery is why the worker node 192.168.0.108 was able to join in your monitor.log"
If it hels, I start ray up cluster.yaml
from 192.168.0.110. 192.168.0.101 is the head node (which can SSH into all other nodes). And I don't use any firewalls.
@architkulkarni "Another mystery is why the worker node 192.168.0.108 was able to join in your monitor.log" If it hels, I start
ray up cluster.yaml
from 192.168.0.110. 192.168.0.108 is the head node (which can SSH into all other nodes). And I don't use any firewalls.
Oh interesting, but in your monitor.log it says two nodes have successfully joined (I think one head and one worker) and the yaml has 108 as a worker node:
head_ip: 192.168.0.101
worker_ips:
- 192.168.0.106
- 192.168.0.107
- 192.168.0.108
- 192.168.0.110
But maybe it's a different run.
@ajaichemmanam thanks for the additional details, it should be helpful for trying to reproduce the issue on our end. How were you able to determine that the docker container didn't start?
Yes, my mistake, 101 is the head node (have updated my prev comment).
I logged into the worker system via SSH and checked whether any containers were running via the command docker ps -a, and couldn't find any related containers running on the worker node.
Perhaps a note - I'm not running containers, but a conda env directly on the nodes.
Any update on this?
@ajaichemmanam we haven't been able to reproduce this issue unfortunately. Let us know if there is a minimal configuration that works for you, and we can try to narrow down what's causing the issue.
cluster_name: default

docker:
  image: rayproject/ray:2.8.0-py311-gpu
  worker_image: rayproject/ray:2.8.0-py311
  container_name: "ray_container"
  pull_before_run: True
  run_options: # Extra options to pass into "docker run"
    - --ulimit nofile=65536:65536

provider:
  type: local
  head_ip: 164.52.204.242
  worker_ips: [216.48.179.215]

auth:
  ssh_user: user
  ssh_private_key: ~/.ssh/id_rsa

upscaling_speed: 1.0
idle_timeout_minutes: 30

file_mounts: {
  "/app/requirements.txt": "/Users/ajaichemmanam/Downloads/ray/requirements.txt",
  "~/.ssh/id_rsa": "/Users/ajaichemmanam/.ssh/id_rsa",
}

cluster_synced_files: []
file_mounts_sync_continuously: False
rsync_exclude:
rsync_filter:
initialization_commands: []
setup_commands:
head_setup_commands: []
worker_setup_commands:
head_start_ray_commands:
worker_start_ray_commands:
The 2 systems are cloud instances with all ports open (since no information was available on which ports need to be exposed for Ray to communicate). The head starts and works as expected. The workers sometimes get connected, sometimes they don't.
@architkulkarni I have been observing the behaviour further for the past week. It seems that ray is not properly stopped on the worker and the container is not exited when we do 'ray down config.yaml'.
In that case, when we do ray up a second time, the node updater gets stuck (showing the uninitialized state in the ray monitor logs and the launching state in ray status & the dashboard).
If we don't do Step 3, the next time the workers won't get connected and get stuck in the launching & uninitialized state.
I think we need a feature like 'worker_stop_commands', similar to 'worker_start_ray_commands', in config.yaml, which would help properly shut down / clean up the nodes before they go down. Let me know your thoughts on this.
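Purely to illustrate the idea, a hypothetical sketch (worker_stop_commands is not an existing cluster-launcher option, and the container name is assumed to match the docker section):

worker_stop_commands:   # hypothetical key, for discussion only
  - ray stop --force
  - docker stop ray_container || true   # assumed container_name

so that ray down would leave the worker hosts clean for the next ray up.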
I think I had a similar debate with ray devs before - it was about not cleaning up the state after shutdown. The answer was they primarily support containers and in that context once you call ray to shutdown, the containers are simply destroyed (so things don't need to be cleaned up). Which of course causes problems when you're running on prem without containers.
In my case, I'm using containers. But even then, Ray seems to copy the required files into the worker containers by mounting them from some /tmp/ path; any redundant files there might also cause issues. There is also a chance that the ray process does not exit in time (which is why ray stop also provides --force and --grace-period flags).
If the worker's ray process fails to exit, the container may not get destroyed.
But as @jmakov said, there should be a proper cleanup mechanism, whether it's inside containers or not.
Thanks for the additional info and the discussion. We'll see if we can reproduce this using your latest steps and try to find the root cause. We should be able to fix it internally without needing to expose a new worker_stop_commands API, but if there are enough use cases for it we can consider adding the new API as well.
Here are my findings: I have been working with a Ray cluster for over 2 months, on-prem with bare-metal servers.
First I had the issue with version 2.7.0; after a lot of trial & error, I found that upgrading to 2.8.0 resolved the uninitialized issue for a Ray cluster with 1 head and 1 worker node. We tested the autoscaler, hyperparameter tuning and Ray Serve, and everything was fine.
But now we wanted to test multi-node (2 or more worker nodes). When I start the cluster with 2 or more worker nodes, it hardly ever spins up with all nodes on the first try. Once I was able to start the cluster with 3 worker nodes, but the raylet died on one of the worker nodes. So my point is that a Ray local-provider cluster with 2 or more worker nodes is not reliable, and most of the time the worker nodes are "uninitialized".
Env: local provider cluster; nodes: 1 head, 3 workers; Docker image: rayproject/ray-ml:2.8.1; network: all 4 servers are on the same switch and subnet.
Cluster launcher env: conda; Ray 2.8.1; Python 3.9.18.
I also tested the step suggested by @ajaichemmanam; it doesn't work for me. I tweaked the cluster state file in /tmp/ray, changing the worker node state from terminated to "up-to-date", then tore down the cluster and relaunched. That works some of the time, but with the autoscaler it doesn't work this way.
Please let me know if you need any additional information.
@architkulkarni - would you be able to provide a fix or insight into the issue? We are still unable to do multi-node training using Ray to conclude our POC.
I have this issue as well. I get the same issue where it hangs while initializing the workers. Even the very first setup commands never get run on the workers. I have tried everything from ray 2.6.0 (didn't try earlier) to the nightly, including the current pip version.
I have not changed a single thing, I simply downloaded the example-full.yaml file from https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/local/example-full.yaml
I simply added the head and worker IPs. There is no firewall running. Ubuntu 22.04.
The same thing happens if I delete the docker section.
Starting the cluster manually works (i.e. running "ray start" with the port on the head and "ray start" with the head IP on the workers).
The only weird thing I noticed is that, e.g., the head start command (where it says YOU DO NOT NEED TO CHANGE THIS) includes an option:
--autoscaling-config=~/ray_bootstrap_config.yaml
It looks like it is assuming there is a file named "ray_bootstrap_config.yaml" in my home directory? This is super weird: why would that file exist, and what would it contain? Is this supposed to be changed to point to the autostart yaml like example-full.yaml?
Any updates @architkulkarni
any update?
@flyingfalling The ray_bootstrap_config.yaml is the configuration file specified at ray up. When you run ray up, the cluster launcher rsyncs that file to the head and worker nodes under the new name. The same goes for ray_bootstrap_key.pem.
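In other words, the default head_start_ray_commands (also visible in the autoscaler log earlier in this thread) reference that synced copy, e.g.:

head_start_ray_commands:
  - ray stop
  - ray start --head --port=6379 --autoscaling-config=~/ray_bootstrap_config.yaml --dashboard-host=0.0.0.0

so ~/ray_bootstrap_config.yaml does not have to exist ahead of time; ray up creates it on the head node from the YAML you passed in.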
Would love to see an update on this soon!
@ajaichemmanam 's workaround also worked for me.
Same issue here. Any updates on this?
Would love to see this fixed. This makes it almost impossible to use ray on-premise without k8s.
Worker nodes almost always get stuck as launching/uninitialized, or there is no cluster status at all. The only way a recent version (2.22) seems to work for me is to use a conda env with an old version of ray (2.3) and then pip install -U ray==2.22. 100% success creating a working cluster on prem so far. New dashboard and logging, plus the cluster seems more stable, so I assume the improvements of newer versions went through.
Hi @MatteoCorvi,
Glad to hear you were able to get this working, but I'm a little confused about your solution. How is this different from simply installing ray version 2.22?
Hi @jacksonjacobs1, not sure, but aside from ray not much else was changed if I recall, so just updating might have kept old versions of the dependencies that don't cause issues.
Interesting, thanks.
It would be fantastic if a Ray dev from the cluster team could comment on why newer versions of ray seem to break on-prem cluster launching & cleanup.
@anyscalesam What would be your recommendation for resolving this issue?
I'm running ray on AWS EC2 instances with the same issue. ray up launches the head node, but there are no further logs (no logging about setting up nodes) and the head node sits in uninitialized status; eventually ray up times out and everything shuts down. If I commented out file_mounts, then the cluster came up fine. Which led me to realise that ray doesn't use rsync over ssh (my assumption); it's using the default 873 port, which I hadn't opened (it's not documented here). As soon as I opened 873, it all sprang to life.
Hi @Tipmethewink, are you using existing EC2 instances (equivalent to an on-prem cluster) or using ray cluster launcher to provision new EC2 instances?
I'm using the cluster launcher: ray up cluster.yaml.
I got this same painful issue today. After retrieving the configs and logs from the Ray dashboard, I finally got my worker node started. I'm not sure if this will solve your problem, but I still want to share my debugging process:
- ray up cluster.yaml, then check monitor.log and monitor.out in the dashboard at http://127.0.0.1:8265/#/logs (forwarded by ray dashboard cluster.yaml). Sometimes these logs tell you whether the worker node is starting or hanging; in my case, the head node was hanging on a simple SSH issue.
- The SSH hanging issue is tricky. In my case, it was because Ray uses the same auth for the head node and all worker nodes, but I hadn't created the same user on the worker node as on the head node. After creating the same user on the worker node and uncommenting ssh_private_key, the worker node could finally be SSHed into and started from the head node.
- As a former comment mentions, if the worker node didn't stop its container properly, the head node still cannot start the worker node properly, so you might need to docker stop RAY_CONTAINER_NAME manually before ray up (a sketch below shows one way to automate this).
Hope these findings help you.
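One way to automate that last cleanup step, just as a sketch: it assumes your docker section uses container_name: ray_container and relies on initialization_commands running on the host before the container is brought up.

initialization_commands:
  - docker stop ray_container || true   # assumed container name; ignore the error if nothing is running
  - docker rm ray_container || true

This is only a workaround idea for the situation described above, not a confirmed fix.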
Hey folks, I ran into a similar issue when trying to set up an "On Prem" 1 click cluster via Lambda Labs.
I could start the cluster successfully when not using a docker image. But as soon as I switched to the docker image, I ran into the uninitialized issue.
I would get something like
poetry run ray status
======== Autoscaler status: 2024-10-16 21:28:29.027359 ========
Node status
---------------------------------------------------------------
Active:
1 local.cluster.node
Pending:
scrubbed_ip: local.cluster.node, uninitialized
Recent failures:
(no failures)
Resources
---------------------------------------------------------------
Usage:
0.0/8.0 CPU
0B/18.61GiB memory
0B/9.31GiB object_store_memory
Demands:
(no resource demands)
Here's the config.yaml I was using.
cluster_name: test-cluster
upscaling_speed: 1.0

docker:
  container_name: basic-ray-ml-image
  image: rayproject/ray-ml:latest-gpu
  pull_before_run: true

provider:
  type: local
  head_ip: scrubbed_ip
  worker_ips:
    - scrubbed_ip

auth:
  ssh_user: ubuntu
  ssh_private_key: ~/.ssh/keypair

min_workers: 1
max_workers: 1

setup_commands:
  - pip install ray[default]

head_start_ray_commands:
  - ray stop
  - ray start --head --port=6379 --autoscaling-config=~/ray_bootstrap_config.yaml --dashboard-host=0.0.0.0

worker_start_ray_commands:
  - ray stop
  - ray start --address=$RAY_HEAD_IP:6379
I managed to fix this as follows: docker ps would return "permission denied while trying to connect to the Docker daemon socket at unix:///var/run/docker.sock". I fixed this by running sudo usermod -aG docker $USER, exiting the machine and then SSH'ing in again. This might be a Lambda Labs thing. Maybe this helps some people!
I feel like this stems from poor logging / error reporting from the other nodes.
Additionally I don't see any logging or log file.
Even though the instruction from poetry run ray monitor my_cluster.yaml is to find logs at
==> /tmp/ray/session_latest/logs/monitor.out <==
I don't see such a file on any of the nodes:
cat /tmp/ray/session_latest/logs/monitor.out
cat: /tmp/ray/session_latest/logs/monitor.out: No such file or directory
What happened + What you expected to happen
Running ray up ray.yaml, I'd expect that all 4 nodes would be set up and join the cluster, as I've set min_workers: 4. ray monitor ray.yaml is showing the nodes as uninitialized though.
Versions / Dependencies
ray 2.6.4 python 3.9.18 manjaro
Reproduction script
ray.yaml
Issue Severity
High: It blocks me from completing my task.