marianogabitto opened this issue 1 year ago
I am monitoring this issue so let me know if I need to convey more information or if I need to run or save any log file.
Thanks, Ray team!
I encountered a similar issue. In my case, the _temp_dir was not writable from the cluster job, so I pointed it at a writable path, e.g. /home/username:
ray.init(num_cpus=num_cpus, num_gpus=num_gpus, _temp_dir=f"/home/mfeil/tmp", include_dashboard=False, ignore_reinit_error=True)
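In case it helps, here is a minimal sketch of how to sanity-check that the chosen _temp_dir is actually writable before calling ray.init; the path and resource counts are placeholders, not a recommendation:

```python
import os
import tempfile

import ray

# Example path only; substitute a directory your cluster job can write to.
temp_dir = os.path.expanduser("~/ray_tmp")
os.makedirs(temp_dir, exist_ok=True)

# Creating (and immediately removing) a temp file raises if the directory
# is not writable, which was the root cause in my case.
with tempfile.NamedTemporaryFile(dir=temp_dir):
    pass

ray.init(num_cpus=4, _temp_dir=temp_dir, include_dashboard=False, ignore_reinit_error=True)
```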
This has not solved it for me. It still depends on the number of CPUs requested.
Again, I allocate 112 CPUs with Slurm.
######### THIS WORKS - 10 CPUS #########
import ray
ray.init(include_dashboard=False, num_cpus=10, num_gpus=4, _temp_dir=f"/scratch/", ignore_reinit_error=True)

######### THIS DOES NOT WORK - 20 CPUS #########
import ray
ray.init(include_dashboard=False, num_cpus=20, num_gpus=4, _temp_dir=f"/scratch/", ignore_reinit_error=True)
2022-11-06 18:21:12,545 INFO worker.py:1518 -- Started a local Ray instance.
RayContext(dashboard_url=None, python_version='3.9.13', ray_version='2.0.1', ray_commit='03b6bc7b5a305877501110ec04710a9c57011479', address_info={'node_ip_address': '172.20.6.22', 'raylet_ip_address': '172.20.6.22', 'redis_address': None, 'object_store_address': '/scratch/session_2022-11-06_18-21-09_979887_243927/sockets/plasma_store', 'raylet_socket_name': '/scratch/session_2022-11-06_18-21-09_979887_243927/sockets/raylet', 'webui_url': None, 'session_dir': '/scratch/session_2022-11-06_18-21-09_979887_243927', 'metrics_export_port': 62846, 'gcs_address': '172.20.6.22:54835', 'address': '172.20.6.22:54835', 'dashboard_agent_listen_port': 52365, 'node_id': '1673a663ec1b58a2c8924abaf65438d32990ce10ee323cf216260d47'})
(raylet) [2022-11-06 18:21:42,458 E 243968 244019] (raylet) agent_manager.cc:134: The raylet exited immediately because the Ray agent failed. The raylet fate shares with the agent. This can happen because the Ray agent was unexpectedly killed or failed. See dashboard_agent.log
for the root cause.
######### AFTER FAILURE: CONTENT OF /scratch/session_2022-11-06_18-21-09_979887_243927/logs/raylet.err
[2022-11-06 18:21:42,458 E 243968 244019] (raylet) agent_manager.cc:134: The raylet exited immediately because the Ray agent failed. The raylet fate shares with the agent. This can happen because the Ray agent was unexpectedly killed or failed. See dashboard_agent.log
for the root cause.
######### AFTER FAILURE: LAST LINES OF /scratch/session_2022-11-06_18-21-09_979887_243927/logs/raylet.out
[2022-11-06 18:21:12,456 I 243968 243968] (raylet) accessor.cc:608: Received notification for node id = 1673a663ec1b58a2c8924abaf65438d32990ce10ee323cf216260d47, IsAlive = 1
[2022-11-06 18:21:12,555 I 243968 243968] (raylet) worker_pool.cc:447: Started worker process with pid 244037, the token is 0
[2022-11-06 18:21:12,556 I 243968 243968] (raylet) worker_pool.cc:447: Started worker process with pid 244038, the token is 1
[2022-11-06 18:21:12,557 I 243968 243968] (raylet) worker_pool.cc:447: Started worker process with pid 244039, the token is 2
[2022-11-06 18:21:12,559 I 243968 243968] (raylet) worker_pool.cc:447: Started worker process with pid 244040, the token is 3
[2022-11-06 18:21:12,560 I 243968 243968] (raylet) worker_pool.cc:447: Started worker process with pid 244041, the token is 4
[2022-11-06 18:21:12,563 I 243968 243968] (raylet) worker_pool.cc:447: Started worker process with pid 244042, the token is 5
[2022-11-06 18:21:12,565 I 243968 243968] (raylet) worker_pool.cc:447: Started worker process with pid 244043, the token is 6
[2022-11-06 18:21:12,566 I 243968 243968] (raylet) worker_pool.cc:447: Started worker process with pid 244044, the token is 7
[2022-11-06 18:21:12,567 I 243968 243968] (raylet) worker_pool.cc:447: Started worker process with pid 244045, the token is 8
[2022-11-06 18:21:12,569 I 243968 243968] (raylet) worker_pool.cc:447: Started worker process with pid 244046, the token is 9
[2022-11-06 18:21:12,571 I 243968 243968] (raylet) worker_pool.cc:447: Started worker process with pid 244047, the token is 10
[2022-11-06 18:21:12,573 I 243968 243968] (raylet) worker_pool.cc:447: Started worker process with pid 244048, the token is 11
[2022-11-06 18:21:12,579 I 243968 243968] (raylet) worker_pool.cc:447: Started worker process with pid 244049, the token is 12
[2022-11-06 18:21:12,587 I 243968 243968] (raylet) worker_pool.cc:447: Started worker process with pid 244050, the token is 13
[2022-11-06 18:21:12,593 I 243968 243968] (raylet) worker_pool.cc:447: Started worker process with pid 244051, the token is 14
[2022-11-06 18:21:12,595 I 243968 243968] (raylet) worker_pool.cc:447: Started worker process with pid 244052, the token is 15
[2022-11-06 18:21:12,596 I 243968 243968] (raylet) worker_pool.cc:447: Started worker process with pid 244053, the token is 16
[2022-11-06 18:21:12,598 I 243968 243968] (raylet) worker_pool.cc:447: Started worker process with pid 244054, the token is 17
[2022-11-06 18:21:12,599 I 243968 243968] (raylet) worker_pool.cc:447: Started worker process with pid 244055, the token is 18
[2022-11-06 18:21:12,600 I 243968 243968] (raylet) worker_pool.cc:447: Started worker process with pid 244056, the token is 19
[2022-11-06 18:21:21,446 W 243968 243986] (raylet) metric_exporter.cc:207: [1] Export metrics to agent failed: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details: . This won't affect Ray, but you can lose metrics from the cluster.
[2022-11-06 18:21:41,680 I 243968 243992] (raylet) object_store.cc:35: Object store current usage 8e-09 / 157.089 GB.
[2022-11-06 18:21:42,293 I 243968 243968] (raylet) node_manager.cc:599: New job has started. Job id 01000000 Driver pid 243927 is dead: 0 driver address: 172.20.6.22
[2022-11-06 18:21:42,293 I 243968 243968] (raylet) worker_pool.cc:636: Job 01000000 already started in worker pool.
[2022-11-06 18:21:42,447 W 243968 243968] (raylet) agent_manager.cc:115: Agent process expected id 424238335 timed out before registering. ip , id 0
[2022-11-06 18:21:42,458 W 243968 244019] (raylet) agent_manager.cc:131: Agent process with id 424238335 exited, return value 0. ip . id 0
[2022-11-06 18:21:42,458 E 243968 244019] (raylet) agent_manager.cc:134: The raylet exited immediately because the Ray agent failed. The raylet fate shares with the agent. This can happen because the Ray agent was unexpectedly killed or failed. See dashboard_agent.log
for the root cause.
I also hit this issue:
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
class_name: BaseTrainer.as_trainable.<locals>.TrainTrainable
actor_id: 4a0c820541ef7de7e95f887801000000
namespace: cddfd55e-aa4a-4208-baa0-ccf6483f1ec5
The actor is dead because its owner has died. Owner Id: 01000000ffffffffffffffffffffffffffffffffffffffffffffffff Owner Ip address: 10.140.24.68 Owner worker exit type: SYSTEM_ERROR Worker exit detail: Owner's node has crashed.
The actor never ran - it was cancelled before it started running.
This is still open. If anyone from the Ray team tells me how to proceed, I can help troubleshoot.
Pinging this issue to keep it open.
Interesting, I keep running into variations of this as well on our slurm cluster.
@marianogabitto Are you saying that for you this happens even if you have the node allocated exclusively, i.e. no other slurm jobs running on that physical machine?
I have so far noticed two things that contribute to crashes for me: with sbatch -c1 in Slurm (i.e. just a single CPU core), it will fail 50% of the time even if I do ray.init(num_cpus=1). On the other hand, sbatch -c2 will make this work most of the time, and it seems to me that sbatch -c2 and ray.init(num_cpus=1) crashes less frequently than sbatch -c2 and ray.init(num_cpus=2), i.e. leaving a little bit of "buffer" in the number of CPU cores helps (see the sketch just below). Is it possible that the way Slurm pins processes to CPU cores interferes with how Ray likes to manage CPU cores? E.g. Ray tries to start two worker processes on different physical cores, but then Slurm puts them on the same core, and that causes problems? This is purely speculation of course, but maybe at least my anecdotal data can be helpful in figuring this out.
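A hedged sketch of that "CPU buffer" workaround; the one-core headroom and the SLURM_CPUS_PER_TASK fallback are illustrative choices only, not a verified fix:

```python
import os

import ray

# Read the Slurm allocation if present (defaults to 2 for illustration) and
# hand Ray one core less than allocated, leaving a small buffer.
allocated = int(os.environ.get("SLURM_CPUS_PER_TASK", "2"))
ray.init(num_cpus=max(1, allocated - 1), include_dashboard=False)
```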
Hi @mgerstgrasser, thanks for reaching out. I am allocating the node exclusively to me. I run "srun -N 1 --exclusive -G 1 --pty bash". I allocate 112 CPUs, 512 GB of RAM, and 1 A100 GPU. I also allocate running time, but it is not relevant here. The good thing is that I can reproduce the issue deterministically.
Could someone with the problem zip the entire directory of log files and upload it? The error is, I think, a red herring: the worker cannot register because (I think) something is wrong on the main node. Either the connect messages are not being delivered, or the raylet process is crashing, or it is sharing fate with the dashboard agent, which is crashing. Perhaps some of the other log files have some hints.
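For example, something along these lines would bundle a session's logs for upload (the session path is copied from the output above and will differ per run):

```python
import shutil

# Zip the whole logs directory of the failing session so it can be attached here.
session_logs = "/scratch/session_2022-11-06_18-21-09_979887_243927/logs"
shutil.make_archive("ray_session_logs", "zip", session_logs)
```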
Matt, here it is. I am running:
(ray) [mg@n246 ~]$ python
Python 3.10.6 | packaged by conda-forge | (main, Aug 22 2022, 20:36:39) [GCC 10.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import ray
>>> ray.init(include_dashboard=False, num_cpus=30, _temp_dir=f"/scratch/")
Here it is with the right file format
Could you also show what is in conda list?
Sure, here are "conda list" and "pip list" from the same conda environment, called ray. It is a fresh installation.
Just in case an additional data point is helpful, I previously also uploaded logs once: https://github.com/ray-project/ray/issues/21479#issuecomment-1218347408
I later thought I had fixed the issue by setting num_cpus, but it turned out it merely made it less frequent.
Note that raylet.out exits before really finishing the startup cycle. The dashboard_agent.log is missing: for some reason it apparently was not created.
(raylet) metric_exporter.cc:207: [1] Export metrics to agent failed: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details: . This won't affect Ray, but you can lose metrics from the cluster.
(raylet) agent_manager.cc:115: Agent process expected id 424238335 timed out before registering. ip , id 0
(raylet) agent_manager.cc:131: Agent process with id 424238335 exited, return value 0. ip . id 0
(raylet) agent_manager.cc:134: The raylet exited immediately because the Ray agent failed. The raylet fate shares with the agent. This can happen because the Ray agent was unexpectedly killed or failed. See `dashboard_agent.log` for the root cause.`
include_dashboard=False ????
I see you have conda's ray-core==2.0.1, which was uploaded only last week. Could you try using ray-serve? Perhaps ray-core is missing needed functionality.
Matt, I have just re-run it with the dashboard, so you now have the log file. Trying ray-serve.
Thanks!
conda install ray-serve
Installed a fresh environment with ray-all; same problem. I have not solved it.
conda config --env --add channels conda-forge
conda config --set channel_priority strict
conda create -n rayserve python=3.10
conda install ray-all
:(
I just tried to reproduce your exact script on my cluster, and for me there was no error. (But plenty of occurrences of seemingly the same error otherwise, randomly.)
I ran salloc -n1 --exclusive -t 1:00:00 -p test and then more or less your reproduction script, with a sleep(600) after the ray.init(num_cpus=20) just in case. No error, even if I increase num_cpus, even beyond the number of physical cores in the machine... No GPU on that machine, and only 48 CPU cores though.

I'm attaching my conda list output just in case there's a package difference that's causing the problem for you. If I can help in any other way, let me know. I'm keen on getting to the bottom of this too.
@mgerstgrasser Quick question: how long does it take from the moment you run ray.init() until it finishes? Mine depends on the number of CPUs. Is that the case for you too?
If I run it without the sleep() after the ray.init(), it takes 15-30 seconds, but it doesn't seem to correlate with the number of CPUs. If anything it got faster the more often I tried in a row, but independently of the number of CPU cores I put into ray.init(). See below for the first few measurements (a reconstructed sketch of test.py follows the measurements); I did a few more after that but no big difference, it mostly took around 15-20s.
10 cpus:
$ time python test.py
2022-11-13 18:02:07,586 INFO worker.py:1518 -- Started a local Ray instance.
real 0m31.615s
user 0m3.551s
sys 0m0.890s
20 cpus:
$ time python test.py
2022-11-13 18:02:56,960 INFO worker.py:1518 -- Started a local Ray instance.
real 0m19.792s
user 0m3.610s
sys 0m0.891s
48 cpus:
$ time python test.py
2022-11-13 18:03:34,155 INFO worker.py:1518 -- Started a local Ray instance.
real 0m20.639s
user 0m4.028s
sys 0m1.370s
256 cpus (on a 48 core machine)
$ time python test.py
2022-11-13 18:04:13,104 INFO worker.py:1518 -- Started a local Ray instance.
real 0m17.411s
user 0m3.984s
sys 0m1.310s
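For reference, a reconstructed sketch of what my test.py does (the real script may differ slightly; reading the CPU count from the command line is just a convenience here):

```python
import sys
import time

import ray

# Time how long ray.init() takes for a given CPU count, then shut down cleanly.
num_cpus = int(sys.argv[1]) if len(sys.argv) > 1 else 20
start = time.time()
ray.init(num_cpus=num_cpus, include_dashboard=False)
print(f"ray.init(num_cpus={num_cpus}) took {time.time() - start:.1f}s")
ray.shutdown()
```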
More info: I did different installations on a machine with 88 cores. Up to Ray 1.1.0, Ray loads super fast and does not show any problem.
This is what I found when I create a conda environment with python==xxxx and then pip install ray==xx.xx.xx (see the probe sketch below):
* Python 3.7 and ray 0.6.3: max number of CPUs: unlimited == 88
* Python 3.7 and ray 0.8: max number of CPUs: unlimited == 88
* Python 3.7 and ray 1.0.0: max number of CPUs: unlimited == 88
* Python 3.7 and ray 1.1.0: max number of CPUs: unlimited == 88
* Python 3.7 and ray 1.2.0: max number of CPUs: unlimited == 88 (error related to: ModuleNotFoundError: No module named 'aiohttp.frozenlist', protobuf==3.15.3)
* Python 3.7 and ray 1.4.0: max number of CPUs: 40
* Python 3.7 and ray 1.6.0: max number of CPUs: 40
* Python 3.7 and ray 1.8.0: max number of CPUs: 30
* Python 3.7 and ray 1.10: max number of CPUs: 36
* Python 3.10 and ray 1.13: max number of CPUs: 36
* Python 3.10 and ray 2.0: max number of CPUs: 32
It hangs with more than num_cpus == the max number of CPUs.
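Roughly how I probe the largest working num_cpus for a given installation (a sketch; it assumes a bad configuration raises rather than hangs, so a hang still needs an external timeout, e.g. `timeout 120 python probe.py`, where probe.py is just a hypothetical name for this script):

```python
import ray

# Try progressively larger num_cpus values and report which ones start cleanly.
for n in (10, 20, 30, 36, 40, 88):
    try:
        ray.init(num_cpus=n, include_dashboard=False)
        print(f"num_cpus={n}: started OK")
    except Exception as exc:
        print(f"num_cpus={n}: failed with {exc!r}")
    finally:
        ray.shutdown()  # safe to call even if init failed
```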
Regarding "and then pip install ray==xx.xx.xx": pip install or conda install?
For the new version, Ray 2.0, I do a conda install.
For the earlier versions, I do a pip install because I was not sure that conda would have the old versions.
I am an early adopter of Ray, as I was in the lab with Robert and Phillip while they were developing it. At the time, there was no conda install, just pip install, so I reproduced that. In summary, pip install works spectacularly up to version 1.0.0.
FWIW, I've only ever pip installed ray on our cluster (even though it was inside a conda env), and also have issues.
Ping ...
Ping ....
Ping ...
Ping ...
I don't really understand Slurm, but it seems the use of --exclusive when starting the Slurm cluster can be problematic; see issue #30925. Did you verify that the cluster is working as intended before adding Ray on top?
I'm also experiencing this issue with ray 2.2.0, Python 3.8 on Linux. I can reproduce it just by running
>>> import ray
>>> ray.init()
2022-12-24 21:59:47,962 INFO worker.py:1529 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265
[2022-12-24 21:59:50,597 E 73199 73199] core_worker.cc:179: Failed to register worker 01000000ffffffffffffffffffffffffffffffffffffffffffffffff to Raylet. IOError: [RayletClient] Unable to register worker with raylet. No such file or directory
I'll note that this is within a conda environment, and the same command, run from within the same conda environment, works fine (does not raise this error) on two other machines (Mac OS 11.7.2, and interactive nodes on a slurm cluster with GPUs).
Following up here: after digging through the logs, it looks like, at least in my case, this might be related to too many OpenBLAS threads. (This makes sense, because it was happening on a machine with a large number of CPUs, 96 to be exact.)
In my case, I could see messages in the log files similar to the ones described here. I'm guessing that when running ray jobs that use many CPUs, there is some kind of issue with too many threads being used that prevents the workers from registering properly.
The solution in my case was to raise the limit on the number of user processes (ulimit -u), by running
ulimit -u 127590
This resolved the error and at least allowed running the full Ray pipeline. I can't say whether this is a good idea at a system level -- maybe someone else can advise about the pros and cons of this approach -- but it worked for me, ymmv.
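The same adjustment can also be attempted from inside the Python job instead of the shell; this is only a sketch mirroring `ulimit -u`, and the 127590 value is simply what happened to work for me:

```python
import resource

# Inspect the per-user process/thread limit and raise the soft limit if the
# hard limit permits it (RLIMIT_NPROC is what `ulimit -u` adjusts on Linux).
soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
print(f"RLIMIT_NPROC: soft={soft}, hard={hard}")
if hard == resource.RLIM_INFINITY or hard >= 127590:
    resource.setrlimit(resource.RLIMIT_NPROC, (127590, hard))
```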
I wonder if we should add a section to the FAQ about possible causes for this particular error message.
One thing to add, I see there was another new comment in a related issue (now closed, so responding here) that mentioned a similar error when running two Ray instances in two notebooks at the same time: https://github.com/ray-project/ray/issues/21479#issuecomment-1399310106
That would fit with my hypothesis that at least one possible source of this error is from running two instances on the same machine, which is hard to avoid in Slurm. (Unless you do --exclusive to get a whole machine for yourself, but that may not be feasible in practice - on our cluster I think it would take days if not weeks for the scheduler to free up a whole machine!)
I wonder if containerising Ray might be one way to fix a lot of these diffuse issues on Slurm? On our cluster, we can't run docker directly on Slurm, but we can run Singularity, which can import docker images. It's not something I'll be able to work on myself right now, but just to put the idea out there for anyone else who's running into issues and is looking for a possible direction to explore.
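As a concrete (and entirely speculative) sketch of the "two instances colliding" idea: giving each Slurm job its own temp/session directory should at least keep co-located Ray instances from sharing sockets. The path pattern is just an example, not a confirmed fix:

```python
import os

import ray

# Derive a per-job temp directory from the Slurm job id so two Ray instances
# on the same physical node do not write into the same session directory.
job_id = os.environ.get("SLURM_JOB_ID", "local")
ray.init(_temp_dir=f"/tmp/ray_{job_id}", include_dashboard=False)
```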
I have exactly the same problem with Ray 2.3.0. What solved the problem is that I manually run the following command:
ray start --head
and then the script runs fine.
Without doing it manually, tune.fit() will start a local Ray instance and somehow that leads to the above error. I have this problem only on one new machine; the other machines run fine. On the other hand, Slurm jobs run fine without the manual intervention. Sounds like an issue related to the head IP address.
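A minimal sketch of that workaround, assuming the head has already been started manually with `ray start --head` on the same machine:

```python
import ray

# Attach to the already-running local Ray instance instead of starting a new one.
ray.init(address="auto")
```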
I have the same issue, running Ubuntu 20.04 in a Singularity container on SLURM, with Ray versions 2.2.0 and 2.3.1. This seems to be a show-stopper for many people here. I tried all the proposed workarounds, such as _temp_dir=f"/scratch/dx4/tmp", object_store_memory=78643200, and specifying num_cpus=32, num_gpus=1, and nothing worked.
The only thing that worked for me is downgrading Ray to 1.10.0, but this is certainly just a temporary solution.
This issue has collected a number of different reports, I think I saw these:
* the head node dies
* worker nodes fail to register with the proper head node when more than one is running
* worker nodes die when starting up

All of these can apparently lead to the log message "Failed to register worker".
When commenting "same issue", please be more specific: what exactly did you try, on what hardware, and what happened.
Would it make sense and be possible to have Ray emit a more detailed error message here? One thing that makes it hard for me to report the problem in more detail is that the main log only shows the "Failed to register worker" and "IOError: [RayletClient] Unable to register worker with raylet. No such file or directory" messages. And it's impossible for me to figure out what other logs or information could be relevant.
At the very least, could Ray log which file the "no such file or directory" message refers to?
I usually am able to grep for the "No such file or directory" message in the log directory.
Thanks, it works for me after changing it to ray.init(num_cpus=1).
Hey people,
I had the same error as mentioned earlier. However, after I ran "pip uninstall grpcio" and then reinstalled it with conda ("conda install grpcio"), the error was gone and it's working fine for me now! Peace.
The pip uninstall grpcio / conda install grpcio trick didn't work for me. Also having issues under Slurm (outside of Slurm it seems to be OK).
Can confirm that downgrading to Ray 1.13 fixes the issue.
I also have this issue, working from a singularity container. Is there anything I should try to copy / print to help move the debugging forward?
Also, for me it seems to be a problem related to some persistent files? My installation worked for a while, but when I tried to better utilize the node I was working on (adding more GPUs and CPUs), I started getting the error. Now, after reverting to the old code, the error persists. I am only using 1 GPU and 10 CPUs at this point, so I doubt it's related to the number of OpenBLAS threads.
Hi @mattip,
My version of ray is 2.4.0, and it was installed using pip. I do not have grpcio installed.
What happened + What you expected to happen
I can't start ray.
I instantiate a node in a slurm cluster using:
srun -n 1 --exclusive -G 1 --pty bash
This allocates a node with 112 cpus and 4 gpus.
Then, within python:
import ray
ray.init(num_cpus=20)
2022-11-03 21:17:31,752 INFO worker.py:1509 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265
[2022-11-03 21:18:32,436 E 251378 251378] core_worker.cc:149: Failed to register worker 01000000ffffffffffffffffffffffffffffffffffffffffffffffff to Raylet. IOError: [RayletClient] Unable to register worker with raylet. No such file or directory
On a different test:
import ray
ray.init(ignore_reinit_error=True, num_cpus=10)
2022-11-03 21:19:01,734 INFO worker.py:1509 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265
RayContext(dashboard_url='127.0.0.1:8265', python_version='3.9.13', ray_version='2.0.1', ray_commit='03b6bc7b5a305877501110ec04710a9c57011479', address_info={'node_ip_address': '172.20.6.24', 'raylet_ip_address': '172.20.6.24', 'redis_address': None, 'object_store_address': '/scratch/fast/6920449/ray/session_2022-11-03_21-18-49_765770_252630/sockets/plasma_store', 'raylet_socket_name': '/scratch/fast/6920449/ray/session_2022-11-03_21-18-49_765770_252630/sockets/raylet', 'webui_url': '127.0.0.1:8265', 'session_dir': '/scratch/fast/6920449/ray/session_2022-11-03_21-18-49_765770_252630', 'metrics_export_port': 62537, 'gcs_address': '172.20.6.24:49967', 'address': '172.20.6.24:49967', 'dashboard_agent_listen_port': 52365, 'node_id': '0debcceedbef73619ccc8347450f5086693743e005ba9e907ae98c78'})
Versions / Dependencies
DEPENDENCIES: Python 3.9.13 | packaged by conda-forge | (main, May 27 2022, 16:58:50) [GCC 10.3.0] on linux
RAY VERSION: 2.0.1
INSTALLATION: pip install -U "ray[default]"
grpcio: 1.43.0
Reproduction script
import ray
ray.init(num_cpus=20)
Issue Severity
High: It blocks me from completing my task.