[Core] Failed to register worker . Slurm - srun -

marianogabitto commented 1 year ago

What happened + What you expected to happen

I can't start ray.

I instantiate a node in a slurm cluster using:

srun -n 1 --exclusive -G 1 --pty bash

This allocates a node with 112 cpus and 4 gpus.

Then, within python:

import ray ray.init(num_cpus=20) 2022-11-03 21:17:31,752 INFO worker.py:1509 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265 [2022-11-03 21:18:32,436 E 251378 251378] core_worker.cc:149: Failed to register worker 01000000ffffffffffffffffffffffffffffffffffffffffffffffff to Raylet. IOError: [RayletClient] Unable to register worker with raylet. No such file or directory

On a different test: mport ray ray.init(ignore_reinit_error=True, num_cpus=10) 2022-11-03 21:19:01,734 INFO worker.py:1509 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265 RayContext(dashboard_url='127.0.0.1:8265', python_version='3.9.13', ray_version='2.0.1', ray_commit='03b6bc7b5a305877501110ec04710a9c57011479', address_info={'node_ip_address': '172.20.6.24', 'raylet_ip_address': '172.20.6.24', 'redis_address': None, 'object_store_address': '/scratch/fast/6920449/ray/session_2022-11-03_21-18-49_765770_252630/sockets/plasma_store', 'raylet_socket_name': '/scratch/fast/6920449/ray/session_2022-11-03_21-18-49_765770_252630/sockets/raylet', 'webui_url': '127.0.0.1:8265', 'session_dir': '/scratch/fast/6920449/ray/session_2022-11-03_21-18-49_765770_252630', 'metrics_export_port': 62537, 'gcs_address': '172.20.6.24:49967', 'address': '172.20.6.24:49967', 'dashboard_agent_listen_port': 52365, 'node_id': '0debcceedbef73619ccc8347450f5086693743e005ba9e907ae98c78'})

(raylet) [2022-11-03 21:19:31,639 E 252725 252765] (raylet) agent_manager.cc:134: The raylet exited immediately because the Ray agent failed. The raylet fate shares with the agent. This can happen because the Ray agent was unexpectedly killed or failed. See dashboard_agent.log for the root cause. 2022-11-03 21:20:00,798 WARNING worker.py:1829 -- The node with node id: 0debcceedbef73619ccc8347450f5086693743e005ba9e907ae98c78 and address: 172.20.6.24 and node name: 172.20.6.24 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a (1) raylet crashes unexpectedly (OOM, preempted node, etc.) (2) raylet has lagging heartbeats due to slow network or busy workload.

Versions / Dependencies

DEPENDENCIES: Python 3.9.13 | packaged by conda-forge | (main, May 27 2022, 16:58:50) [GCC 10.3.0] on linux

RAY VERSION: 2.0.1 INSTALLATION: pip install -U "ray[default]" grpcio: 1.43.0

Reproduction script

import ray ray.init(num_cpus=20)

Issue Severity

High: It blocks me from completing my task.

marianogabitto commented 1 year ago

I am monitoring this issue so let me know if I need to convey more information or if I need to run or save any log file.

Thanks Ray Team !

michaelfeil commented 1 year ago

I encountered a similar issue. In my case, the _temp_dir was not writable from the cluster job. I added e.g. /home/username ray.init(num_cpus=num_cpus, num_gpus=num_gpus, _temp_dir=f"/home/mfeil/tmp", include_dashboard=False, ignore_reinit_error=True)

marianogabitto commented 1 year ago

This has not solved it for me . It is still dependent on the number of CPUs initiated.

Again, I allocate with slurm 112 cpus.

#########. THIS WORKS - 10 CPUS ######### import ray ray.init(include_dashboard=False, num_cpus=10, num_gpus=4, _temp_dir=f"/scratch/", ignore_reinit_error=True)

#########. THIS DOES NOT WORK - 20 CPUS ######### import ray ray.init(include_dashboard=False, num_cpus=20, num_gpus=4, _temp_dir=f"/scratch/", ignore_reinit_error=True)

2022-11-06 18:21:12,545 INFO worker.py:1518 -- Started a local Ray instance.

RayContext(dashboard_url=None, python_version='3.9.13', ray_version='2.0.1', ray_commit='03b6bc7b5a305877501110ec04710a9c57011479', address_info={'node_ip_address': '172.20.6.22', 'raylet_ip_address': '172.20.6.22', 'redis_address': None, 'object_store_address': '/scratch/session_2022-11-06_18-21-09_979887_243927/sockets/plasma_store', 'raylet_socket_name': '/scratch/session_2022-11-06_18-21-09_979887_243927/sockets/raylet', 'webui_url': None, 'session_dir': '/scratch/session_2022-11-06_18-21-09_979887_243927', 'metrics_export_port': 62846, 'gcs_address': '172.20.6.22:54835', 'address': '172.20.6.22:54835', 'dashboard_agent_listen_port': 52365, 'node_id': '1673a663ec1b58a2c8924abaf65438d32990ce10ee323cf216260d47'})

(raylet) [2022-11-06 18:21:42,458 E 243968 244019] (raylet) agent_manager.cc:134: The raylet exited immediately because the Ray agent failed. The raylet fate shares with the agent. This can happen because the Ray agent was unexpectedly killed or failed. See dashboard_agent.log for the root cause.

#########. AFTER FAILURE: CONTENT OF /SCRATCH/session_2022-11-06_18-21-09_979887_243927/logs/raylet.err

[2022-11-06 18:21:42,458 E 243968 244019] (raylet) agent_manager.cc:134: The raylet exited immediately because the Ray agent failed. The raylet fate shares with the agent. This can happen because the Ray agent was unexpectedly killed or failed. See dashboard_agent.log for the root cause.

#########. AFTER FAILURE: LAST LINES OF /SCRATCH/session_2022-11-06_18-21-09_979887_243927/logs/raylet.out

[2022-11-06 18:21:12,456 I 243968 243968] (raylet) accessor.cc:608: Received notification for node id = 1673a663ec1b58a2c8924abaf65438d32990ce10ee323cf216260d47, IsAlive = 1 [2022-11-06 18:21:12,555 I 243968 243968] (raylet) worker_pool.cc:447: Started worker process with pid 244037, the token is 0 [2022-11-06 18:21:12,556 I 243968 243968] (raylet) worker_pool.cc:447: Started worker process with pid 244038, the token is 1 [2022-11-06 18:21:12,557 I 243968 243968] (raylet) worker_pool.cc:447: Started worker process with pid 244039, the token is 2 [2022-11-06 18:21:12,559 I 243968 243968] (raylet) worker_pool.cc:447: Started worker process with pid 244040, the token is 3 [2022-11-06 18:21:12,560 I 243968 243968] (raylet) worker_pool.cc:447: Started worker process with pid 244041, the token is 4 [2022-11-06 18:21:12,563 I 243968 243968] (raylet) worker_pool.cc:447: Started worker process with pid 244042, the token is 5 [2022-11-06 18:21:12,565 I 243968 243968] (raylet) worker_pool.cc:447: Started worker process with pid 244043, the token is 6 [2022-11-06 18:21:12,566 I 243968 243968] (raylet) worker_pool.cc:447: Started worker process with pid 244044, the token is 7 [2022-11-06 18:21:12,567 I 243968 243968] (raylet) worker_pool.cc:447: Started worker process with pid 244045, the token is 8 [2022-11-06 18:21:12,569 I 243968 243968] (raylet) worker_pool.cc:447: Started worker process with pid 244046, the token is 9 [2022-11-06 18:21:12,571 I 243968 243968] (raylet) worker_pool.cc:447: Started worker process with pid 244047, the token is 10 [2022-11-06 18:21:12,573 I 243968 243968] (raylet) worker_pool.cc:447: Started worker process with pid 244048, the token is 11 [2022-11-06 18:21:12,579 I 243968 243968] (raylet) worker_pool.cc:447: Started worker process with pid 244049, the token is 12 [2022-11-06 18:21:12,587 I 243968 243968] (raylet) worker_pool.cc:447: Started worker process with pid 244050, the token is 13 [2022-11-06 18:21:12,593 I 243968 243968] (raylet) worker_pool.cc:447: Started worker process with pid 244051, the token is 14 [2022-11-06 18:21:12,595 I 243968 243968] (raylet) worker_pool.cc:447: Started worker process with pid 244052, the token is 15 [2022-11-06 18:21:12,596 I 243968 243968] (raylet) worker_pool.cc:447: Started worker process with pid 244053, the token is 16 [2022-11-06 18:21:12,598 I 243968 243968] (raylet) worker_pool.cc:447: Started worker process with pid 244054, the token is 17 [2022-11-06 18:21:12,599 I 243968 243968] (raylet) worker_pool.cc:447: Started worker process with pid 244055, the token is 18 [2022-11-06 18:21:12,600 I 243968 243968] (raylet) worker_pool.cc:447: Started worker process with pid 244056, the token is 19 [2022-11-06 18:21:21,446 W 243968 243986] (raylet) metric_exporter.cc:207: [1] Export metrics to agent failed: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details: . This won't affect Ray, but you can lose metrics from the cluster. [2022-11-06 18:21:41,680 I 243968 243992] (raylet) object_store.cc:35: Object store current usage 8e-09 / 157.089 GB. [2022-11-06 18:21:42,293 I 243968 243968] (raylet) node_manager.cc:599: New job has started. Job id 01000000 Driver pid 243927 is dead: 0 driver address: 172.20.6.22 [2022-11-06 18:21:42,293 I 243968 243968] (raylet) worker_pool.cc:636: Job 01000000 already started in worker pool. [2022-11-06 18:21:42,447 W 243968 243968] (raylet) agent_manager.cc:115: Agent process expected id 424238335 timed out before registering. ip , id 0 [2022-11-06 18:21:42,458 W 243968 244019] (raylet) agent_manager.cc:131: Agent process with id 424238335 exited, return value 0. ip . id 0 [2022-11-06 18:21:42,458 E 243968 244019] (raylet) agent_manager.cc:134: The raylet exited immediately because the Ray agent failed. The raylet fate shares with the agent. This can happen because the Ray agent was unexpectedly killed or failed. See dashboard_agent.log for the root cause.

Qinghao-Hu commented 1 year ago

I also meet the issue:

ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
    class_name: BaseTrainer.as_trainable.<locals>.TrainTrainable
    actor_id: 4a0c820541ef7de7e95f887801000000
    namespace: cddfd55e-aa4a-4208-baa0-ccf6483f1ec5
The actor is dead because its owner has died. Owner Id: 01000000ffffffffffffffffffffffffffffffffffffffffffffffff Owner Ip address: 10.140.24.68 Owner worker exit type: SYSTEM_ERROR Worker exit detail: Owner's node has crashed.
The actor never ran - it was cancelled before it started running.

marianogabitto commented 1 year ago

This is still open. If anyone from the ray team tells me how to proceed. I can help troubleshoot.

marianogabitto commented 1 year ago

Pinging this issue to keep it open.

mgerstgrasser commented 1 year ago

Interesting, I keep running into variations of this as well on our slurm cluster.

@marianogabitto Are you saying that for you this happens even if you have the node allocated exclusively, i.e. no other slurm jobs running on that physical machine?

I have so far noticed two things that contribute to crashes for me:

Having two slurm jobs on the same physical machine each starting their own completely separate ray instance. - But it sounds like this can be ruled out in your case!
Something to do with how slurm pins processes to CPU cores. While I've never been able to reproduce crashes deterministically, I have noticed that they happen much, much more often with fewer CPU cores allocated. E.g. if I run a job with sbatch -c1 in slurm (ie. just a single CPU core), it will fail 50% of the time even if I do ray.init(num_cpus=1). On the other hand, sbatch -c2 will make this work most of the time, and it seems to me that sbatch -c2 and ray.init(num_cpus=1) crashes less frequently than sbatch -c2 and ray.init(num_cpus=2), i.e. leaving a little bit of "buffer" in the number of CPU cores helps. Is it possible that the way slurm pins processes to CPU cores interferes with how Ray likes to manage CPU cores? E.g. Ray tries to start two worker processses on different physical cores, but then Slurm puts them on the same core, and that causes problems?

This is purely speculation of course, but maybe at least my anecdotal data can be helpful in figuring this out.

marianogabitto commented 1 year ago

Hi @mgerstgrasser , thanks for reaching out. I am allocating the node exclusively to me. I run "srun -N 1 --exclusive -G 1 --pty bash" . I allocate 112 cpus, 512 gb ram and 1 GPU A100. I also allocate running time but it is not relevant here.The good thing is that I can reproduce the issue deterministically.

This is not the case and can be ruled out.
I always try to start ray with ~20 cpus. Thanks, M

mattip commented 1 year ago

Could someone with the problem zip the entire directory of log files and upload it? The error is, I think, a red herring: the worker cannot register because (I think) something is wrong with the main node. Either the connect messages are not being delivered or the raylet process is crashin, or it is sharing fate with the dashboard agent which is crashing. Perhaps some of the other log files have some hints.

marianogabitto commented 1 year ago

Matt, here it is . I am running:

(ray) [mg@n246 ~]$ python
Python 3.10.6 | packaged by conda-forge | (main, Aug 22 2022, 20:36:39) [GCC 10.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import ray
>>> ray.init(include_dashboard=False, num_cpus=30, _temp_dir=f"/scratch/")

marianogabitto commented 1 year ago

Here it is with the right file format

session.zip

mattip commented 1 year ago

could you also show what is in conda list?

marianogabitto commented 1 year ago

Sure, here it is "conda list" and "pip list" in the same condo environment called ray. It is a fresh installation.

conda_list.txt pip_list.txt

mgerstgrasser commented 1 year ago

Just in case an additional data point is helpful, I previously also uploaded logs once: https://github.com/ray-project/ray/issues/21479#issuecomment-1218347408 I later thought I had fixed the issue by setting num_cpu, but it turned out it merely made it less frequent.

mattip commented 1 year ago

Note that raylet.out exits before really finishing the startup cycle. The dashboard_agent.log is missing: for some reason it apparently was not created.

(raylet) metric_exporter.cc:207: [1] Export metrics to agent failed: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details: . This won't affect Ray, but you can lose metrics from the cluster.
(raylet) agent_manager.cc:115: Agent process expected id 424238335 timed out before registering. ip , id 0
(raylet) agent_manager.cc:131: Agent process with id 424238335 exited, return value 0. ip . id 0
(raylet) agent_manager.cc:134: The raylet exited immediately because the Ray agent failed. The raylet fate shares with the agent. This can happen because the Ray agent was unexpectedly killed or failed. See `dashboard_agent.log` for the root cause.`

marianogabitto commented 1 year ago

include_dashboard=False ????

mattip commented 1 year ago

I see you have conda's ray-core==2.0.1 which was uploaded only last week. Could you try using ray-serve? Perhaps ray-core is missing needed functionality

marianogabitto commented 1 year ago

Matt, I have just re-run it with the dashboard so you have now the log file. Trying ray serve.

session2.zip

Thanks !

mattip commented 1 year ago

conda install ray-serve

marianogabitto commented 1 year ago

Installed fresh repo with ray-all, same problem. I have not solved it.

conda config --env --add channels conda-forge conda config --set channel_priority strict

conda create -n rayserve python=3.10 conda install ray-all

marianogabitto commented 1 year ago

:(

mgerstgrasser commented 1 year ago

I just tried to reproduce your exact script on my cluster, and for me no error. (But plenty of occurences of seemingly the same error otherwise, randomly).

I ran salloc -n1 --exclusive -t 1:00:00 -p test and then more or less your reproduction script, with a sleep(600) after the ray.init(num_cpus=20) just in case. No error, even if I increase num_cpus, even beyond the number of physical cores in the machine... No GPU on that machine, and only 48 CPU cores though.

I'm attaching my conda list output just in case there's a package difference that's causing the problem for you. If I can help in any other way let me know. I'm keen on getting to the bottom of this too.

conda.txt

marianogabitto commented 1 year ago

@mgerstgrasser Quick question, how long does it take from the moment that you run ray.init() until finish ? Mine depends on the number of cpus . Is that your case too ?

mgerstgrasser commented 1 year ago

If I run it without the sleep() after the ray.init(), it takes 15-30 seconds, but doesn't seem to correlate with the number of CPUs. If anything it got faster the more often I tried in a row, but independently of the number of cpu cores I put into ray.init(). See below for the first few measurements, I did a few more after that but no big difference, it mostly took around 15-20s.

10 cpus:

$ time python test.py
2022-11-13 18:02:07,586 INFO worker.py:1518 -- Started a local Ray instance.

real    0m31.615s
user    0m3.551s
sys     0m0.890s

20 cpus:

$ time python test.py
2022-11-13 18:02:56,960 INFO worker.py:1518 -- Started a local Ray instance.

real    0m19.792s
user    0m3.610s
sys     0m0.891s

48 cpus:

$ time python test.py
2022-11-13 18:03:34,155 INFO worker.py:1518 -- Started a local Ray instance.

real    0m20.639s
user    0m4.028s
sys     0m1.370s

256 cpus (on a 48 core machine)

$ time python test.py
2022-11-13 18:04:13,104 INFO worker.py:1518 -- Started a local Ray instance.

real    0m17.411s
user    0m3.984s
sys     0m1.310s

marianogabitto commented 1 year ago

More Info. I did different installations on a machine with 88 cores. Up to ray 1.1.0, ray loads super fast and it does not show any problem.

This is what I found when : conda environment creates the python==xxxx repo and then pip install ray==xx.xx.xx

Python 3.7 and ray0.6.3: Max number cpus: unlimited==88 Python 3.7 and ray0.8: Max number cpus: unlimited==88 Python 3.7 and ray1.0.0: Max number cpus: unlimited==88

========== Python 3.7 and ray1.1.0: Max number cpus: unlimoted==88 Python 3.7 and ray1.2.0: Max number cpus: unlimited==88 Error related to: ModuleNotFoundError: No module named 'aiohttp.frozenlist' protobuf==3.15.3,

========== Python 3.7 and ray1.4.0: Max number cpus: 40 Python 3.7 and ray1.6.0: Max number cpus: 40 Python 3.7 and ray1.8.0: Max number cpus: 30 python 3.7 and ray 1.10: Max number cpus: 36 python 3.10 and ray 1.13: Max number cpus: 36 Python 3.10 and ray 2.0: Max number cpus: 32 Hangs with more than num_cpus==max number cpus

mattip commented 1 year ago

and then pip install ray==xx.xx.xx

pip install or conda install ?

marianogabitto commented 1 year ago

For the new version Ray 2.0. I do a conda install.

For the earlier versions, I do pip install because I was not sure that conda will have the old versions.

I am a first adopter of RAY as I was with Robert and Phillip in the lab while they were developing it. At the time, there was not a conda install, just a pip install. So, I reproduced that. In summary, pip install works spectacularly up to version 1.0.0

mgerstgrasser commented 1 year ago

FWIW, I've only ever pip installed ray on our cluster (even though it was inside a conda env), and also have issues.

marianogabitto commented 1 year ago

Ping ...

marianogabitto commented 1 year ago

Ping ....

marianogabitto commented 1 year ago

Ping ...

marianogabitto commented 1 year ago

Ping ...

mattip commented 1 year ago

I don't really understand slurm, but it seems the use of --exclusive when starting the slurm cluster can be problematic, see issue #30925. Did you verify that the cluster is working as intended before adding on ray?

jpgard commented 1 year ago

I'm also experiencing this issue with ray 2.2.0, Python 3.8 on Linux. I can reproduce it just by running

>>> import ray
>>> ray.init()
2022-12-24 21:59:47,962 INFO worker.py:1529 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265 
[2022-12-24 21:59:50,597 E 73199 73199] core_worker.cc:179: Failed to register worker 01000000ffffffffffffffffffffffffffffffffffffffffffffffff to Raylet. IOError: [RayletClient] Unable to register worker with raylet. No such file or directory

I'll note that this is within a conda environment, and the same command, run from within the same conda environment, works fine (does not raise this error) on two other machines (Mac OS 11.7.2, and interactive nodes on a slurm cluster with GPUs).

jpgard commented 1 year ago

Following up here, after hacking in the logs, it looks like at least in my case, this might be related to too many OpenBLAS threads. (This makes sense, because it was happening on a machine with a large number of CPUs, 96 to be exact.)

In my case, I could see messages in the log files similar to the ones described here. I'm guessing that when running ray jobs that use many CPUs, there is some kind of issue with too many threads being used that prevents the workers from registering properly.

The solution in my case was to increase the number of pending signals, by running

ulimit -u 127590

This resolved the error and at least allowed running the full Ray pipeline. I can't say whether this is a good idea at a system level -- maybe someone else can advise about the pros and cons of this approach -- but it worked for me, ymmv.

mattip commented 1 year ago

I wonder if we should add a section to the FAQ about possible causes for this particular error message.

mgerstgrasser commented 1 year ago

One thing to add, I see there was another new comment in a related issue (now closed, so responding here) that mentioned a similar error when running two Ray instances in two notebooks at the same time: https://github.com/ray-project/ray/issues/21479#issuecomment-1399310106

That would fit with my hypothesis that at least one possible source of this error is from running two instances on the same machine, which is hard to avoid in Slurm. (Unless you do --exclusive to get a whole machine for yourself, but that may not be feasible in practice - on our cluster I think it would take days if not weeks for the scheduler to free up a whole machine!)

I wonder if containerising Ray might be one way to fix a lot of these diffuse issues on Slurm? On our cluster, we can't run docker directly on Slurm, but we can run Singularity, which can import docker images. It's not something I'll be able to work on myself right now, but just to put the idea out there for anyone else who's running into issues and is looking for a possible direction to explore.

wxie2013 commented 1 year ago

I have exactly the same problem in ray 2.3.0. What solved problem is that I manually run the following command

 ray start --head

then the script runs fine.

Without do it manually, tune.fit() will start a local Ray instance and somehow it leads to above error. I have this problem only in one new machine. Other machine runs fine. On the other hand, slurm jobs runs fine without the manual intervention. Sounds like an issue related to head ip_address

dejangrubisic commented 1 year ago

I have the same issue, by running Ubuntu20.04 in Singularity container on SLURM. I use ray.version in [2.2.0, 2.3.1]. This seems to be a show-stopper for many people here. I tried all proposed workarounds such as _temp_dir=f"/scratch/dx4/tmp", object_store_memory=78643200, specifying num_cpus=32, num_gpus=1 and nothing worked.

The only thing that worked for me is downgrading ray to 1.10.0, but this is certainly just a temporary solution.

mattip commented 1 year ago

This issue has collected a number of different reports, I think I saw these:

the head node dies
worker nodes fail to register with the proper head node when more than one is running
worker nodes die when starting up All these can apparently lead to the log message "Failed to register worker"

When commenting "same issue", please be more specific: what exactly did you try, on what hardware, and what happened.

mgerstgrasser commented 1 year ago

This issue has collected a number of different reports, I think I saw these:
* the head node dies

* worker nodes fail to register with the proper head node when more than one is running

* worker nodes die when starting up
  All these can apparently lead to the log message "Failed to register worker"
When commenting "same issue", please be more specific: what exactly did you try, on what hardware, and what happened.

Would it make sense and be possible to have Ray emit a more detailed error message here? One thing that makes it hard for me to report the problem in more detail is that the main log only shows the "Failed to register worker" and "IOError: [RayletClient] Unable to register worker with raylet. No such file or directory" messages. And it's impossible for me to figure out what other logs or information could be relevant.

At the very least, could Ray log which file the "no such file or directory" message refers to?

mattip commented 1 year ago

I usually am able to grep for the "No such file or directory" message in the log directory

ZixuWang commented 1 year ago

Following up here, after hacking in the logs, it looks like at least in my case, this might be related to too many OpenBLAS threads. (This makes sense, because it was happening on a machine with a large number of CPUs, 96 to be exact.)

In my case, I could see messages in the log files similar to the ones described here. I'm guessing that when running ray jobs that use many CPUs, there is some kind of issue with too many threads being used that prevents the workers from registering properly.

The solution in my case was to increase the number of pending signals, by running
ulimit -u 127590
This resolved the error and at least allowed running the full Ray pipeline. I can't say whether this is a good idea at a system level -- maybe someone else can advise about the pros and cons of this approach -- but it worked for me, ymmv.

thanks, it works for me after modifying it as ray.init(num_cpus=1)

David2265 commented 1 year ago

Hey people,

I has the same error as mentioned earlier, however, after I do "pip uninstall grpcio" and then reinstall using conda "conda install grpcio".

The error gone. and its working fine for me now! Peace.

dlwh commented 1 year ago

the pip uninstall grpcio/conda install gprcio trick didn't work for me. Also having issues under slurm (outside of slurm it seems to be ok)

dlwh commented 1 year ago

can confirm that downgrading to ray 1.13 fixes the issue

kaare-mikkelsen commented 1 year ago

I also have this issue, working from a singularity container. Is there anything I should try to copy / print to help move the debugging forward?

kaare-mikkelsen commented 1 year ago

Also, for me it seems to be a problem related to some persistent files? my installation worked for a while, but when I tried to better utilize the node I was working on (adding more gpu's and cpu's), I started getting the error. Now, reverting to the old code, the error persists. I am now only using 1 GPU and 10 CPUs, so I doubt it's related to number of OpenBLAS threads.

mattip commented 1 year ago

Post what version of ray and how you installed it.
Post what versions of grpcio you are using and how you installed it.

kaare-mikkelsen commented 1 year ago

Hi Mattip

My version of ray is 2.4.0, and it was installed using pip. I do not have grpcio installed.

ray-project / ray