joaoareis opened this issue 1 year ago
@joaoareis - can you please share more information about your environment?
Have you found a solution?
Could you let me know which information I should share?
I think I've solved it by adding ray.init(num_cpus=56, num_gpus=2) right after importing ray, but it is still flaky.
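For reference, that workaround amounts to something like this minimal sketch (the num_cpus/num_gpus values are specific to that machine and would need adjusting):

import ray

# Declare resources explicitly instead of letting Ray autodetect them;
# this is the workaround described above, not a guaranteed fix.
ray.init(num_cpus=56, num_gpus=2)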
Hey @joaoareis - could you share the raylet log files in /tmp/ray/session_latest/raylet.out when you run into this issue?
Here they are: raylet.out.txt
Looks like some of the workers were launched but failed to start.
Could you try upgrading your grpcio?
And if you see any files with the python-core-worker- prefix in your /tmp/ray/session_latest/, could you also paste one or two of those logs here? That will help us understand what might have happened to the worker.
There you go; the two logs I checked looked identical: python-core-worker.log
I'll try to upgrade grpcio as well.
Interesting - from the core worker logs, it seems some OS resources are not available.
How many threads is a process allowed to create on your system? (I guess this could be obtained with something like cat /proc/sys/kernel/threads-max?)
Could you try running ulimit -n 65536 before starting the Python program?
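If running ulimit directly isn't an option, here is a hedged sketch of the equivalent check from Python via the standard resource module; raising the soft limit up to the existing hard limit does not require root, and it must run before ray.init() in the same process:

import resource

# Inspect the current open-file limits (roughly `ulimit -Sn` / `ulimit -Hn`).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open files: soft={soft}, hard={hard}")

# Raise the soft limit, capped at the hard limit; going beyond the hard
# limit would require elevated privileges.
resource.setrlimit(resource.RLIMIT_NOFILE, (min(65536, hard), hard))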
Hey! So:
bash-4.2$ cat /proc/sys/kernel/threads-max
1025624
I cannot run the ulimit as I don't have root/sudo access.
I see - I would check whether upgrading grpcio solves the issue, since that seems to be possible. We have occasionally seen issues with grpcio.
Hello, has this problem been solved? I'm getting the same error. I tried the suggestions mentioned above, but they didn't work.
$ cat /proc/sys/kernel/threads-max
4113181
$ ulimit -n
65536
Environment:
Python 3.8.17
Ray 2.5.0
grpcio 1.51.3
Try installing grpcio version 1.48.1; it worked for me. My environment is as follows:
CentOS 7
Python 3.7.11
Ray 2.5.1
grpcio 1.48.1
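A small hedged sketch for double-checking which versions the interpreter actually sees (1.48.1 is just the pin that worked in the environment above, not an officially documented requirement; assumes Python 3.8+ for importlib.metadata):

from importlib.metadata import version  # stdlib in Python 3.8+

# Print the versions Ray will actually import in this environment;
# mismatched grpcio builds are a recurring theme in this thread.
for pkg in ("ray", "grpcio"):
    print(pkg, version(pkg))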
This issue still appears to persist. I attempted to reproduce @Vamsi995's environment (except on Rocky 8.8 instead of CentOS) with no success. This was after downgrading from ray==2.6.3 and grpcio==1.57.0.
This is how I imported and used ray.
I am getting the same issue with ray==2.7.1, grpcio==1.59.2, python==3.11.5 on Ubuntu 20.04. Even if I specify the number of CPUs and GPUs in ray.init, the function call still hangs.
I'm having the same issue running on Ubuntu 22.04.3 (inside Docker) with ray==2.10 (latest version), grpcio==1.62.1 (latest version), python==3.10.12.
cat /proc/sys/kernel/threads-max -> 127772
ulimit -n 65536 -> no difference
It's just failing for me, not hanging:
2024-04-01 20:10:20,206 INFO worker.py:1752 -- Started a local Ray instance.
[2024-04-01 20:10:20,501 E 122 122] core_worker.cc:228: Failed to register worker 01000000ffffffffffffffffffffffffffffffffffffffffffffffff to Raylet. IOError: [RayletClient] Unable to register worker with raylet. No such file or directory
cat /tmp/ray/session_latest/logs/python-core-driver-01000000ffffffffffffffffffffffffffffffffffffffffffffffff_686.log
[2024-04-01 20:15:54,469 I 686 686] core_worker_process.cc:107: Constructing CoreWorkerProcess. pid: 686
[2024-04-01 20:15:54,485 I 686 686] io_service_pool.cc:35: IOServicePool is running with 1 io_service.
[2024-04-01 20:15:54,846 E 686 686] core_worker.cc:228: Failed to register worker 01000000ffffffffffffffffffffffffffffffffffffffffffffffff to Raylet. IOError: [RayletClient] Unable to register worker with raylet. No such file or directory
EDIT: It works fine outside docker (I'm on an M1 macbook)
I am on python 3.10, ray 2.10, grpcio 1.62.1. It was working fine until I force-stopped a script after the instance had started and before (or shortly after) the first worker started. After that, import ray; ray.init() just hangs/fails. A fresh conda env doesn't help. Specifying num_cpus works up to 6.
Hi, I'm also facing the same issue. I'm using only one node and don't even need Ray, only vLLM, but it internally initializes a Ray session and gets stuck indefinitely here:
2024-04-16 19:43:51,045 ERROR services.py:1330 -- Failed to start the dashboard
2024-04-16 19:43:51,045 ERROR services.py:1355 -- Error should be written to 'dashboard.log' or 'dashboard.err'. We are printing the last 20 lines for you. See 'https://docs.ray.io/en/master/ray-observability/ray-logging.html#logging-directory-structure' to find where the log file is.
2024-04-16 19:43:51,045 ERROR services.py:1365 -- Couldn't read dashboard.log file. Error: [Errno 2] No such file or directory: '/tmp/ray/session_2024-04-16_19-43-09_468986_3298/logs/dashboard.log'. It means the dashboard is broken even before it initializes the logger (mostly dependency issues). Reading the dashboard.err file which contains stdout/stderr.
2024-04-16 19:43:51,045 ERROR services.py:1399 -- Failed to read dashboard.err file: cannot mmap an empty file. It is unexpected. Please report an issue to Ray github. https://github.com/ray-project/ray/issues
2024-04-16 19:43:53,550 INFO worker.py:1752 -- Started a local Ray instance.
Is there some way to disable ray in only vLLM scripts or mitigate this issue?
For vLLM, what worked for me was uninstalling grpcio, ray, and vllm, and then reinstalling the latest version of vllm (==0.4.3, which automatically installs ray==2.24.0). Hope that helps! @arshiya031196
The same issue occurred in ray 2.31.0 when I executed ray.init(num_cpus=48) in a Jupyter notebook within a conda environment. Setting num_cpus=2 works, but it doesn't fully utilize the hardware.
What happened + What you expected to happen
Running the following snippet will hang indefinitely
Sometimes it will fail instead
Versions / Dependencies
Python 3.9.13
Ray 2.2.0 (installed with pip install --upgrade ray[rllib])
grpcio 1.43.0
OS: CentOS Linux 7

Reproduction script
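As a hypothetical stand-in for the snippet (not the author's original script), a minimal sketch of the pattern being described - start Ray, then call a trivial remote function:

import ray

# Hypothetical minimal reproduction, not the author's original script.
ray.init()  # reported to hang here, or to fail registering with the raylet

@ray.remote
def ping():
    return "pong"

# When init does succeed, the first remote call is the other place a hang shows up.
print(ray.get(ping.remote()))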
Issue Severity
High: It blocks me from completing my task.