ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0
33.89k stars 5.76k forks

[Core] ray.init() hangs/fails after "Started a local Ray instance." #31897

Open joaoareis opened 1 year ago

joaoareis commented 1 year ago

What happened + What you expected to happen

Running the following snippet will hang indefinitely

>>> import ray
>>> ray.init()
2023-01-24 11:44:47,741 INFO worker.py:1538 -- Started a local Ray instance.

Sometimes it will fail instead

[2023-01-24 11:50:22,050 E 31652 31652] core_worker.cc:179: Failed to register worker 01000000ffffffffffffffffffffffffffffffffffffffffffffffff to Raylet. IOError: [RayletClient] Unable to register worker with raylet. No such file or directory

Versions / Dependencies

Python 3.9.13
Ray 2.2.0 (installed with pip install --upgrade ray[rllib])
grpcio 1.43.0
OS: CentOS Linux 7

Reproduction script

import ray
ray.init()

Issue Severity

High: It blocks me from completing my task.

hora-anyscale commented 1 year ago

@joaoareis - can you please share more information about your environment?

CMode11 commented 1 year ago

Have you found a solution?

joaoareis commented 1 year ago

> @joaoareis - can you please share more information about your environment?

Could you let me know which information I should share?
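As a rough sketch of what could be shared, the snippet below gathers the same details other commenters in this thread reported (Python, Ray, grpcio, and OS versions). The function name `environment_report` is hypothetical and only for illustration; it relies solely on the Python standard library.

```python
import platform
from importlib import metadata


def environment_report(packages=("ray", "grpcio")):
    """Collect interpreter, OS, and package versions into a dict."""
    report = {
        "python": platform.python_version(),
        "os": platform.platform(),
    }
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Report missing packages explicitly instead of raising.
            report[name] = "not installed"
    return report


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```

Pasting this output together with the log files under /tmp/ray/session_latest/ would likely cover most of what is asked for later in the thread.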

> Have you found a solution?

I think I've solved it by adding a ray.init(num_cpus=56, num_gpus=2) right after importing ray, but it is still flaky.
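A minimal sketch of that workaround, passing explicit resource counts to `ray.init()`. The helper names `resource_kwargs` and `start_ray` are hypothetical, and capping `num_cpus` at `os.cpu_count()` is an assumption of this sketch, not something stated in the thread or required by Ray. As noted above, this workaround is reported as flaky, so it may not help in every environment.

```python
import os


def resource_kwargs(num_cpus, num_gpus=0):
    """Build explicit resource arguments for ray.init().

    Assumption of this sketch: never request more CPUs than the machine
    reports, and always request at least one.
    """
    available = os.cpu_count() or 1
    return {"num_cpus": max(1, min(num_cpus, available)), "num_gpus": num_gpus}


def start_ray(num_cpus, num_gpus=0):
    """Start a local Ray instance with explicit resource counts."""
    import ray  # imported lazily so resource_kwargs() works without Ray

    return ray.init(**resource_kwargs(num_cpus, num_gpus))


# Example, using the values from the comment above:
# start_ray(num_cpus=56, num_gpus=2)
```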

rickyyx commented 1 year ago

Hey @joaoareis - could you share the raylet log files in /tmp/ray/session_latest/raylet.out when you run into this issue?

joaoareis commented 1 year ago

Here they are: raylet.out.txt

rickyyx commented 1 year ago

Looks like some of the workers were launched but failed to finish starting up.

Could you try upgrading your grpcio? And if in your /tmp/ray/session_latest/ you see any files with python-core-worker- prefix, could you also paste one or two of those logs here? This will help understand what might have happened to the worker.
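One way to collect those logs is sketched below: it prints the last lines of up to two files whose names start with python-core-worker-. The function name `tail_worker_logs` is hypothetical. The default directory /tmp/ray/session_latest/logs is an assumption taken from a path shown in a later comment in this thread; pass a different `log_dir` if your files live elsewhere.

```python
import glob
import os
from collections import deque


def tail_worker_logs(log_dir="/tmp/ray/session_latest/logs",
                     prefix="python-core-worker-", max_files=2, lines=20):
    """Return {path: last `lines` lines} for up to `max_files` matching logs."""
    paths = sorted(glob.glob(os.path.join(log_dir, prefix + "*")))[:max_files]
    tails = {}
    for path in paths:
        # errors="replace" avoids crashing on stray non-UTF-8 bytes in a log.
        with open(path, errors="replace") as handle:
            tails[path] = list(deque(handle, maxlen=lines))
    return tails


if __name__ == "__main__":
    for path, tail in tail_worker_logs().items():
        print(f"==> {path} <==")
        print("".join(tail), end="")
```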

joaoareis commented 1 year ago

There you go, the two logs I checked looked identical. python-core-worker.log

I'll try to upgrade grpcio as well.

rickyyx commented 1 year ago

Interesting - from the core worker logs, it seems some OS resources are not available.

  1. How many threads is a process allowed to create on your system? (I guess this could be obtained through something like cat /proc/sys/kernel/threads-max.)

  2. Could you try running ulimit -n 65536 before starting the python program?
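Both checks can also be done from inside Python, sketched below with the standard `resource` module (Unix only). The function names are hypothetical. An unprivileged process can raise its soft open-file limit up to the hard limit without root, and child processes inherit the limit, so calling this before `ray.init()` is roughly equivalent to running `ulimit -n` first. Whether it resolves the hang is not established; some commenters in this thread report no difference.

```python
import resource


def read_threads_max(path="/proc/sys/kernel/threads-max"):
    """System-wide thread limit on Linux; None if the file is unavailable."""
    try:
        with open(path) as handle:
            return int(handle.read().strip())
    except (OSError, ValueError):
        return None


def raise_open_file_limit(target=65536):
    """Raise the soft RLIMIT_NOFILE toward `target`, never above the hard limit.

    Returns the (soft, hard) limits in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    wanted = target if hard == resource.RLIM_INFINITY else min(target, hard)
    if soft == resource.RLIM_INFINITY or soft >= wanted:
        return soft, hard  # already high enough; nothing to change
    try:
        resource.setrlimit(resource.RLIMIT_NOFILE, (wanted, hard))
    except (ValueError, OSError):
        pass  # the OS refused the new value; keep the current limits
    return resource.getrlimit(resource.RLIMIT_NOFILE)


if __name__ == "__main__":
    print("threads-max:", read_threads_max())
    print("open files (soft, hard):", raise_open_file_limit())
```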

joaoareis commented 1 year ago

Hey! So:

bash-4.2$ cat /proc/sys/kernel/threads-max
1025624

I cannot run the ulimit as I don't have root/sudo access.

rickyyx commented 1 year ago

I see - I would check whether upgrading grpcio solves the issue, since that seems to be possible. We have seen issues with grpcio occasionally.

KepingYan commented 1 year ago

Hello, is this problem solved? I also get the same error. I tried the suggestion mentioned above but it didn't work.

$ cat /proc/sys/kernel/threads-max
4113181
$ ulimit -n
65536

environment

Python 3.8.17
Ray 2.5.0
grpcio 1.51.3

Vamsi995 commented 1 year ago

Try installing grpcio version 1.48.1, it worked for me. My environment is as follows:

CentOS 7
Python 3.7.11
Ray 2.5.1
grpcio 1.48.1

AlexanderOllman commented 1 year ago

This issue still appears to persist. Attempted to run @Vamsi995's environment (except for Rocky 8.8 instead of CentOS) with no success.

This was after downgrading from ray==2.6.3 and grpcio==1.57.0.

Vamsi995 commented 1 year ago

[image]

This is how I imported and used ray.

man2machine commented 9 months ago

I am getting the same issue with ray==2.7.1, grpcio==1.59.2, python==3.11.5 on Ubuntu 20.04. Even if I specify the number of CPUs and GPUs in ray.init, the function call still hangs.

LeSphax commented 7 months ago

I'm having the same issue running on Ubuntu 22.04.3 (inside Docker) with ray==2.10 (latest version), grpcio==1.62.1 (latest version), and python==3.10.12.

cat /proc/sys/kernel/threads-max -> 127772
ulimit -n 65536 -> No difference

It's just failing for me, not hanging

2024-04-01 20:10:20,206 INFO worker.py:1752 -- Started a local Ray instance.
[2024-04-01 20:10:20,501 E 122 122] core_worker.cc:228: Failed to register worker 01000000ffffffffffffffffffffffffffffffffffffffffffffffff to Raylet. IOError: [RayletClient] Unable to register worker with raylet. No such file or directory
$ cat /tmp/ray/session_latest/logs/python-core-driver-01000000ffffffffffffffffffffffffffffffffffffffffffffffff_686.log
[2024-04-01 20:15:54,469 I 686 686] core_worker_process.cc:107: Constructing CoreWorkerProcess. pid: 686
[2024-04-01 20:15:54,485 I 686 686] io_service_pool.cc:35: IOServicePool is running with 1 io_service.
[2024-04-01 20:15:54,846 E 686 686] core_worker.cc:228: Failed to register worker 01000000ffffffffffffffffffffffffffffffffffffffffffffffff to Raylet. IOError: [RayletClient] Unable to register worker with raylet. No such file or directory

raylet.out.txt

EDIT: It works fine outside docker (I'm on an M1 macbook)

yhchong commented 6 months ago

I am on python 3.10, ray 2.10, grpcio 1.62.1. It was working fine until I force-stopped a script after the instance started and before (or shortly after) the first worker started. After that, import ray; ray.init() just hangs/fails. A fresh conda env doesn't work either. Specifying num_cpus works for values up to 6.

arshiya031196 commented 6 months ago

Hi, I'm also facing the same issue. I'm using only one node and don't even need ray, only vLLM but internally it initializes a ray session and gets stuck indefinitely here:

2024-04-16 19:43:51,045 ERROR services.py:1330 -- Failed to start the dashboard
2024-04-16 19:43:51,045 ERROR services.py:1355 -- Error should be written to 'dashboard.log' or 'dashboard.err'. We are printing the last 20 lines for you. See 'https://docs.ray.io/en/master/ray-observability/ray-logging.html#logging-directory-structure' to find where the log file is.
2024-04-16 19:43:51,045 ERROR services.py:1365 -- Couldn't read dashboard.log file. Error: [Errno 2] No such file or directory: '/tmp/ray/session_2024-04-16_19-43-09_468986_3298/logs/dashboard.log'. It means the dashboard is broken even before it initializes the logger (mostly dependency issues). Reading the dashboard.err file which contains stdout/stderr.
2024-04-16 19:43:51,045 ERROR services.py:1399 -- Failed to read dashboard.err file: cannot mmap an empty file. It is unexpected. Please report an issue to Ray github. https://github.com/ray-project/ray/issues
2024-04-16 19:43:53,550 INFO worker.py:1752 -- Started a local Ray instance.

Is there some way to disable ray in only vLLM scripts or mitigate this issue?

yangalan123 commented 5 months ago

For vLLM, what worked for me was uninstalling grpcio, ray and vllm, then re-installing the latest version of vllm (==0.4.3, which automatically installs ray==2.24.0). Hope that helps! @arshiya031196

ChenchenHu007 commented 4 months ago

The same issue occurred in ray 2.31.0 when I executed ray.init(num_cpus=48) in a Jupyter notebook inside a conda environment. Setting num_cpus=2 works, but doesn't fully utilize the hardware.
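Since several reports here say that small `num_cpus` values work while larger ones hang, one way to find the usable range is to try each value in a separate process with a timeout. The sketch below is only a diagnostic idea, not a fix; the function names and the 60-second timeout are hypothetical choices. It assumes a failed `ray.init()` makes the child process exit with a non-zero code, and a timed-out attempt may leave Ray processes behind that need cleaning up (for example with `ray stop`).

```python
import subprocess
import sys


def init_succeeds(num_cpus, timeout_s=60):
    """Run ray.init(num_cpus=...) in a child process.

    Returns False if the child exits non-zero or does not finish in time.
    """
    code = f"import ray; ray.init(num_cpus={int(num_cpus)}); ray.shutdown()"
    try:
        result = subprocess.run([sys.executable, "-c", code],
                                capture_output=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return False  # treated as a hang
    return result.returncode == 0


def largest_working(candidates, probe=init_succeeds):
    """Return the largest candidate for which probe() is true, or None."""
    best = None
    for value in sorted(candidates):
        if probe(value):
            best = value
    return best


# Example: largest_working([2, 6, 12, 24, 48])
```

Every candidate is tried rather than stopping at the first failure, because the thread does not establish that the failures are monotonic in `num_cpus`.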