ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0
33.29k stars 5.63k forks source link

[Core] ray.init() stuck at "Started a local Ray instance." #37373

Closed AvisP closed 2 months ago

AvisP commented 1 year ago

What happened + What you expected to happen

When I am trying to do a ray.init(logging_level='debug') on a windows server it gets stuck at Started a local Ray instance

Versions / Dependencies

Ray version[RLLIB]: 2.5.1 OS: Windows 10 Python: 3.8.5

Reproduction script

I am just doing

import ray ray.init(logging_level='debug')

Debug message I am getting is

2023-07-12 17:07:30,736 DEBUG worker.py:1412 -- Could not import resource module (on Windows)
2023-07-12 17:07:37,350 DEBUG node.py:1149 -- Process STDOUT and STDERR is being redirected to C:\Users\paula\AppData\Local\Temp\3\ray\session_2023-07-12_17-07-37_304650_56384\logs.
2023-07-12 17:07:37,473 DEBUG node.py:605 -- Failed to send request to gcs, reconnecting. Error failed to connect to all addresses
2023-07-12 17:07:41,530 DEBUG node.py:1187 -- Process STDOUT and STDERR is being redirected to C:\Users\paula\AppData\Local\Temp\3\ray\session_2023-07-12_17-07-37_304650_56384\logs.
2023-07-12 17:07:41,820 DEBUG services.py:1894 -- Determine to start the Plasma object store with 72.38 GB memory using C:\Users\paula\AppData\Local\Temp\3.
2023-07-12 17:07:42,004 INFO worker.py:1636 -- Started a local Ray instance.

Issue Severity

High: It blocks me from completing my task.

rkooo567 commented 1 year ago

cc @mattip can you follow up and triage?

mattip commented 1 year ago

What CPU and how many cores does it have? How much memory? Does running in regular mode, without debug work?

AvisP commented 1 year ago

It is a server with Intel Xenon Gold 6230 CPU @ 2.1 Ghz (80 cores as seen on Task manager, maybe because of multi threading it is showing more) with 384 GB of ram. I tried doing a ray.init() and getting the following

Traceback (most recent call last):
  File "F:\........\venvs\G2OP_RLLIB\lib\site-packages\ray\_private\node.py", line 293, in __init__
    ray._private.services.wait_for_node(
  File "F:\.......\venvs\G2OP_RLLIB\lib\site-packages\ray\_private\services.py", line 459, in wait_for_node
    raise TimeoutError(
TimeoutError: Timed out after 60 seconds while waiting for node to startup. Did not find socket name tcp://127.0.0.1:58449 in the list of object store socket names.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "F:\.......\venvs\G2OP_RLLIB\lib\site-packages\ray\_private\client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "F:\.......\venvs\G2OP_RLLIB\lib\site-packages\ray\_private\worker.py", line 1534, in init
    _global_node = ray._private.node.Node(
  File "F:\.......\venvs\G2OP_RLLIB\lib\site-packages\ray\_private\node.py", line 298, in __init__
    raise Exception(
Exception: The current node timed out during startup. This could happen because some of the Ray processes failed to startup.
mattip commented 1 year ago

80 cores as seen on Task manager

Maybe connected to the large number of cores taking too long to spin up all the processes on windows. In order to test that hypothesis, could you try ray.init(num_cpus=N) and figure out how large N can be before the function fails? I do not have access to such a large machine.

AvisP commented 1 year ago

Okay so I did a ray.init(num_cpus=2) and got the following message and the terminal froze. The grpcio version I have is 1.56.0

2023-07-26 10:48:42,392 INFO worker.py:1636 -- Started a local Ray instance.
RayContext(dashboard_url='', python_version='3.8.5', ray_version='2.5.1', ray_commit='a03efd9931128d387649dd48b0e4864b43d3bfb4', address_info={'node
_ip_address': '127.0.0.1', 'raylet_ip_address': '127.0.0.1', 'redis_address': None, 'object_store_address': 'tcp://127.0.0.1:65389', 'raylet_socket_
name': 'tcp://127.0.0.1:65442', 'webui_url': '', 'session_dir': 'C:\\Users\\......\\AppData\\Local\\Temp\\6\\ray\\session_2023-07-26_10-48-36_478614_
41080', 'metrics_export_port': 65526, 'gcs_address': '127.0.0.1:65386', 'address': '127.0.0.1:65386', 'dashboard_agent_listen_port': 52365, 'node_id
': 'd1b0c0601ebf26445176b6bb12a84bf4bc1a49f1c3414177d2b0e103'})
>>> (raylet) [2023-07-26 10:48:45,736 E 51140 31988] (raylet.exe) agent_manager.cc:135: The raylet exited immediately because the Ray agent failed.
The raylet fate shares with the agent. This can happen because the Ray agent was unexpectedly killed or failed. Agent can fail when
(raylet) - The version of `grpcio` doesn't follow Ray's requirement. Agent can segfault with the incorrect `grpcio` version. Check the grpcio versio
n `pip freeze | grep grpcio`.
(raylet) - The agent failed to start because of unexpected error or port conflict. Read the log `cat /tmp/ray/session_latest/logs/dashboard_agent.lo
g`. You can find the log file structure here https://docs.ray.io/en/master/ray-observability/ray-logging.html#logging-directory-structure.
(raylet) - The agent is killed by the OS (e.g., out of memory).
(raylet) *** SIGTERM received at time=1690382925 ***
(raylet)     @   00007FF61E2B0F6C  (unknown)  (unknown)
(raylet)     @   00007FF61E2B03DE  (unknown)  (unknown)
(raylet)     @   00007FF8E27B268A  (unknown)  o_exp
(raylet)     @   00007FF8E5FA7AC4  (unknown)  BaseThreadInitThunk
(raylet)     @   00007FF8E627A351  (unknown)  RtlUserThreadStart
(raylet) [2023-07-26 10:48:45,751 E 51140 31988] (raylet.exe) logging.cc:361: *** SIGTERM received at time=1690382925 ***
(raylet) [2023-07-26 10:48:45,751 E 51140 31988] (raylet.exe) logging.cc:361:     @   00007FF61E2B0F6C  (unknown)  (unknown)
(raylet) [2023-07-26 10:48:45,751 E 51140 31988] (raylet.exe) logging.cc:361:     @   00007FF61E2B03DE  (unknown)  (unknown)
(raylet) [2023-07-26 10:48:45,751 E 51140 31988] (raylet.exe) logging.cc:361:     @   00007FF8E27B268A  (unknown)  o_exp
(raylet) [2023-07-26 10:48:45,751 E 51140 31988] (raylet.exe) logging.cc:361:     @   00007FF8E5FA7AC4  (unknown)  BaseThreadInitThunk
(raylet) [2023-07-26 10:48:45,751 E 51140 31988] (raylet.exe) logging.cc:361:     @   00007FF8E627A351  (unknown)  RtlUserThreadStart
2023-07-26 10:48:59,384 WARNING worker.py:2019 -- The node with node id: d1b0c0601ebf26445176b6bb12a84bf4bc1a49f1c3414177d2b0e103 and address: 127.0
.0.1 and node name: 127.0.0.1 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a  (1) raylet c
rashes unexpectedly (OOM, preempted node, etc.)
        (2) raylet has lagging heartbeats due to slow network or busy workload.
xieus commented 1 year ago

@mattip Could you pls provide an update on this issue? Thanks.

mattip commented 1 year ago

If ray 2.7 moves us to a world without python grpcio, we can revisit the issues that have

The version of grpcio doesn't follow Ray's requirements

in the logs

mattip commented 1 year ago

There are quite a few similar errors, I think they all using a non-compliant version of grpcio

mattip commented 11 months ago

@AvisP does this still happen on ray 2.8?

AvisP commented 10 months ago

@mattip For the windows server with lots of cores I got this when I did a ray.init(num_cpus=2) on python 3.10.11

2023-11-22 16:47:09,371 INFO worker.py:1673 -- Started a local Ray instance.
RayContext(dashboard_url='', python_version='3.10.11', ray_version='2.8.0', ray_commit='105355bd253d6538ed34d331f6a4bdf0e38ace3a', protocol_version=None)
>>> (raylet) [2023-11-22 16:47:12,284 E 68664 94064] (raylet.exe) agent_manager.cc:70: The raylet exited immediately because one Ray agent failed, agent_name = runtime_env_agent.
(raylet) The raylet fate shares with the agent. This can happen because
(raylet) - The version of `grpcio` doesn't follow Ray's requirement. Agent can segfault with the incorrect `grpcio` version. Check the grpcio version `pip freeze | grep grpcio`.
(raylet) - The agent failed to start because of unexpected error or port conflict. Read the log `cat /tmp/ray/session_latest/logs/{dashboard_agent|runtime_env_agent}.log`. You can find the log file structure here https://docs.ray.io/en/master/ray-observability/ray-logging.html#logging-directory-structure.
(raylet) - The agent is killed by the OS (e.g., out of memory).
(raylet) *** SIGTERM received at time=1700689632 ***
(raylet)     @   00007FF6A41785C6  (unknown)  (unknown)
(raylet)     @   00007FF6A418EE86  (unknown)  (unknown)
(raylet)     @   00007FF6A418E5BE  (unknown)  (unknown)
(raylet)     @   00007FF8C242268A  (unknown)  o_exp
(raylet)     @   00007FF8C2797AC4  (unknown)  BaseThreadInitThunk
(raylet)     @   00007FF8C551A351  (unknown)  RtlUserThreadStart
(raylet) [2023-11-22 16:47:12,303 E 68664 94064] (raylet.exe) logging.cc:361: *** SIGTERM received at time=1700689632 ***
(raylet) [2023-11-22 16:47:12,303 E 68664 94064] (raylet.exe) logging.cc:361:     @   00007FF6A41785C6  (unknown)  (unknown)
(raylet) [2023-11-22 16:47:12,303 E 68664 94064] (raylet.exe) logging.cc:361:     @   00007FF6A418EE86  (unknown)  (unknown)
(raylet) [2023-11-22 16:47:12,303 E 68664 94064] (raylet.exe) logging.cc:361:     @   00007FF6A418E5BE  (unknown)  (unknown)
(raylet) [2023-11-22 16:47:12,303 E 68664 94064] (raylet.exe) logging.cc:361:     @   00007FF8C242268A  (unknown)  o_exp
(raylet) [2023-11-22 16:47:12,303 E 68664 94064] (raylet.exe) logging.cc:361:     @   00007FF8C2797AC4  (unknown)  BaseThreadInitThunk
(raylet) [2023-11-22 16:47:12,303 E 68664 94064] (raylet.exe) logging.cc:361:     @   00007FF8C551A351  (unknown)  RtlUserThreadStart
2023-11-22 16:47:26,277 WARNING worker.py:2074 -- The node with node id: 4fa93e7451f6a84361e9c2a042e02e8cfe57ae10e06bc9cf428a4ddf and address: 127.0.0.1 and node name: 127.0.0.1 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a        (1) raylet crashes unexpectedly (OOM, preempted node, etc.)
        (2) raylet has lagging heartbeats due to slow network or busy workload.

Then I interrupted using keyboard and did a ray.shutdown() and tried initializing again and then I got

2023-11-22 16:48:22,439 INFO worker.py:1673 -- Started a local Ray instance.
RayContext(dashboard_url='', python_version='3.10.11', ray_version='2.8.0', ray_commit='105355bd253d6538ed34d331f6a4bdf0e38ace3a', protocol_version=None)

Although it works fine on windows laptop with smaller number of cores.

For python 3.11 I can't even install ray on a windows based environment due to this issue https://github.com/ray-project/ray/issues/39727 https://github.com/ray-project/ray/issues/38300

mattip commented 9 months ago

I am intrigued by the The version ofgrpciodoesn't follow Ray's requirement in the log. How exactly are you installing the software stack? What is pulling in a non-compliant version of grpcio? Perhaps you are seeing some stall due to some other package, like NumPy, that tries to open worker threads on every core?

anyscalesam commented 9 months ago

@AvisP can you provide more details to @mattip Q above?

AvisP commented 8 months ago

I think I just did pip install ray, I wanted to confirm by testing on the server with multiple cores running Windows but I can get access end of next week and provide more info on what I see during installation.

mattip commented 7 months ago

Any update of how you installed the software?

AvisP commented 7 months ago

Sorry for late response. So I created a new virtual environment with python 3.10.11 and did a pip install ray which installed version 2.9.2. Then I did a import ray and then a ray.init(num_cpus=4) and it gave the following error message

>>> ray.init(num_cpus=4)
Traceback (most recent call last):
  File "F:\AvisPaul\venvs\ray_3_10_test\lib\site-packages\ray\_private\node.py", line 312, in __init__
    ray._private.services.wait_for_node(
  File "F:\AvisPaul\venvs\ray_3_10_test\lib\site-packages\ray\_private\services.py", line 463, in wait_for_node
    raise TimeoutError(
TimeoutError: Timed out after 60 seconds while waiting for node to startup. Did not find socket name tcp://127.0.0.1:63251 in the list of object store socket names.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "F:\AvisPaul\venvs\ray_3_10_test\lib\site-packages\ray\_private\client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "F:\AvisPaul\venvs\ray_3_10_test\lib\site-packages\ray\_private\worker.py", line 1618, in init
    _global_node = ray._private.node.Node(
  File "F:\AvisPaul\venvs\ray_3_10_test\lib\site-packages\ray\_private\node.py", line 317, in __init__
    raise Exception(
Exception: The current node timed out during startup. This could happen because some of the Ray processes failed to startup.

I exited the python environment and tried it again and it seemed to work with the following message

2024-02-21 15:29:34,099 INFO worker.py:1724 -- Started a local Ray instance.
RayContext(dashboard_url='', python_version='3.10.11', ray_version='2.9.2', ray_commit='fce7a361807580953364e2da964f9498f3123bf9', protocol_version=None)

However if I do only ray.init() was leaving it at 2024-02-21 15:33:16,903 INFO worker.py:1724 -- Started a local Ray instance. indefinitely. So I did a ray.init(logging_level='debug') and this is what I got

2024-02-21 15:33:11,105 DEBUG worker.py:1483 -- Could not import resource module (on Windows)
2024-02-21 15:33:13,029 DEBUG node.py:1303 -- Process STDOUT and STDERR is being redirected to C:\Users\paula\AppData\Local\Temp\ray\session_2024-02-21_15-33-12_966592_144860\logs.
2024-02-21 15:33:16,404 DEBUG node.py:1332 -- Process STDOUT and STDERR is being redirected to C:\Users\paula\AppData\Local\Temp\ray\session_2024-02-21_15-33-12_966592_144860\logs.
2024-02-21 15:33:16,410 DEBUG npu.py:58 -- Could not import AscendCL: No module named 'acl'
2024-02-21 15:33:16,411 DEBUG tpu.py:119 -- Failed to detect number of TPUs: [WinError 3] The system cannot find the path specified: '/dev/vfio'
2024-02-21 15:33:16,624 DEBUG services.py:2064 -- Determine to start the Plasma object store with 92.53 GB memory using C:\Users\paula\AppData\Local\Temp.
2024-02-21 15:33:16,903 INFO worker.py:1724 -- Started a local Ray instance.
mattip commented 7 months ago

Where did you get python, ~how are you installing ray~? We have seen some problems with the python from the windows app store.

AvisP commented 6 months ago

I downloaded the installer for 3.10.11 from here. If there is another version that you know has no issues I can try that

mattip commented 6 months ago

No, that build should work well, but the first time you run it you will have to allow firewall exceptions via a giu pop-up box. Did you see that?

mattip commented 6 months ago

The last log snippet on your last attempt looks like it is all working: Started a local Ray instance.. Could you summarize for me where this all stands? I am confused as to what works and what doesn't.

albert-ying commented 4 months ago

I got a similar problem:

Python 3.11.9 (main, Apr 19 2024, 16:48:06) [GCC 11.2.0]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.24.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import ray

In [2]: ray.init(num_cpus=2, logging_level="debug")
2024-05-15 16:15:24,958 DEBUG node.py:293 -- Setting node ID to 2e3620292cacbc0db97e26bbb3f3fbd3e3733955eb3b7f3aa12b71fb
2024-05-15 16:15:24,959 DEBUG node.py:1344 -- Process STDOUT and STDERR is being redirected to /tmp/ray/session_2024-05-15_16-15-24_957954_661582/logs.
2024-05-15 16:15:26,363 DEBUG node.py:1373 -- Process STDOUT and STDERR is being redirected to /tmp/ray/session_2024-05-15_16-15-24_957954_661582/logs.
2024-05-15 16:15:29,428 DEBUG npu.py:60 -- Could not import AscendCL: No module named 'acl'
2024-05-15 16:15:29,430 DEBUG services.py:2077 -- Determine to start the Plasma object store with 78.2 GB memory using /dev/shm.
2024-05-15 16:15:30,588 INFO worker.py:1749 -- Started a local Ray instance.
Out[2]: RayContext(dashboard_url='', python_version='3.11.9', ray_version='2.22.0', ray_commit='a8ab7b87a59b833242fe3564c61d7b39265cd838')

In [3]: ray.shutdown()

2 cores works

In [4]: ray.init(num_cpus=5, logging_level="debug")
2024-05-15 16:15:50,060 DEBUG node.py:293 -- Setting node ID to 97a9a7a0d6b57dbb9c182dd042fa56b7aae8c9f04dd61fa1ac1b0a7c
2024-05-15 16:15:50,061 DEBUG node.py:1344 -- Process STDOUT and STDERR is being redirected to /tmp/ray/session_2024-05-15_16-15-50_060016_661582/logs.
2024-05-15 16:15:51,436 DEBUG node.py:1373 -- Process STDOUT and STDERR is being redirected to /tmp/ray/session_2024-05-15_16-15-50_060016_661582/logs.
2024-05-15 16:15:54,531 DEBUG npu.py:60 -- Could not import AscendCL: No module named 'acl'
2024-05-15 16:15:54,532 DEBUG services.py:2077 -- Determine to start the Plasma object store with 78.21 GB memory using /dev/shm.
2024-05-15 16:15:55,588 INFO worker.py:1749 -- Started a local Ray instance.

This hang forever.

mattip commented 4 months ago

@albert-ying for completeness: what machine are you running this on, how many CPUs does it have?

albert-ying commented 4 months ago

@mattip Arch Linux x86_64, 112 cores

zPz-002 commented 3 months ago

80 cores as seen on Task manager

Maybe connected to the large number of cores taking too long to spin up all the processes on windows. In order to test that hypothesis, could you try ray.init(num_cpus=N) and figure out how large N can be before the function fails? I do not have access to such a large machine.

I have a similar problem, and this is useful in my issue. But I think it's not friendly enough that it only shows "INFO worker.py:1794 -- Started a local Ray instance" when ray.init() is executing. How can I get more information to ensure the program is running normally?

mattip commented 3 months ago

How can I get more information

Either ray status or open the dashboard. But note #45578 which is meant to fix the dashboard reporting on windows

mattip commented 2 months ago

Closing this. Please reopen or open a new issue if the problem reappears

cometta commented 2 months ago

i also encountered this hang issue. i did test @albert-ying's method, test on python interpreter shell, by increasing the num_cpus, i don't get any error. does Started a local Ray instance needs to download anything from the internet, can this related to URL blocked?

mattip commented 2 months ago

@cometta This issue is closed, please open a new one with