ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0
33.7k stars 5.73k forks

"global_state_accessor.cc:539: Some processes that the driver needs to connect to have not registered with GCS, so retrying. Have you run 'ray start' on this node?" #47925

Open ianmaddox opened 3 weeks ago

ianmaddox commented 3 weeks ago

What happened + What you expected to happen

I'm having significant difficulty connecting to any Ray head server. I launched a server built from pip at version 2.37.0, and when that didn't work, I launched another using the AWS "ray up" approach from the documentation, which installed version 2.30.0. In both cases, I get the following error when trying to connect with a client:

Can't find a `node_ip_address.json` file from /tmp/ray/session_2024-10-07_21-57-36_582530_3941. Have you started Ray instance using `ray start` or `ray.init`?

I found that I can navigate to the path specified and create node_ip_address.json and put the IP of the head server in there to get past this error:

{"node_ip_address": "my.head.node.ip"}

However, once that is resolved, I get stuck on the following error and can't get past it:

global_state_accessor.cc:539: Some processes that the driver needs to connect to have not registered with GCS, so retrying. Have you run 'ray start' on this node?

This message is printed repeatedly for 30 seconds with no way to cancel or abort unless you kill the process.

I've tried connecting with cleanly installed clients on multiple machines at both version 2.37.0 and 2.30.0 from Ubuntu boxes with Python 3.10.12.

I've confirmed that the dashboard on both head instances I've launched works fine. Port 6379 is open. The firewall does not restrict any traffic between the local node and head machine.
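For anyone reproducing this, the "port is open" check can be scripted rather than eyeballed. The following is a minimal sketch using only the Python standard library; `port_is_open` is a hypothetical helper name, not part of Ray. A successful TCP connect only shows that something is listening on that port. It does not show which Ray service is behind it or whether a driver can actually attach.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper for illustration; it checks reachability only,
    not which service is listening.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and DNS resolution failures.
        return False


# Example call (6379 is the port from this report):
# port_is_open("ray.myhost.com", 6379)
```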

Versions / Dependencies

Ubuntu 20.22 and 20.24
Python 3.10.12
Ray head: 2.30.0 and 2.37.0
Ray client node: 2.30.0 and 2.37.0

Reproduction script

This is the test script I'm using:

import ray

# Connect to the Ray head node on the EC2 instance
ray.init(address='ray.myhost.com:6379', _temp_dir='/tmp/ray/')

@ray.remote
def square(x):
    return x * x

# Run tasks on the remote Ray cluster
futures = [square.remote(i) for i in range(10)]
results = ray.get(futures)

print("Results from remote Ray cluster:", results)

Issue Severity

High: It blocks me from completing my task.

jjyao commented 1 week ago

Hi @ianmaddox, are you using Ray Client? We recommend using Ray job submission instead (https://docs.ray.io/en/latest/cluster/running-applications/job-submission/index.html). If you do want to use Ray Client, make sure you provide the right address (https://docs.ray.io/en/latest/cluster/running-applications/job-submission/ray-client.html). 6379 is the GCS port, not the Ray Client port.
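To make the address point concrete, here is a minimal sketch of the reproduction script adapted for Ray Client. It assumes the head node runs the Ray Client server on its default port, 10001, and that the address uses the `ray://` scheme described in the Ray Client docs. If the head was started with a different client server port, substitute that value. `ray_client_address` and `run_squares` are hypothetical helper names, and the `_temp_dir` argument from the original script is dropped, since a remote client does not read the head node's session directory.

```python
def ray_client_address(host: str, port: int = 10001) -> str:
    """Build a Ray Client address of the form ray://<host>:<port>.

    10001 is assumed here as the default Ray Client server port;
    6379 (used in the original script) is the GCS port.
    """
    return f"ray://{host}:{port}"


def run_squares(address: str) -> list:
    """Connect through Ray Client and run the same tasks as the original script."""
    import ray  # imported lazily so the helper above works without Ray installed

    ray.init(address=address)

    @ray.remote
    def square(x):
        return x * x

    try:
        return ray.get([square.remote(i) for i in range(10)])
    finally:
        # Disconnect the client whether or not the tasks succeeded.
        ray.shutdown()


# Example call (requires a reachable head node):
# print(run_squares(ray_client_address("ray.myhost.com")))
```

Ray Client generally expects the client and the cluster to run matching Ray and Python versions, so it is worth confirming both before retrying.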