ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0
33.68k stars 5.72k forks

Ray Cluster: Failed to create a ray cluster using running container #45148

Open hahmad2008 opened 5 months ago

hahmad2008 commented 5 months ago

What happened + What you expected to happen

I am using ray==2.9.2 inside a running container, so I need to create a cluster using the following command:

docker exec -it MY_CONTAINER ray start --head --object-manager-port=8076 --node-manager-port=8077

I then got a message saying the head node of the cluster was created successfully. However, when I tried to check the cluster status, it failed:

docker exec -it MY_CONTAINER ray status

Traceback (most recent call last):
  File "python/ray/_raylet.pyx", line 3168, in ray._raylet.check_health
  File "python/ray/_raylet.pyx", line 580, in ray._raylet.check_status
ray.exceptions.RpcError: failed to connect to all addresses; last error: UNKNOWN: ipv4:11.1.1.111:6379: Failed to connect to remote host: Connection refused

What is the problem here?
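The traceback shows that `ray status` tried to reach 11.1.1.111:6379 and was refused. As a first diagnostic, the sketch below checks whether anything accepts TCP connections at that address. It uses only the Python standard library; `can_connect` is a hypothetical helper name, not part of Ray, and the host and port are simply copied from the error message. It only tells you whether the port is reachable from where the script runs, not why the connection is refused.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Address and port taken from the RpcError above (assumed to be the
    # address that `ray status` tries to contact).
    host, port = "11.1.1.111", 6379
    state = "reachable" if can_connect(host, port) else "not reachable"
    print(f"{host}:{port} is {state}")
```

Running this inside the container (for example via `docker exec`) and again from the host may help narrow down whether the port is closed everywhere or only unreachable at that particular address.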

Versions / Dependencies

ray==2.9.2

Reproduction script

The steps are the same as described above. Start the head node inside the running container:

docker exec -it MY_CONTAINER ray start --head --object-manager-port=8076 --node-manager-port=8077

Then check the cluster status, which fails with the RpcError traceback shown above:

docker exec -it MY_CONTAINER ray status

Issue Severity

High: It blocks me from completing my task.

jjyao commented 5 months ago

KubeRay (https://github.com/ray-project/kuberay) is the recommended way to run a Ray cluster in containers on Kubernetes. Can you try that?