Open CansuCandan opened 1 year ago
Did you make sure all the ports are open based on https://docs.ray.io/en/master/ray-core/configure.html#ports-configurations?
Normally this happens because of connectivity issues.
@CansuCandan - can you confirm that your system has the appropriate connectivity?
Hi, @hora-anyscale and @rkooo567
Many thanks for your answer.
Yes, I realized that this problem is caused by connectivity issues.
Our security team opened ports based on https://docs.ray.io/en/master/ray-core/configure.html#ports-configurations, but that wasn't enough.
The CentOS machine cannot be added to the cluster because it is in a different WLAN, while the other machines share the same WLAN. The firewall between the two networks restricts ports, and that is where the connection gets stuck.
Linux Machines:
x.x.x.x => Head Node => Centos
x.x.x.x => Worker Node => Ubuntu
x.x.x.x => Worker Node => Ubuntu
However, when I examine the source and destination IPs and ports, the destination ports on the machines added as worker nodes keep changing. Even with the main ports used by Ray opened between these three machines, the connection still fails because the ports on the worker node IPs change.
When I checked with netstat -tulnap | grep worker_node_ip
Then I stopped the Ray cluster and set it up again on the same machines.
When I checked with netstat -tulnap | grep worker_node_ip
As a result, when all ports are open between these three machines, machines in other WLANs can be added to the same cluster. For security reasons, we do not want all ports to be open between these three machines. But when we open only the ports used by Ray (https://docs.ray.io/en/master/ray-core/configure.html#ports-configurations), the CentOS machine cannot be added to the cluster.
Thank you.
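One way to narrow down which ports the firewall actually blocks is to attempt a TCP connection to each of them from the other machine. The following is a minimal sketch using only the Python standard library, not a Ray tool. The function name `check_tcp_ports` is hypothetical, and the host and port list in the usage part are placeholders: replace them with the head node (or worker node) IP and the ports opened in your firewall.

```python
import socket
from typing import Dict, Iterable


def check_tcp_ports(host: str, ports: Iterable[int], timeout: float = 2.0) -> Dict[int, bool]:
    """Return {port: True/False}: True if a TCP connection to host:port succeeds."""
    results = {}
    for port in ports:
        try:
            # create_connection raises OSError on refusal, timeout, or no route.
            with socket.create_connection((host, port), timeout=timeout):
                results[port] = True
        except OSError:
            results[port] = False
    return results


if __name__ == "__main__":
    # Placeholder values: use the remote node's IP and the ports you opened.
    remote_ip = "127.0.0.1"
    ports_to_test = [6379, 8076, 9090, 10001]
    for port, ok in check_tcp_ports(remote_ip, ports_to_test).items():
        print(f"{port}: {'reachable' if ok else 'blocked or closed'}")
```

A port reported as blocked or closed can mean either that the firewall drops the traffic or that nothing is listening on it, so it helps to run the check while the Ray processes are up and compare against the netstat output on the target machine.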
Aren't the ports changing because they fall within the worker port range?
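If the changing ports do come from the worker port range, one option is to pin that range (and the other per-node ports) when starting each node, so the firewall rule can cover a fixed set. The sketch below only assembles the command line. The flags `--node-manager-port`, `--object-manager-port`, `--min-worker-port` and `--max-worker-port` are `ray start` options, but the helper name and every port number here are arbitrary placeholders, and the ports configuration page linked above lists further components that may need pinning as well.

```python
import shlex
from typing import List


def build_worker_start_command(
    head_address: str,
    node_manager_port: int,
    object_manager_port: int,
    min_worker_port: int,
    max_worker_port: int,
) -> List[str]:
    """Assemble a `ray start` command that pins the per-node ports."""
    if min_worker_port > max_worker_port:
        raise ValueError("min_worker_port must not exceed max_worker_port")
    return [
        "ray", "start",
        f"--address={head_address}",
        f"--node-manager-port={node_manager_port}",
        f"--object-manager-port={object_manager_port}",
        f"--min-worker-port={min_worker_port}",
        f"--max-worker-port={max_worker_port}",
    ]


# Placeholder values: x.x.x.x stands for the head node IP, as in the thread.
cmd = build_worker_start_command("x.x.x.x:6379", 8077, 8076, 10002, 10101)
print(shlex.join(cmd))
```

The worker range should be wide enough for the number of worker processes expected on the node; the firewall then needs to allow exactly the pinned ports plus that range between all three machines.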
What happened + What you expected to happen
Hi,
I have an issue and I don't know whether it is a bug. I am setting up a cluster. First, I set a CentOS machine as the head node, with two Ubuntu machines as worker node 1 and worker node 2. But when I deploy code that simply consumes resources, the CentOS machine does not consume any. Also, when I check with ray status, the CentOS machine's CPU resources are not added to the cluster.
The same happens between CentOS machines (the CentOS CPU resources are not added to the cluster).
I could only combine CPU resources between machines running the Ubuntu 18.04 Linux distribution.
Ray version = 2.1.0 Python version = 3.8.8
The Ray and Python versions are the same on all machines.
Why is this happening?
Please let me know.
Machines:
1. Cluster structure (this worked)
Linux Machines:
x.x.x.x => Head Node => Ubuntu 18.04.4 LTS
x.x.x.x => Worker Node => Ubuntu 18.04.4 LTS
x.x.x.x => Worker Node => Ubuntu 18.04.4 LTS
2. Cluster structure (in this scenario, the Ubuntu CPU resources are added to the cluster, but the CentOS CPU resources are not)
Linux Machines:
x.x.x.x => Head Node => Centos
x.x.x.x => Worker Node => Ubuntu
x.x.x.x => Worker Node => Ubuntu
3. Cluster structure (in this scenario, only the head node is added to the cluster; the CPUs of the other machines are not)
Linux Machines:
x.x.x.x => Head Node => CentOS-7
x.x.x.x => Worker Node => CentOS-7
x.x.x.x => Worker Node => CentOS-7
Versions / Dependencies
1. Cluster structure (this worked)
Linux Machines:
x.x.x.x => Head Node => Ubuntu 18.04.4 LTS
x.x.x.x => Worker Node => Ubuntu 18.04.4 LTS
x.x.x.x => Worker Node => Ubuntu 18.04.4 LTS
Python Version = 3.8.8
Ray version = 2.1.0
2. Cluster structure (in this scenario, the Ubuntu CPU resources are added to the cluster, but the CentOS CPU resources are not)
Linux Machines:
x.x.x.x => Head Node => Centos
x.x.x.x => Worker Node => Ubuntu
x.x.x.x => Worker Node => Ubuntu
Python Version = 3.8.8
Ray version = 2.1.0
3. Cluster structure (in this scenario, only the head node is added to the cluster; the CPUs of the other machines are not)
Linux Machines:
x.x.x.x => Head Node => CentOS-7
x.x.x.x => Worker Node => CentOS-7
x.x.x.x => Worker Node => CentOS-7
Python Version = 3.8.8
Ray version = 2.1.0
Reproduction script
Python code that simply consumes resources:
import time

import numpy as np
import ray

# xxxx is the head node IP
ray.init(address='ray://xxxx:10001', runtime_env={'pip': ['numpy']})
print(ray.cluster_resources())


@ray.remote
def ray_task(id):
    # Multiply two large random matrices element-wise to keep a CPU busy.
    np.random.randn(10_000, 10_000) * np.random.randn(10_000, 10_000)
    print(f"done: {id}")


start = time.time()
ray.get([ray_task.remote(i + 1) for i in range(164)])
print(f"elapsed time: {time.time() - start}")
ray.shutdown()
Scripts
For head node:
ray start --head --port=6379 --object-manager-port=8076 --include-dashboard=true --dashboard-host=0.0.0.0 --dashboard-port=9090
For worker nodes:
ray start --address='x.x.x.x:6379' # x.x.x.x is the head node IP. I copied this from the "To connect to this Ray runtime from another node, run" step.
After running "python ray_test.py", I ran the "top" command on the CentOS machines and could not see any ray:task processes. On the Ubuntu machines I can see them.
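Besides watching top, the node list returned by `ray.nodes()` shows which machines actually joined the cluster and how many CPUs each one contributes. The sketch below formats such a list. The helper `summarize_nodes` is hypothetical, the sample addresses are made up, and the code relies only on the `NodeManagerAddress`, `Alive` and `Resources` keys of each node entry.

```python
from typing import Any, Dict, List


def summarize_nodes(nodes: List[Dict[str, Any]]) -> List[str]:
    """Format one line per node: address, liveness, and CPU count."""
    lines = []
    for node in nodes:
        address = node.get("NodeManagerAddress", "unknown")
        state = "alive" if node.get("Alive") else "dead"
        cpus = node.get("Resources", {}).get("CPU", 0)
        lines.append(f"{address}: {state}, CPU={cpus}")
    return lines


if __name__ == "__main__":
    # Made-up sample shaped like ray.nodes() entries, for illustration only.
    sample = [
        {"NodeManagerAddress": "10.0.0.1", "Alive": True, "Resources": {"CPU": 8.0}},
        {"NodeManagerAddress": "10.0.0.2", "Alive": False, "Resources": {}},
    ]
    print("\n".join(summarize_nodes(sample)))
    # On a machine that is part of the cluster, the real data would come from:
    #   import ray
    #   ray.init(address="auto")
    #   print("\n".join(summarize_nodes(ray.nodes())))
```

A node that is missing from the list, or listed as dead, never registered with the head node, which points back at connectivity rather than at task scheduling.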
Issue Severity
High: It blocks me from completing my task.