ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

Ray Cluster Resources Issue #30780

Open CansuCandan opened 1 year ago

CansuCandan commented 1 year ago

What happened + What you expected to happen

Hi,

I have an issue, and I don't know whether it is a bug or not, so please advise. I am setting up a cluster. First, I set a CentOS machine as the head node, with worker node 1 and worker node 2 both running Ubuntu. But when I deployed code that simply consumes resources, the CentOS machine did not consume any. Also, when I checked with ray status, the CentOS CPU resources had not been added to the cluster.

The same thing happens between CentOS machines (the CentOS CPU resources are not added to the cluster).

I could only combine CPU resources between machines running Ubuntu 18.04.

Ray version = 2.1.0
Python version = 3.8.8

The Ray and Python versions are the same on all machines.

Why is this happening?

Please let me know.

Machines:

1. Cluster structure (this worked). Linux machines:
x.x.x.x => Head Node => Ubuntu 18.04.4 LTS
x.x.x.x => Worker Node => Ubuntu 18.04.4 LTS
x.x.x.x => Worker Node => Ubuntu 18.04.4 LTS

2. Cluster structure (in this scenario, the Ubuntu CPU resources are added to the cluster, but the CentOS CPU resources are not). Linux machines:
x.x.x.x => Head Node => CentOS
x.x.x.x => Worker Node => Ubuntu
x.x.x.x => Worker Node => Ubuntu

3. Cluster structure (in this scenario, only the head node is added to the cluster; the CPUs of the other machines are not added). Linux machines:
x.x.x.x => Head Node => CentOS-7
x.x.x.x => Worker Node => CentOS-7
x.x.x.x => Worker Node => CentOS-7

Versions / Dependencies

1. Cluster structure (this worked). Linux machines:
x.x.x.x => Head Node => Ubuntu 18.04.4 LTS
x.x.x.x => Worker Node => Ubuntu 18.04.4 LTS
x.x.x.x => Worker Node => Ubuntu 18.04.4 LTS
Python version = 3.8.8
Ray version = 2.1.0

2. Cluster structure (in this scenario, the Ubuntu CPU resources are added to the cluster, but the CentOS CPU resources are not). Linux machines:
x.x.x.x => Head Node => CentOS
x.x.x.x => Worker Node => Ubuntu
x.x.x.x => Worker Node => Ubuntu
Python version = 3.8.8
Ray version = 2.1.0

3. Cluster structure (in this scenario, only the head node is added to the cluster; the CPUs of the other machines are not added). Linux machines:
x.x.x.x => Head Node => CentOS-7
x.x.x.x => Worker Node => CentOS-7
x.x.x.x => Worker Node => CentOS-7
Python version = 3.8.8
Ray version = 2.1.0

Reproduction script

Python code that simply consumes resources:

import ray
import time
import numpy as np

ray.init(address='ray://xxxx:10001', runtime_env={'pip':['numpy']})  # xxxx is the head node IP

print(ray.cluster_resources())

@ray.remote
def ray_task(id):
    np.random.randn(10_000, 10_000) * np.random.randn(10_000, 10_000)
    print(f"done: {id}")

start = time.time()
ray.get([ray_task.remote(i+1) for i in range(164)])
print(f"elapsed time: {time.time()-start}")
ray.shutdown()

Scripts

For head node:
ray start --head --port=6379 --object-manager-port=8076 --include-dashboard=true --dashboard-host=0.0.0.0 --dashboard-port=9090

For worker nodes:
ray start --address='x.x.x.x:6379' # x.x.x.x is the head node IP. I copied this from the "To connect to this Ray runtime from another node, run" step.

After running "python ray_test.py", I ran the "top" command on the CentOS machines and could not see any ray:task processes there. But I can see them on the Ubuntu machines.
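
As a complement to watching "top" on each machine, the cluster itself can be asked which nodes it has registered. The following is a minimal sketch, not something from the original report: the helper names cpus_by_node and print_cluster_cpus are hypothetical, and it assumes ray.nodes() returns one dict per node with the "Alive", "NodeManagerAddress" and "Resources" keys.

```python
from typing import Dict, List


def cpus_by_node(nodes: List[dict]) -> Dict[str, float]:
    """Map each live node's address to the CPU count it registered."""
    summary = {}
    for node in nodes:
        if not node.get("Alive", False):
            continue  # skip nodes the cluster considers dead
        address = node.get("NodeManagerAddress", "unknown")
        # A node that joined without CPU resources is reported as 0.0.
        summary[address] = node.get("Resources", {}).get("CPU", 0.0)
    return summary


def print_cluster_cpus(address: str) -> None:
    """Connect to the cluster and print one line per live node (hypothetical helper)."""
    import ray  # imported lazily so cpus_by_node can be used without Ray installed

    ray.init(address=address)
    try:
        for node_address, cpus in cpus_by_node(ray.nodes()).items():
            print(f"{node_address}: {cpus} CPU")
    finally:
        ray.shutdown()
```

Calling print_cluster_cpus('ray://xxxx:10001') (xxxx being the head node IP, as in the script above) should list every machine that actually joined. A machine missing from the output never registered with the head node, which would point at connectivity rather than scheduling.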

Issue Severity

High: It blocks me from completing my task.

rkooo567 commented 1 year ago

Did you make sure all the ports are open based on https://docs.ray.io/en/master/ray-core/configure.html#ports-configurations?

rkooo567 commented 1 year ago

Normally this happens because of connectivity issues.
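
One rough way to test for this from a worker machine is to try plain TCP connections to the ports the head node was started with. This is only a sketch using the Python standard library; check_ports is a hypothetical helper name, and a successful connection only shows that the port is reachable through the firewall, not that Ray is healthy.

```python
import socket
from typing import Dict, Iterable


def check_ports(host: str, ports: Iterable[int], timeout: float = 2.0) -> Dict[int, bool]:
    """Return, for each port, whether a TCP connection to host:port succeeds."""
    results = {}
    for port in ports:
        try:
            # create_connection raises OSError on refusal, timeout, or no route.
            with socket.create_connection((host, port), timeout=timeout):
                results[port] = True
        except OSError:
            results[port] = False
    return results
```

For the setup in this issue, running check_ports('x.x.x.x', [6379, 8076, 9090, 10001]) on a worker node (with x.x.x.x as the head node IP) would cover the GCS, object manager, dashboard and Ray Client ports given above. The same check run from the head node toward each worker matters too, since Ray nodes connect to each other in both directions.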

hora-anyscale commented 1 year ago

@CansuCandan - please confirm if your system has the appropriate connectivity?

CansuCandan commented 1 year ago

Hi, @hora-anyscale and @rkooo567

Many thanks for your answer.

Yes, I realized that this problem is due to connection problems.

Our security team opened ports based on https://docs.ray.io/en/master/ray-core/configure.html#ports-configurations. But that wasn't enough.

The reason the CentOS machine cannot be added to the cluster is that it is on a different WLAN, while the other machines are on the same WLAN. That is why traffic between them gets stuck on the firewall's port restrictions.

Linux machines:
x.x.x.x => Head Node => CentOS
x.x.x.x => Worker Node => Ubuntu
x.x.x.x => Worker Node => Ubuntu

However, when I examine the source and destination IPs and ports, the destination ports on the worker nodes are constantly changing. Even with the main ports used by Ray opened between these three machines, we found that it still causes a problem because the ports on the worker node IPs change.

When I checked with netstat -tulnap | grep worker_node_ip: [image]

Then I stopped the Ray cluster and set it up again on the same machines. When I checked with netstat -tulnap | grep worker_node_ip: [image]

As a result, when all ports are open between these three machines, machines on other WLANs can be added to the same cluster. For security reasons, we do not want all ports to be open between these three machines. But even when we open only the ports used by Ray (https://docs.ray.io/en/master/ray-core/configure.html#ports-configurations), the CentOS machine cannot be added to the cluster.

Thank you.

rkooo567 commented 1 year ago
[image: Screen Shot 2023-01-11 at 6 56 45 AM]

Aren't the ports changing because they fall within the worker port range?
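
If that is the cause, one option is to pin the otherwise variable ports when starting each worker node, so the firewall only needs a known, narrow range. The sketch below only builds the command line; worker_start_command is a hypothetical helper and the port numbers in the usage note are made-up examples, not values from this issue. It assumes the ray start flags --node-manager-port, --object-manager-port, --min-worker-port and --max-worker-port.

```python
from typing import List


def worker_start_command(
    head_address: str,
    node_manager_port: int,
    object_manager_port: int,
    min_worker_port: int,
    max_worker_port: int,
) -> List[str]:
    """Build a `ray start` command for a worker node with fixed ports."""
    if min_worker_port > max_worker_port:
        raise ValueError("min_worker_port must not exceed max_worker_port")
    return [
        "ray", "start",
        f"--address={head_address}",
        # Without these two flags the raylet and object manager ports may be chosen at random.
        f"--node-manager-port={node_manager_port}",
        f"--object-manager-port={object_manager_port}",
        # Worker processes bind to ports inside this range.
        f"--min-worker-port={min_worker_port}",
        f"--max-worker-port={max_worker_port}",
    ]
```

For example, " ".join(worker_start_command("x.x.x.x:6379", 8077, 8076, 10002, 10101)) yields a command whose ports could then be opened between the three machines. The worker range needs to be wide enough for the number of worker processes that run concurrently on the node, so a very narrow range may cause worker startup failures.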