ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

Ray node fails to connect to head, claims redis error despite redis connection working #9259

Closed jarbus closed 1 year ago

jarbus commented 4 years ago

What is the problem?

Problem summary: We can connect to a redis server located on a head node from a non-head computer via python, but ray throws a redis connection error when it tries to connect from said non-head computer.

Related thread: https://github.com/ray-project/ray/issues/6900 The downgraded version of psutil mentioned does not solve our issue.

(commands on Head and Worker denoted by H$ and W$ respectively)

H$ ray start --head
2020-07-01 11:32:35,976 INFO scripts.py:394 -- Using IP address 192.168.1.13 for this node.
2020-07-01 11:32:36,011 INFO resource_spec.py:204 -- Starting Ray with 6.84 GiB memory available for workers and up to 3.44 GiB for objects. You can adjust these settings with ray.init(memory=<bytes>, object_store_memory=<bytes>).
2020-07-01 11:32:36,646 INFO services.py:1163 -- View the Ray dashboard at localhost:8265
2020-07-01 11:32:36,733 INFO scripts.py:410 --
Started Ray on this node. You can add additional nodes to the cluster by calling

    ray start --address='192.168.1.13:6379' --redis-password='5241590000000000'

from the node you wish to add. You can connect a driver to the cluster from Python by running

    import ray
    ray.init(address='auto', redis_password='5241590000000000')

If you have trouble connecting from a different machine, check that your firewall is configured properly. If you wish to terminate the processes that have been started, run

    ray stop

W$ ray start --address='192.168.1.13:6379' --redis-password='5241590000000000'
2020-07-01 11:35:45,744 INFO scripts.py:467 -- Using IP address 192.168.1.216 for this node.
2020-07-01 11:35:45,816 INFO resource_spec.py:204 -- Starting Ray with 6.64 GiB memory available for workers and up to 2.85 GiB for objects. You can adjust these settings with ray.init(memory=<bytes>, object_store_memory=<bytes>).
2020-07-01 11:35:45,911 INFO scripts.py:477 --
Started Ray on this node. If you wish to terminate the processes that have been started, run

    ray stop

W$ nc -vz 192.168.1.13 6379
Connection to 192.168.1.13 6379 port [tcp/*] succeeded!   (<-- this is a redis test)

W$ ray timeline
2020-07-01 19:00:24,445 INFO scripts.py:1036 -- Connecting to Ray instance at 192.168.1.13:6379.
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0701 19:00:24.472828  2152  2152 global_state_accessor.cc:25] Redis server address = 192.168.1.13:6379, is test flag = 0
W0701 19:00:24.478263  2152  2152 redis_context.cc:307] Failed to connect to Redis, retrying.
...
W0701 19:00:24.483278  2152  2152 redis_context.cc:307] Failed to connect to Redis, retrying.
W0701 19:00:24.483487  2152  2152 redis_context.cc:307] Failed to connect to Redis, retrying.
W0701 19:00:24.483750  2152  2152 redis_context.cc:307] Failed to connect to Redis, retrying.
W0701 19:00:24.484005  2152  2152 redis_context.cc:307] Failed to connect to Redis, retrying.
W0701 19:00:24.484279  2152  2152 redis_context.cc:307] Failed to connect to Redis, retrying.
W0701 19:00:24.484525  2152  2152 redis_context.cc:307] Failed to connect to Redis, retrying.
W0701 19:00:24.484755  2152  2152 redis_context.cc:307] Failed to connect to Redis, retrying.
W0701 19:00:24.484956  2152  2152 redis_context.cc:307] Failed to connect to Redis, retrying.
W0701 19:00:24.485133  2152  2152 redis_context.cc:307] Failed to connect to Redis, retrying.
W0701 19:00:24.485313  2152  2152 redis_context.cc:307] Failed to connect to Redis, retrying.
W0701 19:00:24.485491  2152  2152 redis_context.cc:307] Failed to connect to Redis, retrying.
W0701 19:00:24.485673  2152  2152 redis_context.cc:307] Failed to connect to Redis, retrying.
W0701 19:00:24.485865  2152  2152 redis_context.cc:307] Failed to connect to Redis, retrying.
W0701 19:00:24.486061  2152  2152 redis_context.cc:307] Failed to connect to Redis, retrying.
W0701 19:00:24.486239  2152  2152 redis_context.cc:307] Failed to connect to Redis, retrying.
W0701 19:00:24.486409  2152  2152 redis_context.cc:307] Failed to connect to Redis, retrying.
W0701 19:00:24.486563  2152  2152 redis_context.cc:307] Failed to connect to Redis, retrying.
W0701 19:00:24.486718  2152  2152 redis_context.cc:307] Failed to connect to Redis, retrying.
W0701 19:00:24.486873  2152  2152 redis_context.cc:307] Failed to connect to Redis, retrying.
W0701 19:00:24.487048  2152  2152 redis_context.cc:307] Failed to connect to Redis, retrying.
W0701 19:00:24.487211  2152  2152 redis_context.cc:307] Failed to connect to Redis, retrying.
W0701 19:00:24.487366  2152  2152 redis_context.cc:307] Failed to connect to Redis, retrying.
W0701 19:00:24.487530  2152  2152 redis_context.cc:307] Failed to connect to Redis, retrying.
W0701 19:00:24.487686  2152  2152 redis_context.cc:307] Failed to connect to Redis, retrying.
W0701 19:00:24.487840  2152  2152 redis_context.cc:307] Failed to connect to Redis, retrying.
W0701 19:00:24.487994  2152  2152 redis_context.cc:307] Failed to connect to Redis, retrying.
W0701 19:00:24.488168  2152  2152 redis_context.cc:307] Failed to connect to Redis, retrying.
F0701 19:00:24.488487  2152  2152 redis_context.cc:302] Could not establish connection to redis 192.168.1.13:6379 (context.err = 1)
*** Check failure stack trace: ***
Aborted (core dumped)
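
The nc check above only proves that the TCP port is reachable. A minimal sketch of a slightly stronger check, using only the Python standard library, is to send an actual Redis AUTH and PING over the socket. The helper names (encode_command, read_line, redis_ping) are hypothetical and not part of Ray or any Redis client, and the sketch assumes the server answers with single-line replies:

```python
import socket


def encode_command(*parts: str) -> bytes:
    """Encode a Redis command as a RESP array of bulk strings."""
    out = [f"*{len(parts)}\r\n".encode()]
    for part in parts:
        data = part.encode()
        out.append(b"$" + str(len(data)).encode() + b"\r\n" + data + b"\r\n")
    return b"".join(out)


def read_line(sock: socket.socket) -> bytes:
    """Read one CRLF-terminated reply line (enough for +OK, +PONG, -ERR ...)."""
    buf = b""
    while not buf.endswith(b"\r\n"):
        chunk = sock.recv(1)
        if not chunk:
            raise ConnectionError("server closed the connection")
        buf += chunk
    return buf[:-2]


def redis_ping(host: str, port: int, password: str = "", timeout: float = 3.0) -> str:
    """Open a TCP connection, optionally AUTH, then PING; return the PING reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        if password:
            sock.sendall(encode_command("AUTH", password))
            reply = read_line(sock)
            if not reply.startswith(b"+"):
                raise RuntimeError(f"AUTH failed: {reply!r}")
        sock.sendall(encode_command("PING"))
        return read_line(sock).decode()
```

Calling redis_ping("192.168.1.13", 6379, "5241590000000000") from the worker would exercise the same address and password as the ray start command. If it succeeds while ray timeline still fails, Redis itself is reachable and the failure lies elsewhere.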

Some notes:

Ray version and other system information (Python version, TensorFlow version, OS):

Two computers:

Server2019 (HEAD, H):
  OS: Windows Server 2019 Version 1809 Build 17763.1282
  WSL- Ubuntu 20.04
  ray 0.8.6
  Python 3.8.2
  Tensorflow 2.2.0

Desktop-GTPUF8 (WORKER, W):
  OS: Windows 10 Version 1909 Build 18363.900
  WSL- Ubuntu 20.04
  ray 0.8.6 
  Python 3.8.2
  Tensorflow 2.2.0

Reproduction (REQUIRED)

Please provide a script that can be run to reproduce the issue. The script should have no external library dependencies (i.e., use fake or mock data / environments):

(commands on Head and Worker denoted by H$ and W$ respectively)

H$ ray start --head
W$ ray start --address='REDIS_ADDR_FROM_PREV_COMMAND' --redis-password='5241590000000000'
W$ ray timeline

richardliaw commented 4 years ago

cc @mehrdadn

mehrdadn commented 4 years ago

This is expected as we've only worked on single-node support so far on Windows, but I'll add this to #9114 to track. Thanks for reporting!

kjoth commented 4 years ago

To add to this: I'm also facing this issue, but on a Linux system. The Ray cluster was started inside a Kubernetes cluster, and accessing it via the Python client from another machine on the same network fails with W0701 19:00:24.483278 2152 2152 redis_context.cc:307] Failed to connect to Redis, retrying.

mehrdadn commented 4 years ago

@kjoth Great to know, thanks for sharing. To clarify, are you on actual Linux, or on WSL?

kjoth commented 4 years ago

@mehrdadn: Just a brief note on the environments.

  1. Ray 0.8.6 - Autoscaler - example-full.yaml in a Kubernetes cluster. (Linux)
  2. Trying to access it from another VM (Linux, not part of the k8s cluster) on the same network, via the Python client: import ray; ray.init(address=':<exposed_redisport' -6379>, redis_password='5241590000000000')

I get this message in the console: W0629 17:55:35.521113 31524 253148608 redis_context.cc:307] Failed to connect to Redis, retrying.

But accessing Redis directly via a Python client works.

There is a similar issue on GitHub, where the reporter came up with a workaround by making network changes to make it accessible: https://github.com/ray-project/ray/issues/6108

Just a thought: maybe it is a connection issue where Ray can't establish a connection back to the client machine, even though the client can communicate with the Ray server/head.

kjoth commented 4 years ago

@mehrdadn Do you want me to raise a separate issue for Linux (K8s), or will it be covered in this issue?

mehrdadn commented 4 years ago

@kjoth It's unclear to me if what you're facing is actually the same issue or not, so I'm not sure. "Failed to connect to Redis" is a fairly general error. But if it might be the same issue then we can keep it here. If it doesn't get resolved when this issue is fixed then we can open a separate issue for Linux.

kjoth commented 4 years ago

@mehrdadn, it's the same issue as above. Both of us are facing it, and we have discussed it on the Ray Slack channel. It makes sense to see whether the fix for this issue resolves it on Linux as well. I'll wait until the issue is fixed.

repelstiltskin commented 4 years ago

What is the ETA on this issue?

mehrdadn commented 4 years ago

@repelstiltskin I'm not sure we have an ETA right now; it may be some time before it's resolved. But if this affects you, please upvote the post. That helps the team prioritize work on different issues.

sytelus commented 4 years ago

I'm facing this issue on a single-node Windows machine. I've opened port 6379 for Redis, but the error still occurs. Any idea what else I should do?

sytelus commented 4 years ago

I finally figured it out by debugging through the source code. It appears the Redis server takes about 10 to 20 seconds to start for some reason. Maybe it's looking for something on the network? I'm on a private home Wi-Fi network, so I'm not sure why it takes so long. If I change time.sleep(0.1) to time.sleep(20) in the _start_redis_instance method in ray/services.py, then things work.
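
As an illustration of the idea only (this is not Ray's code, and wait_for_port is a hypothetical helper), a fixed sleep can be replaced by polling the port until it accepts connections or a deadline passes. That way a slow Redis start costs only as long as it actually takes:

```python
import socket
import time


def wait_for_port(host: str, port: int, deadline_s: float = 30.0,
                  interval_s: float = 0.1) -> bool:
    """Poll until a TCP connection to (host, port) succeeds or the deadline passes."""
    end = time.monotonic() + deadline_s
    while time.monotonic() < end:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval_s + 1.0):
                return True
        except OSError:
            # Not listening yet (or unreachable): wait briefly and retry.
            time.sleep(interval_s)
    return False
```

For example, wait_for_port("127.0.0.1", 6379, deadline_s=30.0) returns as soon as the port opens and returns False if it never does.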

mehrdadn commented 4 years ago

Thanks for looking into this! 20 seconds sounds like a timeout somewhere. Not sure if this is relevant but GlobalState._initialize_global_state has timeout=20, and it tries to connect to Redis.

TanjaBayer commented 4 years ago

Having the same issue here:

Setup

TanjaBayer commented 4 years ago

I am not seeing the same behavior as mentioned above: setting time.sleep(20) did not solve the issue, and the timeout from the global state has no effect, because for me it hangs when calling self.global_state_accessor.connect() in _initialize_global_state in state.py.

When I check whether self.redis_client is connected just before the GlobalStateAccessor is created, by calling set and get on the object, I can confirm that self.redis_client is successfully connected.

Setting bool_is_test_client to True results in the following output before it hangs:

I0914 16:33:43.296236 28816 28816 redis_client.cc:146] RedisClient connected.
I0914 16:33:43.296334 28816 28816 redis_gcs_client.cc:89] RedisGcsClient Connected.

When running on the same host, but still as two different Docker containers, the connection can be established.

TanjaBayer commented 4 years ago

Host Network

$ sudo netstat -tulpn | grep LISTEN
tcp        0      0 0.0.0.0:59822           0.0.0.0:*               LISTEN      20265/reporter.py - 
tcp6       0      0 :::62151                :::*                    LISTEN      20265/reporter.py - 
$ sudo lsof -i:62151
COMMAND     PID USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
gcs_serve 20248 root   46u  IPv6 208936      0t0  TCP localhost:40096->localhost:62151 (ESTABLISHED)
raylet    20264 root   61u  IPv6 203955      0t0  TCP localhost:40098->localhost:62151 (ESTABLISHED)
/usr/loca 20265 root    6u  IPv6 195254      0t0  TCP *:62151 (LISTEN)
/usr/loca 20265 root   10u  IPv6 206328      0t0  TCP localhost:62151->localhost:40096 (ESTABLISHED)
/usr/loca 20265 root   13u  IPv6 206330      0t0  TCP localhost:62151->localhost:40098 (ESTABLISHED)

Each time I start the Docker container, these processes get different ports. Is it possible that those ports, which are not listed as required ports, are causing the problem?
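
One way to test that hypothesis is to pin the ports that ray start would otherwise pick at random, so the same ports can be published from the container every time. A minimal sketch follows. The flag names are the ones documented for recent Ray releases and may not exist in older versions such as 0.8.6, so check ray start --help first. The helper name ray_start_command and all port values are hypothetical:

```python
from typing import List


def ray_start_command(head_address: str,
                      node_manager_port: int,
                      object_manager_port: int,
                      min_worker_port: int,
                      max_worker_port: int) -> List[str]:
    """Build a `ray start` command line with fixed ports instead of random ones."""
    return [
        "ray", "start",
        f"--address={head_address}",
        f"--node-manager-port={node_manager_port}",
        f"--object-manager-port={object_manager_port}",
        f"--min-worker-port={min_worker_port}",
        f"--max-worker-port={max_worker_port}",
    ]


if __name__ == "__main__":
    # Example values only; pick ports that are free and published by the container.
    cmd = ray_start_command("192.168.1.13:6379", 6380, 6381, 10002, 10020)
    print(" ".join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True). The netstat output above also shows a port held by reporter.py, which this sketch does not cover.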

MuhammadSYahyaS commented 4 years ago

Has anybody managed to solve this issue on a Linux machine?

zjutoe commented 3 years ago

I hit the same issue on Linux. Opening port 6379 on the firewall did not resolve it, but shutting down the firewall did. It seems more ports need to be opened, but I don't know which ones.

mehrdadn commented 3 years ago

@zjutoe Double-check that it's not a granularity issue, such as the incoming vs. outgoing vs. forward (etc.) tables if you're on iptables.

tkram01 commented 3 years ago

I am getting a similar error when running the aws cluster launcher. I get

redis_context.cc:335: Will retry in 100 milliseconds. Each retry takes about two minutes.

but I am able to connect from the client machine to redis using the redis cli without a problem so it doesn't seem to be a firewall issue.

fastlaner commented 3 years ago

Disabling my firewall completely also worked for me (for testing only, of course), so 6379 isn't the only port that needs to be opened.

PeadarOhAodha commented 3 years ago

I am getting a similar error when running the aws cluster launcher. I get

redis_context.cc:335: Will retry in 100 milliseconds. Each retry takes about two minutes.

but I am able to connect from the client machine to redis using the redis cli without a problem so it doesn't seem to be a firewall issue.

I got around this using ray client: https://docs.ray.io/en/master/ray-client.html - you connect on port 10001 (or you can set this to another port on ray start as per the docs.). So make sure your firewall provides access on that.
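
A minimal sketch of that workaround, assuming a Ray version whose ray.init accepts a ray:// address as described on the linked Ray Client page (older releases exposed a different entry point). The helper names ray_client_address and connect_via_ray_client are hypothetical:

```python
def ray_client_address(host: str, port: int = 10001) -> str:
    """Build a Ray Client address; 10001 is the port mentioned above."""
    return f"ray://{host}:{port}"


def connect_via_ray_client(host: str, port: int = 10001):
    """Connect to a remote cluster through the Ray Client server."""
    # Imported lazily so the address helper above works without Ray installed.
    import ray

    return ray.init(address=ray_client_address(host, port))
```

With this approach the client machine only needs to reach the Ray Client port on the head node, rather than every port used by the cluster's components.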

rkooo567 commented 3 years ago

If there are people who still struggle, did you guys try opening all ports specified here? https://docs.ray.io/en/latest/configure.html#ports-configurations

Ray requires bi-directional communication among its components, which means the ports of all those components must be open.
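
A quick way to see which ports are actually reachable is to probe a list of them over TCP. This is a rough sketch using only the standard library. check_ports is a hypothetical helper, and the port list should come from the configuration page linked above for your Ray version:

```python
import socket
from typing import Dict, Iterable


def check_ports(host: str, ports: Iterable[int], timeout: float = 2.0) -> Dict[int, bool]:
    """Return {port: True/False} depending on whether a TCP connect succeeds."""
    results = {}
    for port in ports:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                results[port] = True
        except OSError:
            # Refused, filtered by a firewall, or timed out.
            results[port] = False
    return results
```

For instance, check_ports("192.168.1.13", [6379, 8265, 10001]) probes the Redis, dashboard, and Ray Client ports mentioned in this thread. Because communication is bi-directional, it is worth running the same check from the head node against each worker as well.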

TanjaBayer commented 3 years ago

Just an update:

However this is a really special use case, but maybe someone can make use of this info :smiley:

mattip commented 1 year ago

This issue seems to have collected a number of different setups, all with the same error but with no clear common denominator. It is not clear what they share, but there are hints around networking problems and open ports. Since redis is no longer required, I will close this. If network issues recur, we should try to figure out if firewalls are involved.

Please open a new issue and describe your system's network if you run into this again.