Closed jarbus closed 1 year ago
cc @mehrdadn
This is expected as we've only worked on single-node support so far on Windows, but I'll add this to #9114 to track. Thanks for reporting!
To add to this point: I'm also facing this issue, but on a Linux system. A Ray cluster has been started inside a Kubernetes cluster, and accessing it via the Python client from another machine on the same network fails with W0701 19:00:24.483278 2152 2152 redis_context.cc:307] Failed to connect to Redis, retrying.
@kjoth Great to know, thanks for sharing. To clarify, are you on actual Linux, or on WSL?
@mehrdadn: Just a brief on the environments.
I get this message in the console: W0629 17:55:35.521113 31524 253148608 redis_context.cc:307] Failed to connect to Redis, retrying.
But I can access Redis via the Python client, and that works.
There is a similar issue on GitHub, where the reporter came up with a workaround by making network changes to make it accessible: https://github.com/ray-project/ray/issues/6108
Just a thought: maybe it is a connection issue where Ray cannot establish a connection back to the client machine, even though the client can communicate with the Ray server/head.
@mehrdadn Do you want me to raise a separate issue for Linux (K8s), or will it be covered in this issue itself?
@kjoth It's unclear to me if what you're facing is actually the same issue or not, so I'm not sure. "Failed to connect to Redis" is a fairly general error. But if it might be the same issue then we can keep it here. If it doesn't get resolved when this issue is fixed then we can open a separate issue for Linux.
@mehrdadn, it's the same issue as above. Both of us are facing it, and we have discussed it on the Ray Slack channel. It's fair to wait and see whether the fix for this issue resolves it on Linux as well. I will wait until the issue is fixed.
What is the ETA on this issue?
@repelstiltskin I'm not sure we have an ETA right now; it may be some time before it's resolved. But if this affects you, please upvote the post. That helps the team prioritize work on different issues.
I'm facing this issue on a single-node Windows machine. I've opened up port 6379 for Redis, but the error still occurs. Any idea what else I should do?
I finally figured it out by debugging through the source code. It appears the Redis server takes about 10 to 20 seconds to start for some reason. Maybe it's looking for something on the network? I'm on a private home Wi-Fi network, so I'm not sure why it takes so long. If I change time.sleep(0.1) to time.sleep(20) in the _start_redis_instance method in ray/services.py, then things work.
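Rather than hard-coding a longer sleep, one option is to poll the Redis port until it accepts connections. This is only a minimal sketch, assuming a plain TCP connect is a good enough signal that Redis is up. wait_for_port is a hypothetical helper, not something that exists in Ray:

```python
import socket
import time


def wait_for_port(host, port, timeout=30.0, interval=0.5):
    """Poll until a TCP connection to (host, port) succeeds or timeout expires.

    Hypothetical helper, not part of Ray. Returns True if the port accepted a
    connection before the deadline, False otherwise.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # create_connection raises OSError while nothing is listening yet.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

Calling wait_for_port("127.0.0.1", 6379, timeout=30) before connecting would return as soon as the server is listening, instead of always waiting the full 20 seconds.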
Thanks for looking into this! 20 seconds sounds like a timeout somewhere. Not sure if this is relevant, but GlobalState._initialize_global_state has timeout=20, and it tries to connect to Redis.
Having the same issue here:
import ray
ray.init(address='<ip>:<port>')
is not able to connect to the master.
import redis
rd = redis.Redis(host='172.31.1.57', port=12408, db=0, password='5241590000000000')
rd.set('foo', 'bar')
rd.get('foo')
That works without a problem.
I am not seeing the same behavior as mentioned above: setting time.sleep(20) did not solve the issue. The timeout from the global state does not have any effect either, because for me it hangs on the call to self.global_state_accessor.connect() in _initialize_global_state in state.py.
When I check whether self.redis_client is connected just before the GlobalStateAccessor is created, by calling set and get on the object, I can confirm that self.redis_client is successfully connected.
Setting bool_is_test_client to True results in the following output before it hangs:
I0914 16:33:43.296236 28816 28816 redis_client.cc:146] RedisClient connected.
I0914 16:33:43.296334 28816 28816 redis_gcs_client.cc:89] RedisGcsClient Connected.
When both run on the same host, but still as two different Docker containers, the connection can be established.
When running the master node inside Docker with the 'host' network, I also see several more ports being used:
$ sudo lsof -i -P -n | grep LISTEN
/usr/loca 20265 root 3u IPv4 195249 0t0 TCP *:59822 (LISTEN)
/usr/loca 20265 root 6u IPv6 195254 0t0 TCP *:62151 (LISTEN)
$ sudo netstat -tulpn | grep LISTEN
tcp 0 0 0.0.0.0:59822 0.0.0.0:* LISTEN 20265/reporter.py -
tcp6 0 0 :::62151 :::* LISTEN 20265/reporter.py -
$ sudo lsof -i:62151
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
gcs_serve 20248 root 46u IPv6 208936 0t0 TCP localhost:40096->localhost:62151 (ESTABLISHED)
raylet 20264 root 61u IPv6 203955 0t0 TCP localhost:40098->localhost:62151 (ESTABLISHED)
/usr/loca 20265 root 6u IPv6 195254 0t0 TCP *:62151 (LISTEN)
/usr/loca 20265 root 10u IPv6 206328 0t0 TCP localhost:62151->localhost:40096 (ESTABLISHED)
/usr/loca 20265 root 13u IPv6 206330 0t0 TCP localhost:62151->localhost:40098 (ESTABLISHED)
Each time I start the Docker container, they get different ports. Is it possible that those ports, which are not listed as required ports, are causing the problem?
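To compare which ports are listening from one container start to the next, the lsof output can be reduced to a list of port numbers. A rough sketch, assuming output in the same format as shown above. listening_ports is a hypothetical helper written for this comment:

```python
import re

# Matches the address column of a LISTEN row, e.g. "TCP *:59822 (LISTEN)".
LISTEN_RE = re.compile(r"TCP\s+\S*:(\d+)\s+\(LISTEN\)")


def listening_ports(lsof_output):
    """Return the sorted TCP ports in LISTEN state from `lsof -i -P -n` text."""
    ports = set()
    for line in lsof_output.splitlines():
        match = LISTEN_RE.search(line)
        if match:
            ports.add(int(match.group(1)))
    return sorted(ports)


# Sample rows copied from the output above; the ESTABLISHED row is ignored.
sample = """\
/usr/loca 20265 root 3u IPv4 195249 0t0 TCP *:59822 (LISTEN)
/usr/loca 20265 root 6u IPv6 195254 0t0 TCP *:62151 (LISTEN)
gcs_serve 20248 root 46u IPv6 208936 0t0 TCP localhost:40096->localhost:62151 (ESTABLISHED)
"""
print(listening_ports(sample))  # [59822, 62151]
```

Running this after each restart and diffing the lists would show which ports are reassigned every time.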
Has anybody managed to solve this issue on a Linux machine?
I hit the same issue on Linux. Opening port 6379 on the firewall did not resolve it, but shutting down the firewall did. It seems more ports need to be opened, but I don't know which ones.
@zjutoe Double-check that it's not a granularity issue, e.g. incoming vs. outgoing vs. forward rules, etc., if you're on iptables.
I am getting a similar error when running the aws cluster launcher. I get
redis_context.cc:335: Will retry in 100 milliseconds. Each retry takes about two minutes.
but I am able to connect from the client machine to Redis using the Redis CLI without a problem, so it doesn't seem to be a firewall issue.
Disabling my firewall completely worked for me as well (for testing only, of course), so 6379 isn't the only port that needs to be opened.
I got around this using Ray Client: https://docs.ray.io/en/master/ray-client.html. You connect on port 10001 (or you can set another port on ray start, as per the docs), so make sure your firewall allows access to that port.
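For reference, a minimal sketch of what that looks like on the client side. It assumes a Ray release in which ray.init accepts ray:// addresses (older releases exposed the client differently, so check the linked docs). Both function names are hypothetical helpers, not Ray APIs:

```python
def ray_client_address(host, port=10001):
    """Build a Ray Client address string such as ray://10.0.0.5:10001."""
    return f"ray://{host}:{port}"


def connect_via_ray_client(host, port=10001):
    """Connect through the Ray Client server instead of talking to Redis directly.

    Assumption: the installed Ray version supports ray:// addresses in
    ray.init. The head node must have been started with the client server
    listening on the given port.
    """
    import ray  # imported lazily so the address helper works without Ray installed

    return ray.init(ray_client_address(host, port))
```

With this approach only the client port has to be reachable from the client machine, which is why it sidesteps the firewall problem described above.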
If there are people who still struggle, did you guys try opening all ports specified here? https://docs.ray.io/en/latest/configure.html#ports-configurations
Ray requires bi-directional communication among its components, which means all of those components' ports must be open.
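A quick way to see which ports are actually reachable is to attempt one TCP connection per port. The following is a minimal sketch using only the standard library. probe_ports is a hypothetical helper, and the list of ports to check has to come from the configuration page linked above:

```python
import socket


def probe_ports(host, ports, timeout=1.0):
    """Try one TCP connection per port and report which ones accept it.

    Returns a dict mapping port -> True (reachable) or False (refused,
    filtered, or timed out).
    """
    results = {}
    for port in ports:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                results[port] = True
        except OSError:
            results[port] = False
    return results
```

For example, probe_ports("<head_ip>", [6379, 10001]) checks the Redis and Ray Client ports mentioned in this thread. Since the communication is bi-directional, the same check would also need to be run from the head node towards the client machine.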
Just an update:
However this is a really special use case, but maybe someone can make use of this info :smiley:
This issue seems to have collected a number of different setups, all with the same error but with no clear common denominator. It is not clear what they share, but there are hints around networking problems and open ports. Since redis is no longer required, I will close this. If network issues recur, we should try to figure out if firewalls are involved.
Please open a new issue and describe your system's network if you run into this again.
What is the problem?
Problem summary: We can connect to a redis server located on a head node from a non-head computer via python, but ray throws a redis connection error when it tries to connect from said non-head computer.
Related thread: https://github.com/ray-project/ray/issues/6900 The downgraded version of psutil mentioned there does not solve our issue.
(Commands on Head and Worker are denoted by H$ and W$ respectively.)
Some notes:
create_redis_client() in ray/services.py
ray/state.py:87 self.global_state_accessor.connect() failing to return. Raylet code is called from this point on.
Issue https://github.com/ray-project/ray/issues/9135 has this same javascript error.
Ray version and other system information (Python version, TensorFlow version, OS):
Two computers:
Reproduction (REQUIRED)
Please provide a script that can be run to reproduce the issue. The script should have no external library dependencies (i.e., use fake or mock data / environments):
(Commands on Head and Worker are denoted by H$ and W$ respectively.)