ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

Redis connection errors when calling from ray.init [tune] #15780

Closed PostmanSpat closed 2 years ago

PostmanSpat commented 3 years ago

What is the problem?

I'm trying to set up a basic environment to use TensorTrade (TensorFlow) and ray[tune], but when ray.init tries to connect to Redis I get the following error: ConnectionRefusedError: [WinError 10061] No connection could be made because the target machine actively refused it

Redis is set up, I've configured the password, and I can connect ok using redis-cli.

I've put a monitor on the Redis server, and I can see that Ray connects initially, but then it stops:

1620711360.767056 [0 127.0.0.1:57493] "AUTH" "secret"
1620711360.769051 [0 127.0.0.1:57494] "AUTH" "secret"
1620711360.770909 [0 127.0.0.1:57494] "SET" "redis_start_time" "1620711360.7703528"

I tracked the code through and found this in services.py:

def address_to_ip(address):
...
    # Make sure localhost isn't resolved to the loopback ip
    if ip_address == "127.0.0.1":
        ip_address = get_node_ip_address()
    return ":".join([ip_address] + address_parts[1:])

It seems that even though I pass in the IP 127.0.0.1, this code converts it back to 192.168.20.13. Connecting on the 127.0.0.1 address works, but connecting on the 192.168.20.13 address does not. Unfortunately, the system I am running is controlled by a group policy and I cannot turn off the Windows firewall completely. I can telnet to Redis on 127.0.0.1, but not on 192.168.20.13. When I installed Redis it added firewall rules, but I think the group policy might still prevent it from opening on the 192.168.20.13 address.
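To make the rewrite described above concrete, here is a minimal, self-contained sketch of the same logic. It is not Ray's actual `address_to_ip`: the `node_ip` parameter is a hypothetical stand-in for whatever `get_node_ip_address()` returns, and the elided part of the original function is replaced by a plain `socket.gethostbyname` lookup, which is an assumption.

```python
import socket


def address_to_ip_sketch(address, node_ip):
    """Resolve the host part of "host:port" and swap loopback for node_ip.

    node_ip is a hypothetical stand-in for Ray's get_node_ip_address().
    """
    address_parts = address.split(":")
    # Assumption: the elided code resolves the host name to an IPv4 address.
    ip_address = socket.gethostbyname(address_parts[0])
    # Same check as in the quoted snippet: loopback is never returned.
    if ip_address == "127.0.0.1":
        ip_address = node_ip
    return ":".join([ip_address] + address_parts[1:])


# The loopback address the caller asked for is silently replaced.
print(address_to_ip_sketch("127.0.0.1:6379", node_ip="192.168.20.13"))
```

With these inputs the sketch prints `192.168.20.13:6379`, which matches the address that shows up in the connection error even though 127.0.0.1 was requested.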

So I commented out these two lines of code from address_to_ip:

    #if ip_address == "127.0.0.1":
    #    ip_address = get_node_ip_address()

Then when I run, I get this error:

2021-05-12 20:14:12,212    INFO worker.py:663 -- Connecting to existing Ray cluster at address: 127.0.0.1:6379

...
  File "C:\Users\me\Anaconda3\envs\keras\lib\site-packages\ray\node.py", line 48, in _get_with_retry
    raise RuntimeError(f"Could not read '{key}' from GCS (redis). "

RuntimeError: Could not read 'session_name' from GCS (redis). Has redis started correctly on the head node?

I'm assuming that it has something to do with the group policies on my system preventing me from enabling access on 192.168.20.13, so I'm happy to do my testing with the two lines of code commented out to force the connection to use 127.0.0.1. But it would be nice if I could just do that through configuration.
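The telnet comparison mentioned earlier can also be scripted. The following is a minimal sketch using only the standard library; the helper name `can_connect` is hypothetical, and a `True` result only shows that a TCP connection was accepted, not that Redis authentication would succeed.

```python
import socket


def can_connect(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) is accepted."""
    try:
        # create_connection raises OSError (including ConnectionRefusedError
        # and timeouts) when the target cannot be reached.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling `can_connect("127.0.0.1", 6379)` and `can_connect("192.168.20.13", 6379)` one after the other would show whether only the non-loopback address is being refused.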

However, now with the "Could not read 'session_name'" error, I'm stuck. I don't know if it is related to the 127.0.0.1 change or to something else.
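One possible reading of this error, offered as an assumption rather than a confirmed diagnosis: the 'session_name' key is presumably written by a Ray head node when it starts, so a Redis instance that was started on its own would never contain it. The sketch below imitates that situation with a hypothetical in-memory store (`FakeKV`) and a retry helper that is only modelled on the error message, not copied from Ray. The session name value is made up.

```python
import time


class FakeKV:
    """Hypothetical in-memory stand-in for a Redis client with a get() method."""

    def __init__(self, data):
        self._data = dict(data)

    def get(self, key):
        return self._data.get(key)


def get_with_retry_sketch(client, key, num_retries=5, delay=0.0):
    """Poll client.get(key) until it returns a value or the retries run out."""
    for _ in range(num_retries):
        result = client.get(key)
        if result is not None:
            return result
        time.sleep(delay)
    raise RuntimeError(
        f"Could not read '{key}' from GCS (redis). "
        "Has redis started correctly on the head node?"
    )


# Assumption: a store populated by a Ray head node contains the key.
ray_like = FakeKV({"session_name": "session_example"})
print(get_with_retry_sketch(ray_like, "session_name"))

# A standalone Redis only holds what was written to it, e.g. redis_start_time.
standalone_like = FakeKV({"redis_start_time": "1620711360.7703528"})
try:
    get_with_retry_sketch(standalone_like, "session_name")
except RuntimeError as exc:
    print(exc)
```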

I also tried taking out the 127.0.0.1 address from ray.init(), but then I got this error:

  File "C:\Users\me\Anaconda3\envs\keras\lib\site-packages\redis\connection.py", line 563, in connect
    raise ConnectionError(self._error_message(e))

ConnectionError: Error 10061 connecting to 127.0.0.1:17091. No connection could be made because the target machine actively refused it.

...

  File "C:\Users\me\Anaconda3\envs\keras\lib\site-packages\ray\_private\services.py", line 670, in wait_for_redis_to_start
    raise RuntimeError(

RuntimeError: Unable to connect to Redis at 127.0.0.1:17091 after 12 retries. Check that 127.0.0.1:17091 is reachable from this machine. If it is not, your firewall may be blocking this port. If the problem is a flaky connection, try setting the environment variable `RAY_START_REDIS_WAIT_RETRIES` to increase the number of attempts to ping the Redis server.

Where did port 17091 come from?
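The full trace further down passes through `_start_redis_instance` and a `redis_shard_port`, which suggests that this port belongs to a Redis instance Ray tried to launch itself. The retry behaviour named in the error message can be sketched as follows. This is not Ray's `wait_for_redis_to_start`: it only checks that the TCP port accepts a connection, the helper name `wait_for_port` is hypothetical, and the fallback of 12 retries is taken from the message above as an assumption about the default.

```python
import os
import socket
import time


def wait_for_port(host, port, retries=None, delay=0.5):
    """Try to open a TCP connection, retrying a fixed number of times.

    Returns the number of the attempt that succeeded.
    """
    if retries is None:
        # Assumption: the environment variable overrides a default of 12.
        retries = int(os.environ.get("RAY_START_REDIS_WAIT_RETRIES", "12"))
    for attempt in range(1, retries + 1):
        try:
            with socket.create_connection((host, port), timeout=1.0):
                return attempt
        except OSError:
            time.sleep(delay)
    raise RuntimeError(
        f"Unable to connect to {host}:{port} after {retries} retries."
    )
```

Raising the retry count only helps when the server is slow to start. If the process never starts listening on the port, every attempt is refused in the same way.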

I've been discussing this on the Redis Discord channel, and they helped me get this far in the investigation. They have now suggested that I log an issue here.

Ray version and other system information (Python version, TensorFlow version, OS): Python 3.8, Windows 10 x64. Everything else was a fresh pip install this week.

Reproduction (REQUIRED)

Please provide a short code snippet (less than 50 lines if possible) that can be copy-pasted to reproduce the issue. The snippet should have no external library dependencies (i.e., use fake or mock data / environments):

import ray
ray.init(address="127.0.0.1:6379", _redis_password="foo!bared")

Full stack trace:

  File "C:\Users\me\Desktop\Junk\Python\PY\untitled0.py", line 2, in <module>
    ray.init(_redis_password="foo!bared")

  File "C:\Users\me\Anaconda3\envs\keras\lib\site-packages\ray\_private\client_mode_hook.py", line 62, in wrapper
    return func(*args, **kwargs)

  File "C:\Users\me\Anaconda3\envs\keras\lib\site-packages\ray\worker.py", line 730, in init
    _global_node = ray.node.Node(

  File "C:\Users\me\Anaconda3\envs\keras\lib\site-packages\ray\node.py", line 230, in __init__
    self.start_head_processes()

  File "C:\Users\me\Anaconda3\envs\keras\lib\site-packages\ray\node.py", line 860, in start_head_processes
    self.start_redis()

  File "C:\Users\me\Anaconda3\envs\keras\lib\site-packages\ray\node.py", line 675, in start_redis
    process_infos) = ray._private.services.start_redis(

  File "C:\Users\me\Anaconda3\envs\keras\lib\site-packages\ray\_private\services.py", line 881, in start_redis
    primary_redis_client.set("NumRedisShards", str(num_redis_shards))

  File "C:\Users\me\Anaconda3\envs\keras\lib\site-packages\redis\client.py", line 1801, in set
    return self.execute_command('SET', *pieces)

  File "C:\Users\me\Anaconda3\envs\keras\lib\site-packages\redis\client.py", line 898, in execute_command
    conn = self.connection or pool.get_connection(command_name, **options)

  File "C:\Users\me\Anaconda3\envs\keras\lib\site-packages\redis\connection.py", line 1192, in get_connection
    connection.connect()

  File "C:\Users\me\Anaconda3\envs\keras\lib\site-packages\redis\connection.py", line 563, in connect
    raise ConnectionError(self._error_message(e))

ConnectionError: Error 10061 connecting to 192.168.20.13:6379. No connection could be made because the target machine actively refused it.

  File "c:\users\me\desktop\junk\python\py\untitled0.py", line 7, in <module>
    ray.init(address="127.0.0.1:6379", _redis_password="foo!bared")

  File "C:\Users\me\Anaconda3\envs\keras\lib\site-packages\ray\_private\client_mode_hook.py", line 62, in wrapper
    return func(*args, **kwargs)

  File "C:\Users\me\Anaconda3\envs\keras\lib\site-packages\ray\worker.py", line 767, in init
    _global_node = ray.node.Node(

  File "C:\Users\me\Anaconda3\envs\keras\lib\site-packages\ray\node.py", line 163, in __init__
    session_name = _get_with_retry(redis_client, "session_name")

  File "C:\Users\me\Anaconda3\envs\keras\lib\site-packages\ray\node.py", line 48, in _get_with_retry
    raise RuntimeError(f"Could not read '{key}' from GCS (redis). "

RuntimeError: Could not read 'session_name' from GCS (redis). Has redis started correctly on the head node?

  File "C:\Users\me\Anaconda3\envs\keras\lib\site-packages\redis\connection.py", line 559, in connect
    sock = self._connect()

  File "C:\Users\me\Anaconda3\envs\keras\lib\site-packages\redis\connection.py", line 615, in _connect
    raise err

  File "C:\Users\me\Anaconda3\envs\keras\lib\site-packages\redis\connection.py", line 603, in _connect
    sock.connect(socket_address)

ConnectionRefusedError: [WinError 10061] No connection could be made because the target machine actively refused it

During handling of the above exception, another exception occurred:

Traceback (most recent call last):

  File "C:\Users\me\Anaconda3\envs\keras\lib\site-packages\ray\_private\services.py", line 656, in wait_for_redis_to_start
    redis_client.client_list()

  File "C:\Users\me\Anaconda3\envs\keras\lib\site-packages\redis\client.py", line 1194, in client_list
    return self.execute_command('CLIENT LIST')

  File "C:\Users\me\Anaconda3\envs\keras\lib\site-packages\redis\client.py", line 898, in execute_command
    conn = self.connection or pool.get_connection(command_name, **options)

  File "C:\Users\me\Anaconda3\envs\keras\lib\site-packages\redis\connection.py", line 1192, in get_connection
    connection.connect()

  File "C:\Users\me\Anaconda3\envs\keras\lib\site-packages\redis\connection.py", line 563, in connect
    raise ConnectionError(self._error_message(e))

ConnectionError: Error 10061 connecting to 127.0.0.1:17091. No connection could be made because the target machine actively refused it.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):

  File "c:\users\me\desktop\junk\python\py\untitled0.py", line 7, in <module>
    ray.init(_redis_password="foo!bared")

  File "C:\Users\me\Anaconda3\envs\keras\lib\site-packages\ray\_private\client_mode_hook.py", line 62, in wrapper
    return func(*args, **kwargs)

  File "C:\Users\me\Anaconda3\envs\keras\lib\site-packages\ray\worker.py", line 730, in init
    _global_node = ray.node.Node(

  File "C:\Users\me\Anaconda3\envs\keras\lib\site-packages\ray\node.py", line 230, in __init__
    self.start_head_processes()

  File "C:\Users\me\Anaconda3\envs\keras\lib\site-packages\ray\node.py", line 860, in start_head_processes
    self.start_redis()

  File "C:\Users\me\Anaconda3\envs\keras\lib\site-packages\ray\node.py", line 675, in start_redis
    process_infos) = ray._private.services.start_redis(

  File "C:\Users\me\Anaconda3\envs\keras\lib\site-packages\ray\_private\services.py", line 917, in start_redis
    redis_shard_port, p = _start_redis_instance(

  File "C:\Users\me\Anaconda3\envs\keras\lib\site-packages\ray\_private\services.py", line 1028, in _start_redis_instance
    wait_for_redis_to_start("127.0.0.1", port, password=password)

  File "C:\Users\me\Anaconda3\envs\keras\lib\site-packages\ray\_private\services.py", line 670, in wait_for_redis_to_start
    raise RuntimeError(

RuntimeError: Unable to connect to Redis at 127.0.0.1:17091 after 12 retries. Check that 127.0.0.1:17091 is reachable from this machine. If it is not, your firewall may be blocking this port. If the problem is a flaky connection, try setting the environment variable `RAY_START_REDIS_WAIT_RETRIES` to increase the number of attempts to ping the Redis server.

krfricke commented 3 years ago

This seems to be a Windows-related error rather than a Ray Tune issue.

cc @wuisawesome do you know who is currently responsible for Windows builds?

richardliaw commented 3 years ago

cc @fcardoso75 do you have any tips about this?

fcardoso75 commented 3 years ago

I can reproduce it:

>>> import ray
d:\anyscale\ray\python\ray\autoscaler\_private\cli_logger.py:57: FutureWarning: Not all Ray CLI dependencies were found. In Ray 1.4+, the Ray CLI, autoscaler, and dashboard will only be usable via `pip install 'ray[default]'`. Please update your install command.
  warnings.warn(
>>> ray.init(address="127.0.0.1:6379")
2021-06-14 17:35:18,662 INFO worker.py:733 -- Connecting to existing Ray cluster at address: 192.168.0.197:6379
Traceback (most recent call last):
  File "C:\ProgramData\Miniconda3\lib\site-packages\redis\connection.py", line 559, in connect
    sock = self._connect()
  File "C:\ProgramData\Miniconda3\lib\site-packages\redis\connection.py", line 615, in _connect
    raise err
  File "C:\ProgramData\Miniconda3\lib\site-packages\redis\connection.py", line 603, in _connect
    sock.connect(socket_address)
ConnectionRefusedError: [WinError 10061] No connection could be made because the target machine actively refused it

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "d:\anyscale\ray\python\ray\_private\client_mode_hook.py", line 62, in wrapper
    return func(*args, **kwargs)
  File "d:\anyscale\ray\python\ray\worker.py", line 837, in init
    _global_node = ray.node.Node(
  File "d:\anyscale\ray\python\ray\node.py", line 163, in __init__
    session_name = _get_with_retry(redis_client, "session_name")
  File "d:\anyscale\ray\python\ray\node.py", line 41, in _get_with_retry
    result = redis_client.get(key)
  File "C:\ProgramData\Miniconda3\lib\site-packages\redis\client.py", line 1606, in get
    return self.execute_command('GET', name)
  File "C:\ProgramData\Miniconda3\lib\site-packages\redis\client.py", line 898, in execute_command
    conn = self.connection or pool.get_connection(command_name, **options)
  File "C:\ProgramData\Miniconda3\lib\site-packages\redis\connection.py", line 1192, in get_connection
    connection.connect()
  File "C:\ProgramData\Miniconda3\lib\site-packages\redis\connection.py", line 563, in connect
    raise ConnectionError(self._error_message(e))
redis.exceptions.ConnectionError: Error 10061 connecting to 192.168.0.197:6379. No connection could be made because the target machine actively refused it.
>>> ray.init(address="127.0.0.1")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "d:\anyscale\ray\python\ray\_private\client_mode_hook.py", line 62, in wrapper
    return func(*args, **kwargs)
  File "d:\anyscale\ray\python\ray\worker.py", line 725, in init
    redis_address, _, _ = services.validate_redis_address(address)
  File "d:\anyscale\ray\python\ray\_private\services.py", line 409, in validate_redis_address
    raise ValueError("Malformed address. Expected '<host>:<port>'.")
ValueError: Malformed address. Expected '<host>:<port>'.
>>> ray.init()
2021-06-14 17:35:42,799 INFO services.py:1315 -- View the Ray dashboard at http://127.0.0.1:8265
{'node_ip_address': '192.168.0.197', 'raylet_ip_address': '192.168.0.197', 'redis_address': '192.168.0.197:6379', 'object_store_address': 'tcp://127.0.0.1:25331', 'raylet_socket_name': 'tcp://127.0.0.1:9691', 'webui_url': '127.0.0.1:8265', 'session_dir': 'C:\\Users\\Fabiano\\AppData\\Local\\Temp\\ray\\session_2021-06-14_17-35-37_775383_3708', 'metrics_export_port': 53209, 'node_id': '041af3962168a633269dea9b8922b0c0335adecf7821e88dda6f9e28'}
>>> (pid=None) d:\anyscale\ray\python\ray\autoscaler\_private\cli_logger.py:57: FutureWarning: Not all Ray CLI dependencies were found. In Ray 1.4+, the Ray CLI, autoscaler, and dashboard will only be usable via `pip install 'ray[default]'`. Please update your install command.
(pid=None)   warnings.warn(

>>> (pid=None) d:\anyscale\ray\python\ray\autoscaler\_private\cli_logger.py:57: FutureWarning: Not all Ray CLI dependencies were found. In Ray 1.4+, the Ray CLI, autoscaler, and dashboard will only be usable via `pip install 'ray[default]'`. Please update your install command.
(pid=None)   warnings.warn(
@ray.remote
... def f():
...   return "Hello"
...
>>> ray.get(f.remote())
'Hello'
>>> quit()

fcardoso75 commented 3 years ago

Also:

>>> import ray
d:\anyscale\ray\python\ray\autoscaler\_private\cli_logger.py:57: FutureWarning: Not all Ray CLI dependencies were found. In Ray 1.4+, the Ray CLI, autoscaler, and dashboard will only be usable via `pip install 'ray[default]'`. Please update your install command.
  warnings.warn(
>>> ray.init(address="192.168.0.197:6379")
2021-06-14 17:37:37,501 INFO worker.py:733 -- Connecting to existing Ray cluster at address: 192.168.0.197:6379
Traceback (most recent call last):
  File "C:\ProgramData\Miniconda3\lib\site-packages\redis\connection.py", line 559, in connect
    sock = self._connect()
  File "C:\ProgramData\Miniconda3\lib\site-packages\redis\connection.py", line 615, in _connect
    raise err
  File "C:\ProgramData\Miniconda3\lib\site-packages\redis\connection.py", line 603, in _connect
    sock.connect(socket_address)
ConnectionRefusedError: [WinError 10061] No connection could be made because the target machine actively refused it

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "d:\anyscale\ray\python\ray\_private\client_mode_hook.py", line 62, in wrapper
    return func(*args, **kwargs)
  File "d:\anyscale\ray\python\ray\worker.py", line 837, in init
    _global_node = ray.node.Node(
  File "d:\anyscale\ray\python\ray\node.py", line 163, in __init__
    session_name = _get_with_retry(redis_client, "session_name")
  File "d:\anyscale\ray\python\ray\node.py", line 41, in _get_with_retry
    result = redis_client.get(key)
  File "C:\ProgramData\Miniconda3\lib\site-packages\redis\client.py", line 1606, in get
    return self.execute_command('GET', name)
  File "C:\ProgramData\Miniconda3\lib\site-packages\redis\client.py", line 898, in execute_command
    conn = self.connection or pool.get_connection(command_name, **options)
  File "C:\ProgramData\Miniconda3\lib\site-packages\redis\connection.py", line 1192, in get_connection
    connection.connect()
  File "C:\ProgramData\Miniconda3\lib\site-packages\redis\connection.py", line 563, in connect
    raise ConnectionError(self._error_message(e))
redis.exceptions.ConnectionError: Error 10061 connecting to 192.168.0.197:6379. No connection could be made because the target machine actively refused it.
>>> ray.init()
2021-06-14 17:46:40,334 INFO services.py:1315 -- View the Ray dashboard at http://127.0.0.1:8265
{'node_ip_address': '192.168.0.197', 'raylet_ip_address': '192.168.0.197', 'redis_address': '192.168.0.197:6379', 'object_store_address': 'tcp://127.0.0.1:25569', 'raylet_socket_name': 'tcp://127.0.0.1:22893', 'webui_url': '127.0.0.1:8265', 'session_dir': 'C:\\Users\\Fabiano\\AppData\\Local\\Temp\\ray\\session_2021-06-14_17-46-35_279617_15616', 'metrics_export_port': 32616, 'node_id': '690eede43a036c64480cf6be71e9f68d734055b20735c9b0397444cf'}
>>>
>>>
>>> @ray.remote
... def f():
...   return "Hello"
...
>>> ray.get(f.remote())
'Hello'
>>> quit()

richardliaw commented 3 years ago

Hmm, ok. @PostmanSpat are you using an external redis?

Ray should be handling its own redis server.

PostmanSpat commented 3 years ago

@richardliaw No, I am running a Redis server on my local Windows system. I tried configuring Ray to use localhost and 127.0.0.1. The initial connection test works, but then Ray reverts the IP to 192.168.20.13 and fails.

RXminuS commented 3 years ago

I'm having the same issue. I don't know if it has the same cause, but it seems odd that the IP address configuration isn't respected.

YuanfengZhang commented 2 years ago

I hit the same problem. No standalone Redis was installed beforehand, so Ray was handling its own Redis server.

wuisawesome commented 2 years ago

@pcmoritz should we consider bumping the priority of this?

pcmoritz commented 2 years ago

I'm assigning this to you @mwtian since you are touching this codepath as part of the GCS work. Let us know if you need help working on this or if it turns out to be Windows-specific :)

mattip commented 2 years ago

FWIW, I can reproduce this on Linux with the latest HEAD, so I think the Windows label can be removed. I get empty error messages every ~20 seconds and then, when I hit ^C, I get a traceback.

>>> ray.init(address="127.0.0.1:6379")
2021-12-29 17:41:27,482 INFO worker.py:852 -- Connecting to existing Ray cluster at address: 10.0.0.19:6379
2021-12-29 17:41:47,509 ERROR node.py:1342 -- ERROR as
2021-12-29 17:42:09,537 ERROR node.py:1342 -- ERROR as
^CTraceback (most recent call last):
  File "/home/matti/miniconda3/envs/ray_dev/lib/python3.9/site-packages/redis/connection.py", line 559, in connect
    sock = self._connect()
  File "/home/matti/miniconda3/envs/ray_dev/lib/python3.9/site-packages/redis/connection.py", line 615, in _connect
    raise err
  File "/home/matti/miniconda3/envs/ray_dev/lib/python3.9/site-packages/redis/connection.py", line 603, in _connect
    sock.connect(socket_address)
ConnectionRefusedError: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/matti/ray_dev/python/ray/node.py", line 502, in get_gcs_client
    self._gcs_client = GcsClient(address=self.gcs_address)
  File "/home/matti/ray_dev/python/ray/node.py", line 409, in gcs_address
    return get_gcs_address_from_redis(redis)
  File "/home/matti/ray_dev/python/ray/_private/gcs_utils.py", line 110, in get_gcs_address_from_redis
    gcs_address = redis.get("GcsServerAddress")
  File "/home/matti/miniconda3/envs/ray_dev/lib/python3.9/site-packages/redis/client.py", line 1606, in get
    return self.execute_command('GET', name)
  File "/home/matti/miniconda3/envs/ray_dev/lib/python3.9/site-packages/redis/client.py", line 898, in execute_command
    conn = self.connection or pool.get_connection(command_name, **options)
  File "/home/matti/miniconda3/envs/ray_dev/lib/python3.9/site-packages/redis/connection.py", line 1192, in get_connection
    connection.connect()
  File "/home/matti/miniconda3/envs/ray_dev/lib/python3.9/site-packages/redis/connection.py", line 563, in connect
    raise ConnectionError(self._error_message(e))
redis.exceptions.ConnectionError: Error 111 connecting to 10.0.0.19:6379. Connection refused.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/matti/ray_dev/python/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/matti/ray_dev/python/ray/worker.py", line 954, in init
    _global_node = ray.node.Node(
  File "/home/matti/ray_dev/python/ray/node.py", line 165, in __init__
    session_name = self._internal_kv_get_with_retry(
  File "/home/matti/ray_dev/python/ray/node.py", line 1340, in _internal_kv_get_with_retry
    result = self.get_gcs_client().internal_kv_get(key, namespace)
  File "/home/matti/ray_dev/python/ray/node.py", line 505, in get_gcs_client
    time.sleep(1)
KeyboardInterrupt

mwtian commented 2 years ago

@mattip Just to confirm this is the same issue:

mattip commented 2 years ago

It seems the change to disallow 127.0.0.1 as a valid address came from PR #1556, with the comment:

The main issue was that localhost was getting resolved to the loopback ip, which wasn't very helpful since services are registered with their node ip. This fixes the address getter function to never return the loopback ip.

Perhaps there could be differentiation between situations where a user prefers that all services run on 127.0.0.1, or prefers localhost, or does not have a preference.
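One possible reading of the three cases above, sketched as a small policy switch. Everything here is hypothetical: `LoopbackPolicy` and `choose_ip` are illustrative names, not an existing Ray option or function.

```python
import enum


class LoopbackPolicy(enum.Enum):
    """Hypothetical user preference; not an existing Ray setting."""

    ALL_LOOPBACK = "all_loopback"      # run every service on 127.0.0.1
    AS_REQUESTED = "as_requested"      # keep whatever address the user passed
    NO_PREFERENCE = "no_preference"    # never return the loopback address


def choose_ip(requested_ip, node_ip, policy):
    """Pick the IP to register services with under the given policy."""
    if policy is LoopbackPolicy.ALL_LOOPBACK:
        return "127.0.0.1"
    if policy is LoopbackPolicy.AS_REQUESTED:
        return requested_ip
    # NO_PREFERENCE: replace loopback with the node IP, as the check
    # introduced by the PR quoted above does.
    return node_ip if requested_ip == "127.0.0.1" else requested_ip
```

Under this sketch, a user behind a restrictive firewall could pick `AS_REQUESTED` and keep 127.0.0.1 without patching the source.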

mattip commented 2 years ago

Since redis is no longer the default message broker, can we close this?

mwtian commented 2 years ago

@mattip can you see if the problem is still reproducible on Windows? Since ray.init() may still connect to the local global control store process, it is possible the problem still exists.

YuanfengZhang commented 2 years ago

I can help answer this. Ray and Ray-based Modin work well now. Here is the env info: ray_success.yaml.txt

mwtian commented 2 years ago

Good to see! Thanks for confirming @YuanfengZhang. Closing the issue.