ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Ray Serve] Unable to connect to GCS with ray start --head, but works from inside python #24920

Closed · Joshuaalbert closed this 2 years ago

Joshuaalbert commented 2 years ago

What happened + What you expected to happen

When I try to follow this tutorial for deploying on a single node and start up a Ray head node using ray start --head, it fails to start (see the error below).

However, when I start the server from inside a Python script, it works as expected (see below). I want to be able to do it the former way to make use of Serve's ability to dynamically update running deployments.
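
For context, the former workflow amounts to roughly the following sketch (assuming the Ray 1.x Serve API; the deployment itself is hypothetical): a long-running cluster started with ray start --head, plus short-lived scripts that connect to it and create or update deployments.

import ray
from ray import serve

# Connect to the long-running cluster started with `ray start --head`.
ray.init(address="auto")

# Start (or attach to) a detached Serve instance that outlives this script.
serve.start(detached=True)

# Hypothetical deployment; re-running a script like this updates it in place.
@serve.deployment
def hello(request):
    return "Hello!"

hello.deploy()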

Versions / Dependencies

ray, version 1.12.1
Redis server v=6.0.15 sha=00000000:0 malloc=jemalloc-5.2.1 bits=64 build=d583da279d383435

Reproduction script

ray start --head

Observe the following:

2022-05-18 10:13:12,091 WARNING utils.py:1254 -- Unable to connect to GCS at 10.0.0.105:6379. Check that (1) Ray GCS with matching version started successfully at the specified address, and (2) there is no firewall setting preventing access.

However, this works:

import ray
from ray import serve

serve.start()

while True:
  pass

Observe:

2022-05-18 10:34:37,061 INFO services.py:1456 -- View the Ray dashboard at http://127.0.0.1:8265
(ServeController pid=10417) 2022-05-18 10:34:40,010 INFO checkpoint_path.py:15 -- Using RayInternalKVStore for controller checkpoint and recovery.
(ServeController pid=10417) 2022-05-18 10:34:40,118 INFO http_state.py:106 -- Starting HTTP proxy with name 'SERVE_CONTROLLER_ACTOR:yZdKhI:SERVE_PROXY_ACTOR-node:10.0.0.105-0' on node 'node:10.0.0.105-0' listening on '127.0.0.1:8000'
2022-05-18 10:34:40,986 INFO api.py:794 -- Started Serve instance in namespace 'serve'.

Issue Severity

High: It blocks me from completing my task.

DmitriGekhtman commented 2 years ago

Is it possible there are left-over background Ray processes? Would running ray stop first help?

@simon-mo Is this a Ray core issue?

Joshuaalbert commented 2 years ago

There were no leftover Ray processes:

$ ray stop
Did not find any active Ray processes.

What worked was stopping the Redis service:

$ sudo service redis-server stop

Then ray start --head worked.

DmitriGekhtman commented 2 years ago

...oh, I see what happened here.

It used to be that the Ray head used Redis to store state, and Ray workers connected to the head by referencing the address of the head's Redis server (default port 6379).

Ray stopped using Redis in 1.11.0. To make that transition less disruptive, we kept workers connecting to the head by referencing port 6379; to make that possible, the head's GCS server now runs on port 6379 by default.

Unfortunately, as we see here, that opens the possibility of port conflict with a pre-existing Redis process.
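
A quick way to check for that conflict before running ray start (a minimal sketch, not part of Ray itself):

import socket

# See whether something (e.g. a system Redis) is already listening on port 6379.
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    in_use = s.connect_ex(("127.0.0.1", 6379)) == 0

print("port 6379 is already in use" if in_use else "port 6379 is free")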

DmitriGekhtman commented 2 years ago

Looks like we need a better error message here and/or an FAQ item in the docs. @mwtian would you mind opening an issue to track that?

Joshuaalbert commented 2 years ago

Quick question then: is the solution here to simply point Ray at my other Redis server (the one I'm using for RedisGraph)?

mwtian commented 2 years ago

@Joshuaalbert you can set RAY_REDIS_ADDRESS=<Redis address> in the environment before starting Ray, to use Redis as external storage. Also, you would need to start Ray on a port different from Redis's, e.g.

RAY_REDIS_ADDRESS=127.0.0.1:6379 ray start --head --port=8765
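
A driver would then connect to the head via the --port value rather than the Redis port, e.g. (a minimal sketch):

import ray

# Connect to the head started above; 8765 is the GCS port passed via --port,
# not the external Redis port.
ray.init(address="127.0.0.1:8765")
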
mwtian commented 2 years ago

@DmitriGekhtman filed #24985 and #24986 for the GCS startup failure error message and the external Redis docs.

Joshuaalbert commented 2 years ago

@mwtian I'm getting an error with that.

Launch a Docker container with Redis:

docker run -p 6379:6379 -it --rm redislabs/redisgraph

Then launch Ray pointing at it:

RAY_REDIS_ADDRESS=127.0.0.1:6379 ray start --head --port=8765

Observe:

File "/home/albert/miniconda3/envs/tf_py/bin/ray", line 8, in <module>
    sys.exit(main())
  File "/home/albert/miniconda3/envs/tf_py/lib/python3.8/site-packages/ray/scripts/scripts.py", line 2269, in main
    return cli()
  File "/home/albert/miniconda3/envs/tf_py/lib/python3.8/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/home/albert/miniconda3/envs/tf_py/lib/python3.8/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/home/albert/miniconda3/envs/tf_py/lib/python3.8/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/albert/miniconda3/envs/tf_py/lib/python3.8/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/albert/miniconda3/envs/tf_py/lib/python3.8/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/home/albert/miniconda3/envs/tf_py/lib/python3.8/site-packages/ray/autoscaler/_private/cli_logger.py", line 808, in wrapper
    return f(*args, **kwargs)
  File "/home/albert/miniconda3/envs/tf_py/lib/python3.8/site-packages/ray/scripts/scripts.py", line 719, in start
    node = ray.node.Node(
  File "/home/albert/miniconda3/envs/tf_py/lib/python3.8/site-packages/ray/node.py", line 101, in __init__
    ray._private.services.wait_for_redis_to_start(
  File "/home/albert/miniconda3/envs/tf_py/lib/python3.8/site-packages/ray/_private/services.py", line 832, in wait_for_redis_to_start
    redis_client.client_list()
  File "/home/albert/miniconda3/envs/tf_py/lib/python3.8/site-packages/redis/commands/core.py", line 531, in client_list
    return self.execute_command("CLIENT LIST", *args, **kwargs)
  File "/home/albert/miniconda3/envs/tf_py/lib/python3.8/site-packages/redis/client.py", line 1224, in execute_command
    conn = self.connection or pool.get_connection(command_name, **options)
  File "/home/albert/miniconda3/envs/tf_py/lib/python3.8/site-packages/redis/connection.py", line 1392, in get_connection
    connection.connect()
  File "/home/albert/miniconda3/envs/tf_py/lib/python3.8/site-packages/redis/connection.py", line 626, in connect
    self.on_connect()
  File "/home/albert/miniconda3/envs/tf_py/lib/python3.8/site-packages/redis/connection.py", line 716, in on_connect
    auth_response = self.read_response()
  File "/home/albert/miniconda3/envs/tf_py/lib/python3.8/site-packages/redis/connection.py", line 842, in read_response
    raise response
redis.exceptions.ResponseError: AUTH <password> called without any password configured for the default user. Are you sure your configuration is correct?

Joshuaalbert commented 2 years ago

Note that we can be sure Redis is available in the container, since redis-cli -h 127.0.0.1 -p 6379 works (and we ran sudo service redis-server stop beforehand, so the only Redis on that port is the one in the container).

mwtian commented 2 years ago

I see. Ray by default assumes Redis is configured with a password. Can you try:

  1. Specify a password when starting Redis, e.g. redis-server --requirepass xyz
  2. Set the same password when starting Ray, e.g. ray start --head --redis-password=xyz

If you use Ray's default Redis password 5241590000000000, the second step can be simplified to just ray start --head.
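
One way to sanity-check the password before starting Ray is to issue the same CLIENT LIST call that fails in the traceback above, e.g. with the redis-py client (a minimal sketch, using the placeholder password xyz):

import redis

# This mirrors the client_list() call Ray makes on startup; it raises a
# ResponseError if the password is wrong or not configured on the server.
client = redis.Redis(host="127.0.0.1", port=6379, password="xyz")
print(client.client_list())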

Joshuaalbert commented 2 years ago

The issue persists:

$ sudo service redis-server stop
$ REDIS_PASSWORD=1234 redis-server

Start Ray

$ RAY_REDIS_ADDRESS=127.0.0.1:6379 ray start --head --port=8754 --redis-password=1234

Same response: redis.exceptions.ResponseError: AUTH <password> called without any password configured for the default user. Are you sure your configuration is correct?

mwtian commented 2 years ago

Can you verify with redis-cli that REDIS_PASSWORD actually set the password on the Redis server? The command to start Ray looks right.

Joshuaalbert commented 2 years ago

I stand corrected: redis-server --requirepass "1234" is the right way to set the password from the command line.

mwtian commented 2 years ago

Great that it works now!

xlla commented 2 years ago

I ran into the same issue:

RAY_REDIS_ADDRESS=127.0.0.1:6379 ray start --address=127.0.0.1:8765 --redis-password=1234
Local node IP: 127.0.0.1
2022-06-17 02:33:15,872 WARNING utils.py:1250 -- Unable to connect to GCS at 127.0.0.1:8765. Check that (1) Ray GCS with matching version started successfully at the specified address, and (2) there is no firewall setting preventing access.
2022-06-17 02:33:22,886 WARNING utils.py:1250 -- Unable to connect to GCS at 127.0.0.1:8765. Check that (1) Ray GCS with matching version started successfully at the specified address, and (2) there is no firewall setting preventing access.

Redis is running:

redis-cli -h 127.0.0.1 -p 6379
127.0.0.1:6379> exit

From Python:

>>> import ray
>>> from ray import serve
>>> 
>>> serve.start()
File descriptor limit 256 is too low for production servers and may result in connection errors. At least 8192 is recommended. --- Fix with 'ulimit -n 8192'

2022-06-17 02:28:39,043 WARNING utils.py:1290 -- Unable to connect to GCS at 127.0.0.1:63326. Check that (1) Ray GCS with matching version started successfully at the specified address, and (2) there is no firewall setting preventing access.

Starting a node:

RAY_REDIS_ADDRESS=127.0.0.1:6379 ray start --address=127.0.0.1:6379 --redis-password=1234
Local node IP: 127.0.0.1
2022-06-17 02:26:01,638 WARNING utils.py:1250 -- Unable to connect to GCS at 127.0.0.1:6379. Check that (1) Ray GCS with matching version started successfully at the specified address, and (2) there is no firewall setting preventing access.
2022-06-17 02:26:08,666 WARNING utils.py:1250 -- Unable to connect to GCS at 127.0.0.1:6379. Check that (1) Ray GCS with matching version started successfully at the specified address, and (2) there is no firewall setting preventing access.

ray --version
ray, version 3.0.0.dev0

The pip-installed version 1.5.2 worked, but some TensorFlow-related functions failed, so I built Ray from source and ran into this issue.

How do I get the Ray GCS server running?

It seems every feature depends on it.


>>> algo = PPO(config=config)
2022-06-16 20:22:38,030 WARNING algorithm.py:2074 -- You have specified 1 evaluation workers, but your `evaluation_interval` is None! Therefore, evaluation will not occur automatically with each call to `Trainer.train()`. Instead, you will have to call `Trainer.evaluate()` manually in order to trigger an evaluation run.
2022-06-16 20:22:43,086 WARNING utils.py:1290 -- Unable to connect to GCS at 127.0.0.1:60804. Check that (1) Ray GCS with matching version started successfully at the specified address, and (2) there is no firewall setting preventing access.
2022-06-16 20:23:08,090 WARNING utils.py:1290 -- Unable to connect to GCS at 127.0.0.1:60804. Check that (1) Ray GCS with matching version started successfully at the specified address, and (2) there is no firewall setting preventing access.
>>> trainer = PPOTrainer(
...     config={
...         # Env class to use (here: our gym.Env sub-class from above).
...         "env": SimpleCorridor,
...         # Config dict to be passed to our custom env's constructor.
...         "env_config": {
...             # Use corridor with 20 fields (including S and G).
...             "corridor_length": 20
...         },
...         # Parallelize environment rollouts.
...         "num_workers": 3,
...     })
2022-06-17 01:54:38,281 WARNING ppo.py:350 -- `train_batch_size` (4000) cannot be achieved with your other settings (num_workers=3 num_envs_per_worker=1 rollout_fragment_length=200)! Auto-adjusting `rollout_fragment_length` to 1334.

2022-06-17 01:54:43,348 WARNING utils.py:1290 -- Unable to connect to GCS at 127.0.0.1:65169. Check that (1) Ray GCS with matching version started successfully at the specified address, and (2) there is no firewall setting preventing access.

DmitriGekhtman commented 2 years ago

@mwtian what was the conclusion when we closed this issue? Is there a flag that should be passed when starting the external Redis? Any pointers we should add to the Ray docs on setting up external Redis storage? cc @scv119 as well.

Spartee commented 2 years ago

Looking for more pointers on using external Redis storage as well.

simon-mo commented 2 years ago

Hi @Spartee can you expand more about your use case?

Spartee commented 2 years ago

Well I've seen

I'm particularly interested in HA implementations of Ray Serve with external Redis instances, especially this issue.

Are there any more resources/implementations of this out there?

simon-mo commented 2 years ago

Ah yes, support for this feature was recently implemented in the cluster manager layer: https://ray-project.github.io/kuberay/guidance/gcs-ft/

Spartee commented 2 years ago

@simon-mo perfect. Exactly what I was looking for. Thank you.