Closed Joshuaalbert closed 2 years ago
Is it possible there are left-over background Ray processes? Would running ray stop first help?
@simon-mo Is this a Ray core issue?
There were no actors:
$ ray stop
Did not find any active Ray processes.
What worked was stopping the Redis service:
$ sudo service redis-server stop
Then ray start --head worked.
...oh, I see what happened here.
It used to be that the Ray head used redis to store state and Ray workers connected to the head by referencing the address of the head's Redis server (default port 6379).
Ray stopped using Redis in 1.11.0. To make that transition less disruptive, we made it so that workers still connect to the head by referencing the same default port.
Unfortunately, as we see here, that opens the possibility of port conflict with a pre-existing Redis process.
Looks like we need a better error message here and/or an FAQ item in the docs. @mwtian would you mind opening an issue to track that?
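A quick way to check for the conflict described above before running ray start --head is to probe the port. This is only a minimal sketch using the Python standard library; the helper name port_in_use is hypothetical and not part of Ray.

```python
import socket


def port_in_use(host: str, port: int, timeout: float = 0.5) -> bool:
    """Return True if something accepts TCP connections on host:port.

    Hypothetical helper for illustration; it only tells you that *some*
    process is listening, not whether that process is Redis or Ray.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an errno value otherwise.
        return sock.connect_ex((host, port)) == 0


# Example: 6379 is the default port mentioned in this thread.
# if port_in_use("127.0.0.1", 6379):
#     print("Port 6379 is taken; stop that service or pass --port to ray start.")
```

If the probe returns True for 6379 while no Ray process is running, a pre-existing Redis server is a likely candidate, as in this issue.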
Quick question then: is a solution here to simply point Ray at my other redis server (which I'm using for redis-graph)?
@Joshuaalbert you can set RAY_REDIS_ADDRESS=<Redis address> in the environment before starting Ray to use Redis as external storage. You would also need to start Ray on a port different from the one Redis uses, e.g.
RAY_REDIS_ADDRESS=127.0.0.1:6379 ray start --head --port=8765
@DmitriGekhtman filed #24985 and #24986 for GCS start up failure error message and external Redis doc.
@mwtian I'm getting an error with that.
Launch a Docker container with Redis:
docker run -p 6379:6379 -it --rm redislabs/redisgraph
Then launch ray pointing there:
RAY_REDIS_ADDRESS=127.0.0.1:6379 ray start --head --port=8765
Observe:
File "/home/albert/miniconda3/envs/tf_py/bin/ray", line 8, in <module>
sys.exit(main())
File "/home/albert/miniconda3/envs/tf_py/lib/python3.8/site-packages/ray/scripts/scripts.py", line 2269, in main
return cli()
File "/home/albert/miniconda3/envs/tf_py/lib/python3.8/site-packages/click/core.py", line 1130, in __call__
return self.main(*args, **kwargs)
File "/home/albert/miniconda3/envs/tf_py/lib/python3.8/site-packages/click/core.py", line 1055, in main
rv = self.invoke(ctx)
File "/home/albert/miniconda3/envs/tf_py/lib/python3.8/site-packages/click/core.py", line 1657, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/albert/miniconda3/envs/tf_py/lib/python3.8/site-packages/click/core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/albert/miniconda3/envs/tf_py/lib/python3.8/site-packages/click/core.py", line 760, in invoke
return __callback(*args, **kwargs)
File "/home/albert/miniconda3/envs/tf_py/lib/python3.8/site-packages/ray/autoscaler/_private/cli_logger.py", line 808, in wrapper
return f(*args, **kwargs)
File "/home/albert/miniconda3/envs/tf_py/lib/python3.8/site-packages/ray/scripts/scripts.py", line 719, in start
node = ray.node.Node(
File "/home/albert/miniconda3/envs/tf_py/lib/python3.8/site-packages/ray/node.py", line 101, in __init__
ray._private.services.wait_for_redis_to_start(
File "/home/albert/miniconda3/envs/tf_py/lib/python3.8/site-packages/ray/_private/services.py", line 832, in wait_for_redis_to_start
redis_client.client_list()
File "/home/albert/miniconda3/envs/tf_py/lib/python3.8/site-packages/redis/commands/core.py", line 531, in client_list
return self.execute_command("CLIENT LIST", *args, **kwargs)
File "/home/albert/miniconda3/envs/tf_py/lib/python3.8/site-packages/redis/client.py", line 1224, in execute_command
conn = self.connection or pool.get_connection(command_name, **options)
File "/home/albert/miniconda3/envs/tf_py/lib/python3.8/site-packages/redis/connection.py", line 1392, in get_connection
connection.connect()
File "/home/albert/miniconda3/envs/tf_py/lib/python3.8/site-packages/redis/connection.py", line 626, in connect
self.on_connect()
File "/home/albert/miniconda3/envs/tf_py/lib/python3.8/site-packages/redis/connection.py", line 716, in on_connect
auth_response = self.read_response()
File "/home/albert/miniconda3/envs/tf_py/lib/python3.8/site-packages/redis/connection.py", line 842, in read_response
raise response
redis.exceptions.ResponseError: AUTH <password> called without any password configured for the default user. Are you sure your configuration is correct?
Note that we can be sure Redis is available in the container, since redis-cli -h 127.0.0.1 -p 6379 works (and we ran sudo service redis-server stop beforehand, so the only Redis on that port is the one in the container).
I see. Ray by default assumes Redis is configured with a password. Can you try:
redis-server --requirepass=xyz
ray start --head --redis-password=xyz
If you use Ray's default Redis password 5241590000000000, the 2nd step can be simplified to ray start --head too.
The issue persists:
$ sudo service redis-server stop
$ REDIS_PASSWORD=1234 redis-server
Start Ray
$ RAY_REDIS_ADDRESS=127.0.0.1:6379 ray start --head --port=8754 --redis-password=1234
Same response: redis.exceptions.ResponseError: AUTH <password> called without any password configured for the default user. Are you sure your configuration is correct?
Can you verify with redis-cli that REDIS_PASSWORD actually set the password on the Redis server? The command to start Ray looks right.
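Besides redis-cli, the check can be scripted. The sketch below sends an unauthenticated PING over a raw socket and inspects the reply: a Redis server with a password configured answers with a -NOAUTH error, while one without a password answers +PONG. The function name redis_requires_password is hypothetical, and the sketch assumes a plain (non-TLS) Redis reachable on the given host and port.

```python
import socket


def redis_requires_password(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if the Redis server rejects an unauthenticated PING.

    Hypothetical helper for illustration. It uses Redis's inline command
    form ("PING\r\n") so that no client library is needed.
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"PING\r\n")
        reply = sock.recv(1024)
    if reply.startswith(b"+PONG"):
        return False  # no password configured
    if reply.startswith(b"-NOAUTH"):
        return True  # a password is required
    raise RuntimeError(f"Unexpected reply from server: {reply!r}")


# Example (needs a running Redis):
# print(redis_requires_password("127.0.0.1", 6379))
```

If this returns False while Ray is started with --redis-password, the AUTH error shown in this thread is the expected outcome.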
I stand corrected: redis-server --requirepass "1234" is the right way to set the password from the command line.
Great that it works now!
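For anyone scripting this setup, the pieces that worked in this thread (the RAY_REDIS_ADDRESS environment variable, a --port different from the Redis port, and a matching --redis-password) can be assembled as below. This is a sketch under those assumptions only; ray_head_command is a hypothetical helper, and it just builds the command and environment without launching anything.

```python
import os
from typing import Dict, List, Optional, Tuple


def ray_head_command(
    redis_address: str,
    ray_port: int,
    redis_password: Optional[str] = None,
) -> Tuple[List[str], Dict[str, str]]:
    """Build argv and env for starting a Ray head against an external Redis.

    Hypothetical helper; it mirrors the shell command used in this thread.
    """
    redis_port = int(redis_address.rsplit(":", 1)[1])
    if redis_port == ray_port:
        # The thread's advice: Ray must listen on a port different from Redis.
        raise ValueError("Ray's --port must differ from the Redis port")
    env = dict(os.environ, RAY_REDIS_ADDRESS=redis_address)
    argv = ["ray", "start", "--head", f"--port={ray_port}"]
    if redis_password is not None:
        argv.append(f"--redis-password={redis_password}")
    return argv, env


# Example: pass the result to subprocess.run(argv, env=env, check=True).
# argv, env = ray_head_command("127.0.0.1:6379", 8765, redis_password="1234")
```

The Redis server itself still has to be started with the same password, e.g. redis-server --requirepass "1234" as noted above.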
I ran into the same issue:
RAY_REDIS_ADDRESS=127.0.0.1:6379 ray start --address=127.0.0.1:8765 --redis-password=1234
Local node IP: 127.0.0.1
2022-06-17 02:33:15,872 WARNING utils.py:1250 -- Unable to connect to GCS at 127.0.0.1:8765. Check that (1) Ray GCS with matching version started successfully at the specified address, and (2) there is no firewall setting preventing access.
2022-06-17 02:33:22,886 WARNING utils.py:1250 -- Unable to connect to GCS at 127.0.0.1:8765. Check that (1) Ray GCS with matching version started successfully at the specified address, and (2) there is no firewall setting preventing access.
Redis is running:
redis-cli -h 127.0.0.1 -p 6379
127.0.0.1:6379> exit
Python version:
>>> import ray
>>> from ray import serve
>>>
>>> serve.start()
File descriptor limit 256 is too low for production servers and may result in connection errors. At least 8192 is recommended. --- Fix with 'ulimit -n 8192'
2022-06-17 02:28:39,043 WARNING utils.py:1290 -- Unable to connect to GCS at 127.0.0.1:63326. Check that (1) Ray GCS with matching version started successfully at the specified address, and (2) there is no firewall setting preventing access.
Node:
RAY_REDIS_ADDRESS=127.0.0.1:6379 ray start --address=127.0.0.1:6379 --redis-password=1234
Local node IP: 127.0.0.1
2022-06-17 02:26:01,638 WARNING utils.py:1250 -- Unable to connect to GCS at 127.0.0.1:6379. Check that (1) Ray GCS with matching version started successfully at the specified address, and (2) there is no firewall setting preventing access.
2022-06-17 02:26:08,666 WARNING utils.py:1250 -- Unable to connect to GCS at 127.0.0.1:6379. Check that (1) Ray GCS with matching version started successfully at the specified address, and (2) there is no firewall setting preventing access.
ray --version
ray, version 3.0.0.dev0
The pip version (1.5.2) worked, but some TensorFlow-related functions failed, so I built Ray from source and encountered this issue.
How do I get the Ray GCS server running? It seems every feature depends on it.
>>> algo = PPO(config=config)
2022-06-16 20:22:38,030 WARNING algorithm.py:2074 -- You have specified 1 evaluation workers, but your `evaluation_interval` is None! Therefore, evaluation will not occur automatically with each call to `Trainer.train()`. Instead, you will have to call `Trainer.evaluate()` manually in order to trigger an evaluation run.
2022-06-16 20:22:43,086 WARNING utils.py:1290 -- Unable to connect to GCS at 127.0.0.1:60804. Check that (1) Ray GCS with matching version started successfully at the specified address, and (2) there is no firewall setting preventing access.
2022-06-16 20:23:08,090 WARNING utils.py:1290 -- Unable to connect to GCS at 127.0.0.1:60804. Check that (1) Ray GCS with matching version started successfully at the specified address, and (2) there is no firewall setting preventing access.
>>> trainer = PPOTrainer(
...     config={
...         # Env class to use (here: our gym.Env sub-class from above).
...         "env": SimpleCorridor,
...         # Config dict to be passed to our custom env's constructor.
...         "env_config": {
...             # Use corridor with 20 fields (including S and G).
...             "corridor_length": 20
...         },
...         # Parallelize environment rollouts.
...         "num_workers": 3,
...     })
2022-06-17 01:54:38,281 WARNING ppo.py:350 -- `train_batch_size` (4000) cannot be achieved with your other settings (num_workers=3 num_envs_per_worker=1 rollout_fragment_length=200)! Auto-adjusting `rollout_fragment_length` to 1334.
2022-06-17 01:54:43,348 WARNING utils.py:1290 -- Unable to connect to GCS at 127.0.0.1:65169. Check that (1) Ray GCS with matching version started successfully at the specified address, and (2) there is no firewall setting preventing access.
@mwtian what was the conclusion when we closed this issue? Is there a flag that should be passed when starting the external Redis? Any pointers we should add to the Ray docs on setting up external Redis storage? cc @scv119 as well.
Looking for more pointers on using external redis storage as well.
Hi @Spartee can you expand more about your use case?
Well I've seen
I'm particularly interested in HA implementations of Ray Serve with external Redis instances. Particularly this issue
Are there any more resources/implementations of this out there?
Ah yes. Support for this feature was recently implemented in the cluster manager layer: https://ray-project.github.io/kuberay/guidance/gcs-ft/
@simon-mo perfect. Exactly what I was looking for. Thank you.
What happened + What you expected to happen
When I try to follow this tutorial for deploying on a single node and start a Ray head node using ray start --head, it fails to start (see the error below). However, when I start a server from inside a Python script, it works as expected (see below). I want to be able to do it the former way to make use of Serve's ability to dynamically update running deployments.
Versions / Dependencies
Reproduction script
Observe the following
However, this works:
Observe:
Issue Severity
High: It blocks me from completing my task.