yuduber opened this issue 3 years ago
cc @edoakes @rkooo567 Have you seen this before or suggest next steps?
@yuduber is an important collaborator, though he is on 1.3.0.
Is there some way to specify the metrics-export port?
I think the metrics export port can be chosen (--metrics-export-port). The metrics agent port probably cannot, but this will be fixed on master soon.
Can we create a test that checks to make sure that all ports are configurable?
Hi, I'm a bot from the Ray team :)
To help human contributors focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months.
If there is no further activity in the next 14 days, the issue will be closed!
You can always ask for help on our discussion forum or Ray's public slack channel.
Hi again! The issue will be closed because there has been no further activity in the 14 days since the last message.
Please feel free to reopen or open a new issue if you'd still like it to be addressed.
Again, you can always ask for help on our discussion forum or Ray's public slack channel.
Thanks again for opening the issue!
Hi @rkooo567, it seems the ticket is automatically closed, and I wonder if this has been fixed. Thanks!
Let me check this soon
This wasn't fixed on master. It happens because, roughly once in 100 runs, the ports randomly selected for the agent and metrics export conflict. We need to avoid choosing a random port when it is already assigned to something else.
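That rough 1-in-100 rate is what you'd expect if two components independently draw a port uniformly from a pool of around 100 candidates. The pool size below is an illustrative assumption, not Ray's actual port range; a quick Monte Carlo sketch:

```python
import random

def collision_prob(pool_size: int, trials: int = 100_000, seed: int = 0) -> float:
    """Estimate the probability that two independent uniform draws from
    `pool_size` candidate ports land on the same port."""
    rng = random.Random(seed)
    hits = sum(rng.randrange(pool_size) == rng.randrange(pool_size)
               for _ in range(trials))
    return hits / trials

if __name__ == "__main__":
    # With ~100 candidate ports, two components collide about 1% of the time.
    print(f"{collision_prob(100):.3f}")
```

The analytic answer is simply 1/pool_size per pair of components, so the risk grows quickly as more components pick ports at random.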
Duplicate of https://github.com/ray-project/ray/issues/25793
Should this be 2.4?
The fix is not very easy, unfortunately... Let's defer it to a later version (it can be worked around by setting all ports manually).
We just randomly saw this when running ray start --head --port=6379 in Ray 2.2.0.
ValueError: Ray component metrics_export is trying to use a port number 52365 that is used by other components.
Port information: {'gcs': 'random', 'object_manager': 'random', 'node_manager': 'random', 'gcs_server': 6379, 'client_server': 10001, 'dashboard': 8265, 'dashboard_agent_grpc': 64676, 'dashboard_agent_http': 52365, 'metrics_export': 52365, 'redis_shards': 'random', 'worker_ports': '9998 ports from 10002 to 19999'}
Has it been solved since Ray 2.2.0? It seems not quite a duplicate of https://github.com/ray-project/ray/issues/25793, since that one was specifically about worker ports.
'dashboard_agent_http': 52365, 'metrics_export': 52365
I think it is an extremely unlucky case. I don't think we made special changes here, except using more fixed ports in general in production.
I see, looks like we rolled the ray start dice enough times to trigger this once :)
In the printed port information, what does it mean that some ports are 'random', whereas some that we didn't specify (like metrics_export) have assigned numbers?
random means the port is chosen randomly when the process starts. Some processes cannot do this due to implementation limitations, so they pre-choose a port (or it is hardcoded). Generally, if Ray is deployed in production, it is a good idea to set all ports manually.
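For illustration, here is a minimal sketch of the "pre-choose a port" pattern (a hypothetical helper, not Ray's actual code) and why it is racy: the OS hands out a currently-free port, but the socket is released before the real server binds it, leaving a window in which another process can take the same port.

```python
import socket

def pre_choose_free_port() -> int:
    """Ask the OS for a currently-free TCP port, then release it.
    Hypothetical helper mirroring the 'pre-choose' pattern described
    above: because the socket is closed before the real server starts,
    another process can grab the port in the meantime -- the race that
    produces the conflicts reported in this issue."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))  # port 0 = let the OS pick a free port
        return s.getsockname()[1]

if __name__ == "__main__":
    print(pre_choose_free_port())
```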
pre-choose a port
Are the ports that do not have defaults and are randomly pre-chosen just the dashboard ports dashboard_agent_grpc, dashboard_agent_http, and metrics_export?
I mean, if we set these three manually and correctly, do we remove the possibility of ray start non-deterministically conflicting with itself?
Are the ports that do not have defaults and are randomly pre-chosen just the dashboard ports
Took a closer look at the code; it looks like it's actually just dashboard_agent_grpc and dashboard_agent_http that have this property.
Would it make sense to provide reasonable defaults for these two?
I think unless you set all the ports manually, there's always a small possibility of conflict (unless we start the dashboard agent before starting other processes, which is not the case today). So the ideal approach is to set all ports manually.
I think unless you set all the ports manually, there's always a small possibility of conflict (unless we start the dashboard agent before starting other processes, which is not the case today). So the ideal approach is to set all ports manually.
It's not possible to set all ports manually when Ray runs in a multi-tenant CI/CD setup.
@kaya you mean you run ray start multiple times within the same instance, right (for parallel unit tests or something like that)?
This wasn't fixed on master. It happens because, roughly once in 100 runs, the ports randomly selected for the agent and metrics export conflict. We need to avoid choosing a random port when it is already assigned to something else.
In our application, this seems to occur with more than 1% probability. It has happened multiple times in the past year, and a simple restart cannot solve it. We use Docker to deploy the nodes; strangely, the same port conflict occurs on every restart.
It's good that the issue can be worked around by coding up some port allocation logic, or retrying. It's pretty bad that the basic Ray start API has a random chance of failure, for known reasons.
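One way to sketch that port-allocation workaround: hold sockets open while choosing ports so the picks cannot collide with one another, then pass them to ray start explicitly. The helper below is hypothetical (not from Ray), and while the flag names match ray start options, you should verify them against your Ray version.

```python
import socket
from contextlib import ExitStack

def reserve_distinct_ports(n: int) -> list[int]:
    """Pick n distinct free TCP ports, holding every socket open until all
    picks are made so no two picks can land on the same port."""
    with ExitStack() as stack:
        socks = [stack.enter_context(socket.socket(socket.AF_INET, socket.SOCK_STREAM))
                 for _ in range(n)]
        for s in socks:
            s.bind(("127.0.0.1", 0))  # port 0: OS assigns a free port
        return [s.getsockname()[1] for s in socks]

def build_ray_start_cmd() -> list[str]:
    """Assemble a ray start command with explicitly chosen agent ports."""
    grpc_port, http_port, metrics_port = reserve_distinct_ports(3)
    return [
        "ray", "start", "--head",
        f"--dashboard-agent-grpc-port={grpc_port}",
        f"--dashboard-agent-listen-port={http_port}",  # the HTTP agent port
        f"--metrics-export-port={metrics_port}",
    ]
```

Note that the reserved ports are released before ray start runs, so an unrelated process could still steal one in the gap; this only removes Ray's self-conflicts, and a retry loop around the command covers the remaining risk.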
I'm running into the same issue, where dashboard_agent_http is trying to use the same port as dashboard_agent_grpc.
ValueError: Ray component dashboard_agent_http is trying to use a port number 52365 that is used by other components.
Port information: {'gcs': 'random', 'object_manager': 'random', 'node_manager': 'random', 'gcs_server': 9339, 'client_server': 'random', 'dashboard': 8265, 'dashboard_agent_grpc': 52365, 'dashboard_agent_http': 52365, 'dashboard_grpc': 'random', 'runtime_env_agent': 48745, 'metrics_export': 45339, 'redis_shards': 'random', 'worker_ports': '9998 ports from 10002 to 19999'}
@DmitriGekhtman for your workaround, did you go with the retry approach (i.e. re-run ray start --head if you encounter the above ValueError), or were you able to manually configure the ports used by dashboard_agent_http and dashboard_agent_grpc?
We haven't fixed the issue yet -- for us, it's rare enough to sit in the backlog for a while (plus most of our jobs are wrapped in global retries, i.e. we'll typically create a distinct set of Ray pods automatically in case of startup failure).
When we get around to fixing it, we'll aim to specify all of the ports (since we're not totally sure whether retrying would have side effects).
This wasn't fixed on master. It happens because, roughly once in 100 runs, the ports randomly selected for the agent and metrics export conflict. We need to avoid choosing a random port when it is already assigned to something else.
In our application, this seems to occur with more than 1% probability. It has happened multiple times in the past year, and a simple restart cannot solve it. We use Docker to deploy the nodes; strangely, the same port conflict occurs on every restart.
a simple restart cannot solve it
I'm encountering the same thing; it happens when a worker group starts to scale up, while the master group works fine. What do you mean by a simple restart? A restart of what?
This unfortunate conflict is often encountered when running multiple Ray programs at the same time. Is there any way to avoid it? How do I assign different ports to different programs? Thank you!
[2024-01-16, 03:29:11 CST] {logging_mixin.py:137} INFO - ValueError: Ray component dashboard_agent_http is trying to use a port number 52365 that is used by other components.
[2024-01-16, 03:29:11 CST] {logging_mixin.py:137} INFO - Port information: {'gcs': 'random', 'object_manager': 'random', 'node_manager': 'random', 'gcs_server': 60025, 'client_server': 'random', 'dashboard': 'random', 'dashboard_agent_grpc': 52365, 'dashboard_agent_http': 52365, 'dashboard_grpc': 'random', 'metrics_export': 51216, 'redis_shards': 'random', 'worker_ports': 'random'}
[2024-01-16, 03:29:11 CST] {logging_mixin.py:137} INFO - If you allocate ports, please make sure the same port is not used by multiple components.
I strongly recommend that you set every port manually when you deploy Ray (https://docs.ray.io/en/master/ray-core/configure.html#ports-configurations) to avoid port conflicts.
I strongly recommend that you set every port manually when you deploy Ray (https://docs.ray.io/en/master/ray-core/configure.html#ports-configurations) to avoid port conflicts.
Thank you @rkooo567 !
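For the earlier question about running several Ray programs side by side: one sketch is to give each program an index and derive a disjoint block of ports from it. The base and block-size values below are arbitrary assumptions for illustration, and the flag names should be checked against your Ray version's ray start options.

```python
def ports_for_instance(index: int, base: int = 20000, block: int = 100) -> dict[str, int]:
    """Map each concurrently running Ray program (by index) to its own
    disjoint block of ports, so two programs can never conflict."""
    start = base + index * block
    return {
        "--port": start,  # GCS port on the head node
        "--node-manager-port": start + 1,
        "--object-manager-port": start + 2,
        "--dashboard-agent-grpc-port": start + 3,
        "--dashboard-agent-listen-port": start + 4,
        "--metrics-export-port": start + 5,
        "--min-worker-port": start + 10,
        "--max-worker-port": start + block - 1,
    }
```

With block=100, program 0 gets 20000-20099 and program 1 gets 20100-20199; any two instances' port sets are disjoint by construction, provided each program is launched with a unique index.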
Although I am specifying ports in my ray start command, I still encounter random ports being used with this example: https://docs.ray.io/en/releases-2.9.0/train/examples/lightning/dolly_lightning_fsdp_finetuning.html
Why is Ray Train using random ports between workers...?
Am I missing something in my ray start command...?
Python version: 3.10.13, Ray version: 2.9.0
Head node ray start:
sudo /opt/conda/envs/rayenv/bin/ray start --head --disable-usage-stats \
--node-manager-port=6380 \
--object-manager-port=6381 \
--runtime-env-agent-port=6382 \
--dashboard-agent-grpc-port=6383 \
--metrics-export-port=6384 \
--min-worker-port=10010 \
--max-worker-port=11010 \
--redis-shard-ports=8266 \
--dashboard-grpc-port=8267
Worker nodes ray start:
sudo /opt/conda/envs/rayenv/bin/ray start --address=instance-11-head:6379 \
--node-manager-port=6380 \
--object-manager-port=6381 \
--runtime-env-agent-port=6382 \
--dashboard-agent-grpc-port=6383 \
--metrics-export-port=6384 \
--min-worker-port=10010 \
--max-worker-port=11010
@tpatpa can you clarify how you know that Ray is still using random ports? Did you get an error message that looks something like this?
ValueError: Ray component dashboard_agent_http is trying to use a port number 52365 that is used by other components.
Port information: {'gcs': 'random', 'object_manager': 'random', 'node_manager': 'random', 'gcs_server': 9339, 'client_server': 'random', 'dashboard': 8265, 'dashboard_agent_grpc': 52365, 'dashboard_agent_http': 52365, 'dashboard_grpc': 'random', 'runtime_env_agent': 48745, 'metrics_export': 45339, 'redis_shards': 'random', 'worker_ports': '9998 ports from 10002 to 19999'}
Are you saying that the port numbers that you configured for the worker nodes when running ray start are getting ignored?
What is the problem?
We encountered an issue with an occasional Ray port conflict: "Ray component is trying to use a port number xxx that is used by other components." Ray 1.3.0.
Reproduction (REQUIRED)
Start the head node as normal, then start the worker node with the command below in the cloud: ray start --address=agent10909-phx4.prod.uber.internal:31014 --object-manager-port=31009 --worker-port-list=31034,31035,31046,31047,31048,31049,31061,31062,31063,31064,31065,31066 --num-cpus=10 --num-gpus=1 --block
We estimate that this issue happens in about 1 out of 100 runs.
The worker node won't be able to start; the log looks like the one below. I see Ray itself picks the same port for dashboard_agent and metrics_export, which we didn't specify in our ray start command.