ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[core] occasional Ray port conflict issue #18053

Open yuduber opened 3 years ago

yuduber commented 3 years ago

What is the problem?

We encountered an occasional Ray port conflict: a Ray component is trying to use a port number xxx that is already used by other components. Ray version: 1.3.0.

Reproduction (REQUIRED)

Start the head node as normal, then start the worker node in the cloud with the command below:

ray start --address=agent10909-phx4.prod.uber.internal:31014 --object-manager-port=31009 --worker-port-list=31034,31035,31046,31047,31048,31049,31061,31062,31063,31064,31065,31066 --num-cpus=10 --num-gpus=1 --block

We estimate that this issue happens in roughly 1 out of 100 runs.

The worker node won't be able to start; the log looks like the one below. I see that Ray itself picks the same port for dashboard_agent and metrics_export, which we didn't specify in our ray start command.

2021-08-22 07:50:43,072 INFO : worker_ports_str is 31034,31035,31046,31047,31048,31049,31061,31062,31063,31064,31065,31066
2021-08-22 07:50:43,073 INFO : Running ray worker with ray start --address=agent10909-phx4.prod.uber.internal:31014 --object-manager-port=31009 --worker-port-list=31034,31035,31046,31047,31048,31049,31061,31062,31063,31064,31065,31066 --num-cpus=10 --num-gpus=1 --block
/usr/lib/python3.6/site-packages/ray/autoscaler/_private/cli_logger.py:61: FutureWarning: Not all Ray CLI dependencies were found. In Ray 1.4+, the Ray CLI, autoscaler, and dashboard will only be usable via `pip install 'ray[default]'`. Please update your install command.
"update your install command.", FutureWarning)
Traceback (most recent call last):
File "/usr/bin/ray", line 8, in <module>
sys.exit(main())
File "/usr/lib/python3.6/site-packages/ray/scripts/scripts.py", line 1706, in main
return cli()
File "/usr/lib/python3.6/site-packages/click/core.py", line 1137, in _call_
return self.main(*args, **kwargs)
File "/usr/lib/python3.6/site-packages/click/core.py", line 1062, in main
rv = self.invoke(ctx)
File "/usr/lib/python3.6/site-packages/click/core.py", line 1668, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/usr/lib/python3.6/site-packages/click/core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/lib/python3.6/site-packages/click/core.py", line 763, in invoke
return __callback(*args, **kwargs)
File "/usr/lib/python3.6/site-packages/ray/scripts/scripts.py", line 657, in start
ray_params, head=False, shutdown_at_exit=block, spawn_reaper=block)
File "/usr/lib/python3.6/site-packages/ray/node.py", line 223, in _init_
self._ray_params.update_pre_selected_port()
File "/usr/lib/python3.6/site-packages/ray/_private/parameter.py", line 297, in update_pre_selected_port

ValueError: Ray component metrics_export is trying to use a port number 61240 that is used by other components.

Port information: {'gcs': [], 'object_manager': [31009], 'node_manager': [], 'gcs_server': [], 'client_server': [10001], 'dashboard': [8265], 'dashboard_agent': [61240], 'metrics_export': [61240], 'redis_shards': [], 'worker_ports': [31034, 31035, 31046, 31047, 31048, 31049, 31061, 31062, 31063, 31064, 31065, 31066]}
If you allocate ports, please make sure the same port is not used by multiple components.
I0822 07:50:44.000429     9 executor.cpp:1015] Command exited with status 0 (pid: 73)
I0822 07:50:45.002389    72 process.cpp:927] Stopped the socket accept loop

richardliaw commented 3 years ago

cc @edoakes @rkooo567 Have you seen this before, or can you suggest next steps?

@yuduber is an important collaborator, though he is on 1.3.0.

Is there some way to specify the metrics-export port?

rkooo567 commented 3 years ago

I think the metrics export port can be chosen (--metrics-export-port). The metrics agent port probably cannot be, but this will be fixed on master soon.
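For example, here is a minimal sketch of the worker command from the report above with the metrics export port pinned. The address and other flags are taken from the reproduction; the port value 31080 and the Python wrapper are just illustrative assumptions.

```python
import subprocess

# Sketch: the worker "ray start" from the report, with --metrics-export-port
# pinned so it cannot collide with a randomly chosen port.
# 31080 is a hypothetical free port; pick one that suits your environment.
cmd = [
    "ray", "start",
    "--address=agent10909-phx4.prod.uber.internal:31014",
    "--object-manager-port=31009",
    "--worker-port-list=31034,31035,31046,31047,31048,31049,"
    "31061,31062,31063,31064,31065,31066",
    "--metrics-export-port=31080",
    "--num-cpus=10",
    "--num-gpus=1",
    "--block",
]
subprocess.run(cmd, check=True)
```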

richardliaw commented 3 years ago

https://docs.ray.io/en/releases-1.3.0/ray-metrics.html?highlight=%E2%80%94metrics-export-port#getting-started-multi-nodes

Can we create a test that checks to make sure that all ports are configurable?

stale[bot] commented 2 years ago

Hi, I'm a bot from the Ray team :)

To help human contributors to focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months.

If there is no further activity within the next 14 days, the issue will be closed!

You can always ask for help on our discussion forum or Ray's public slack channel.

stale[bot] commented 2 years ago

Hi again! The issue will be closed because there has been no more activity in the 14 days since the last message.

Please feel free to reopen or open a new issue if you'd still like it to be addressed.

Again, you can always ask for help on our discussion forum or Ray's public slack channel.

Thanks again for opening the issue!

RuofanKong commented 2 years ago

Hi @rkooo567, it seems the ticket was automatically closed, and I wonder if this has been fixed. Thanks!

rkooo567 commented 2 years ago

Let me check this soon

rkooo567 commented 2 years ago

This wasn't fixed on master. It happens because, roughly once in 100 times, the ports randomly selected for the agent and metrics export conflict. We need to avoid choosing a random port when it is already assigned to something else.
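For illustration, here is a minimal standard-library sketch of that idea (this is not Ray's actual internal logic): before accepting a pre-selected port, check that it is neither reserved for another component nor currently bound on the machine.

```python
import errno
import socket


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Best-effort check: can we bind this port right now? (Still racy.)"""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        try:
            s.bind((host, port))
            return True
        except OSError as e:
            if e.errno == errno.EADDRINUSE:
                return False
            raise


def pick_port(reserved: set, start: int = 50000, end: int = 65535) -> int:
    """Pick a port that is neither reserved for another component nor in use."""
    for port in range(start, end):
        if port not in reserved and port_is_free(port):
            reserved.add(port)
            return port
    raise RuntimeError("no free port found in range")


# Example: choose agent and metrics ports without letting them collide.
reserved_ports = {8265, 10001, 31009}  # ports already assigned to other components
dashboard_agent_port = pick_port(reserved_ports)
metrics_export_port = pick_port(reserved_ports)
assert dashboard_agent_port != metrics_export_port
```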

hora-anyscale commented 1 year ago

Duplicate of https://github.com/ray-project/ray/issues/25793

rickyyx commented 1 year ago

Should this be 2.4?

rkooo567 commented 1 year ago

The fix is not very easy, unfortunately... Let's defer it to a later version (we can work around it by setting all ports manually).

DmitriGekhtman commented 11 months ago

We just randomly saw this when running ray start --head --port=6379 in Ray 2.2.0.

ValueError: Ray component metrics_export is trying to use a port number 52365 that is used by other components.

Port information: {'gcs': 'random', 'object_manager': 'random', 'node_manager': 'random', 'gcs_server': 6379, 'client_server': 10001, 'dashboard': 8265, 'dashboard_agent_grpc': 64676, 'dashboard_agent_http': 52365, 'metrics_export': 52365, 'redis_shards': 'random', 'worker_ports': '9998 ports from 10002 to 19999'}

Has it been solved since Ray 2.2.0? Seems not quite a duplicate of https://github.com/ray-project/ray/issues/25793, since that one was specifically about worker ports.

rkooo567 commented 11 months ago

dashboard_agent_http': 52365, 'metrics_export': 52365,

I think it is an extremely unlucky case. I don't think we made any special changes here, except using more fixed ports in general in production.

DmitriGekhtman commented 11 months ago

I see, looks like we rolled the ray start dice enough times to trigger this once :)

DmitriGekhtman commented 11 months ago

In the printed port information, what does it mean that some ports are random, whereas some that we didn't specify (like metrics_export) have assigned numbers?

rkooo567 commented 11 months ago

random means the port is chosen randomly when the process starts. Some processes cannot do this due to implementation limitations, and they pre-choose a port (or it is hardcoded). Generally, if Ray is deployed in prod, it is a good idea to set all ports manually.

DmitriGekhtman commented 11 months ago

pre-choose a port

Are the ports that do not have defaults and are randomly pre-chosen just the dashboard ports dashboard_agent_grpc, dashboard_agent_http, and metrics_export? I mean, if we set these three manually and correctly, do we remove the possibility of ray start non-deterministically conflicting with itself?

DmitriGekhtman commented 11 months ago

Are the ports that do not have defaults and are randomly pre-chosen just the dashboard ports

Took a closer look at the code, looks like it's actually just dashboard_agent_grpc and dashboard_agent_http that have this property. Would it make sense to provide reasonable defaults for these two?
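In the meantime, here is a sketch of pinning just those two at startup. The gRPC flag --dashboard-agent-grpc-port is also used further down in this thread; I believe the HTTP one is --dashboard-agent-listen-port, but treat that flag name and the port values as assumptions to verify against `ray start --help` for your Ray version.

```python
import subprocess

# Sketch: pin the two dashboard-agent ports that otherwise get pre-chosen
# randomly. Flag names and port values are assumptions to double-check
# against your Ray version.
cmd = [
    "ray", "start", "--head",
    "--port=6379",
    "--dashboard-agent-grpc-port=52366",    # dashboard_agent_grpc
    "--dashboard-agent-listen-port=52365",  # dashboard_agent_http (assumed flag name)
]
subprocess.run(cmd, check=True)
```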

rkooo567 commented 11 months ago

I think unless you set all the ports manually, there's always a small possibility of conflict (unless we start dashboard_agent before starting other procs, but that's not the case). So the ideal case is to set all ports manually.

kaya commented 9 months ago

I think unless you set all the ports manually, there's always a small possibility of conflict (unless we start dashboard_agent before starting other procs, but that's not the case). So the ideal case is to set all ports manually.

It's not possible to set all ports manually when Ray runs in a multi-tenant CI/CD setup.

rkooo567 commented 9 months ago

@kaya you mean you run ray start multiple times within the same instance, right (for parallel unit tests or something like that)?

FengLi666 commented 9 months ago

This wasn't fixed on master. It happens because, roughly once in 100 times, the ports randomly selected for the agent and metrics export conflict. We need to avoid choosing a random port when it is already assigned to something else.

In our application, this seems to occur with more than 1% probability. It has happened multiple times in the past year, and a simple restart cannot solve it. We use Docker to deploy nodes; strangely, there is always the same port conflict on every restart.

DmitriGekhtman commented 9 months ago

It's good that the issue can be worked around by coding up some port allocation logic, or retrying. It's pretty bad that the basic Ray start API has a random chance of failure, for known reasons.
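For anyone who needs a stopgap, here is a rough sketch of the retry flavor of that workaround: wrap ray start and retry when it fails with the port-conflict error. The command, the number of attempts, the error-string matching, and the `ray stop --force` cleanup step are all assumptions to adapt to your deployment.

```python
import subprocess
import time


def start_ray_with_retries(cmd, max_attempts=3, backoff_s=5):
    """Retry `ray start` when it fails with the known port-conflict error."""
    for attempt in range(1, max_attempts + 1):
        proc = subprocess.run(cmd, capture_output=True, text=True)
        if proc.returncode == 0:
            return
        output = proc.stdout + proc.stderr
        conflict = "is trying to use a port number" in output
        if not conflict or attempt == max_attempts:
            raise RuntimeError(f"ray start failed (attempt {attempt}):\n{proc.stderr}")
        # Clean up any partially started node before retrying.
        subprocess.run(["ray", "stop", "--force"], check=False)
        time.sleep(backoff_s)


# Hypothetical head-node command; adapt the flags to your deployment.
start_ray_with_retries(["ray", "start", "--head", "--port=6379"])
```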

danielezhu commented 8 months ago

I'm running into the same issue, where dashboard_agent_http is trying to use the same port as dashboard_agent_grpc.

ValueError: Ray component dashboard_agent_http is trying to use a port number 52365 that is used by other components.

Port information: {'gcs': 'random', 'object_manager': 'random', 'node_manager': 'random', 'gcs_server': 9339, 'client_server': 'random', 'dashboard': 8265, 'dashboard_agent_grpc': 52365, 'dashboard_agent_http': 52365, 'dashboard_grpc': 'random', 'runtime_env_agent': 48745, 'metrics_export': 45339, 'redis_shards': 'random', 'worker_ports': '9998 ports from 10002 to 19999'}

@DmitriGekhtman for your workaround, did you go with the "retry approach" (i.e. re-run ray start --head if you encounter the above ValueError) or were you able to manually configure the ports used by dashboard_agent_http and dashboard_agent_grpc?

DmitriGekhtman commented 8 months ago

We haven't fixed the issue yet; for us, it's rare enough to sit in the backlog for a while (plus, most of our jobs are wrapped in global retries, i.e. we'll typically create a distinct set of Ray pods automatically in case of startup failure).

When we get around to fixing it, we'll aim to specify all of the ports (since we're not totally sure whether retrying would have side effects).

yc2984 commented 8 months ago

This wasn't fixed on master. It happens because, roughly once in 100 times, the ports randomly selected for the agent and metrics export conflict. We need to avoid choosing a random port when it is already assigned to something else.

In our application, this seems to occur with more than 1% probability. It has happened multiple times in the past year, and a simple restart cannot solve it. We use Docker to deploy nodes; strangely, there is always the same port conflict on every restart.

a simple restart cannot solve it

I'm encountering the same thing; it happens when a worker group starts to scale up, while the master group works fine. What do you mean by a simple restart? Restart of what?

Zhen-Zohn-WANG commented 8 months ago

This unfortunate conflict problem is often encountered when running multiple ray programs at the same time. Is there any way to avoid it? How do I assign different ports to different programs? Thank you!

[2024-01-16, 03:29:11 CST] {logging_mixin.py:137} INFO - ValueError: Ray component dashboard_agent_http is trying to use a port number 52365 that is used by other components.
[2024-01-16, 03:29:11 CST] {logging_mixin.py:137} INFO - Port information: {'gcs': 'random', 'object_manager': 'random', 'node_manager': 'random', 'gcs_server': 60025, 'client_server': 'random', 'dashboard': 'random', 'dashboard_agent_grpc': 52365, 'dashboard_agent_http': 52365, 'dashboard_grpc': 'random', 'metrics_export': 51216, 'redis_shards': 'random', 'worker_ports': 'random'}
[2024-01-16, 03:29:11 CST] {logging_mixin.py:137} INFO - If you allocate ports, please make sure the same port is not used by multiple components.

rkooo567 commented 8 months ago

I strongly recommend setting every port manually when you deploy Ray (https://docs.ray.io/en/master/ray-core/configure.html#ports-configurations) to avoid port conflicts.
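For reference, here is a sketch of a head-node start with the ports pinned explicitly. The port values are arbitrary examples, --dashboard-agent-listen-port is an assumed flag name for dashboard_agent_http, and the exact set of flags depends on your Ray version, so double-check against `ray start --help` and the docs page above.

```python
import subprocess

# Sketch: pin every configurable port on the head node so nothing is chosen
# at random. Port values are arbitrary examples; verify the flag list for
# your Ray version against the ports-configuration docs linked above.
cmd = [
    "ray", "start", "--head",
    "--port=6379",                           # GCS
    "--object-manager-port=8076",
    "--node-manager-port=8077",
    "--dashboard-port=8265",
    "--dashboard-agent-grpc-port=8078",
    "--dashboard-agent-listen-port=52365",   # dashboard_agent_http (assumed flag name)
    "--metrics-export-port=8080",
    "--ray-client-server-port=10001",
    "--min-worker-port=10002",
    "--max-worker-port=10999",
]
subprocess.run(cmd, check=True)
```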

tpatpa commented 7 months ago

I strongly recommend setting every port manually when you deploy Ray (https://docs.ray.io/en/master/ray-core/configure.html#ports-configurations) to avoid port conflicts.

Thank you @rkooo567 !

Although I am specifying ports in my ray start command, I still encounter random ports being used with this example: https://docs.ray.io/en/releases-2.9.0/train/examples/lightning/dolly_lightning_fsdp_finetuning.html

Why is Ray Train using random ports between workers...? Am I missing something in my ray start command...?

Python version: 3.10.13
Ray version: 2.9.0

Head node ray start:

sudo /opt/conda/envs/rayenv/bin/ray start --head --disable-usage-stats \
--node-manager-port=6380 \ 
--object-manager-port=6381 \ 
--runtime-env-agent-port=6382 \ 
--dashboard-agent-grpc-port=6383 \ 
--metrics-export-port=6384 \ 
--min-worker-port=10010 \ 
--max-worker-port=11010 \ 
--redis-shard-ports=8266  \ 
--dashboard-grpc-port=8267

Worker nodes ray start:

sudo /opt/conda/envs/rayenv/bin/ray start --address=instance-11-head:6379 \ 
--node-manager-port=6380 \ 
--object-manager-port=6381 \ 
--runtime-env-agent-port=6382 \ 
--dashboard-agent-grpc-port=6383 \ 
--metrics-export-port=6384 \ 
--min-worker-port=10010 \ 
--max-worker-port=11010

danielezhu commented 4 months ago

@tpatpa can you clarify how you know that Ray is still using random ports? Did you get an error message that looks something like this?

ValueError: Ray component dashboard_agent_http is trying to use a port number 52365 that is used by other components.

Port information: {'gcs': 'random', 'object_manager': 'random', 'node_manager': 'random', 'gcs_server': 9339, 'client_server': 'random', 'dashboard': 8265, 'dashboard_agent_grpc': 52365, 'dashboard_agent_http': 52365, 'dashboard_grpc': 'random', 'runtime_env_agent': 48745, 'metrics_export': 45339, 'redis_shards': 'random', 'worker_ports': '9998 ports from 10002 to 19999'}

Are you saying that the port numbers that you configured for the worker nodes when running ray start are getting ignored?