Closed danielezhu closed 2 weeks ago
@danielezhu, nice investigation. Are you willing to create a PR to fix it?
Sure!
@danielezhu is there a PR out for this?
Hi, sorry for the delay. I got busy and lost track of this. I haven't contributed to Ray before, so I'll have to do a fair amount of setup before raising the PR. Given that the PR basically amounts to switching a couple of lines of code, could someone else take this up? I unfortunately have very low bandwidth at the moment. If absolutely necessary, I can take it up, but I cannot guarantee any specific timelines, unfortunately.
To make it more future-proof, I would also adjust the port assignment code to check for a port conflict before returning, just to be safe. The code swap is still necessary, though, so that the dashboard agent takes the default port whenever it can.
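A rough sketch of that conflict check, in case it helps whoever picks this up. This is not Ray's actual code: `pick_port` and its `candidates` argument are hypothetical names, and the only assumption is that the caller tracks the ports it has already handed out in a set.

```python
def pick_port(default_port, used_ports, candidates):
    """Return default_port unless it is already taken, else the first free candidate.

    Hypothetical helper for illustration; not part of Ray.
    """
    # Honor the default only when no other component has claimed it.
    if default_port and default_port not in used_ports:
        return default_port
    # Otherwise fall back to any candidate that is still free.
    for port in candidates:
        if port not in used_ports:
            return port
    raise RuntimeError("no free port among candidates")


# 52365 was already handed to another component, so the default is skipped.
already_used = {52365}
fallback_port = pick_port(52365, already_used, range(52360, 52370))

# With nothing assigned yet, the default is returned unchanged.
default_kept = pick_port(52365, set(), range(52360, 52370))
```

With a check like this the assignment order stops mattering for correctness, although the default port can still be lost to an earlier component unless the swap is also made.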
Can be closed as #47437 is merged.
What happened + What you expected to happen
The Bug
Background
When running `ray start`, I occasionally run into the following error:

This error comes from this code in the `update_pre_selected_port` method.

We see from the logs that `dashboard_agent_grpc` and `dashboard_agent_http` are being assigned the same port number. This should never happen in theory, since prior to calling `update_pre_selected_port` in `node.py`, we call `self._get_cached_port` on a bunch of ports, including `"metrics_agent_port"` (which corresponds to `"dashboard_agent_grpc"`) and `"dashboard_agent_listen_port"` (which corresponds to `"dashboard_agent_http"`). `self._get_cached_port` should assign an unused port number via `self._get_unused_port`, which takes in a set of currently-used port numbers to ensure no duplicates are chosen.

The erroneous code
A duplicate value is chosen despite the call to `self._get_unused_port` for two reasons: the `RayParams` initializer has a default value for the parameter `dashboard_agent_listen_port`, which gets used when we call `self._get_cached_port`, and we call `self._get_cached_port` on `"metrics_agent_port"` before calling it on `"dashboard_agent_listen_port"`.

Because we call `self._get_cached_port` on `"metrics_agent_port"` first, it's possible to choose `52365` (which is the default value for `dashboard_agent_listen_port`), since at that moment in time, no other component is using this port number. When we later call `self._get_cached_port` on `"dashboard_agent_listen_port"`, since we provide a `default_port` argument that isn't `0` or `None`, we will use that value (which again, is `52365` by default) instead of running the logic for obtaining an unused port number.

I believe that the fix should be quite simple: call `self._get_cached_port` on `dashboard_agent_listen_port` first, before the other 3 components. The other 3 don't have non-null default values for their port number, so they will all get non-duplicate port numbers through `_get_unused_port`.

Expected Behavior
I should be able to call `ray start` without explicitly configuring the `dashboard_agent_grpc` and `dashboard_agent_http` parameters, and not run into issues with duplicate port numbers getting assigned by Ray.

Useful information
Here are the logs once again:
I believe that all of the information I presented in "The Bug" should suffice.
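To make the ordering problem from "The Bug" concrete, here is a small self-contained model. It is not Ray's actual code: `get_unused_port`, `get_cached_port`, and `assign_ports` are hypothetical stand-ins, and the candidate range is shrunk to ten ports (an assumption) so that the collision shows up quickly. Only the default `52365` and the two port names come from this issue.

```python
import random

DEFAULT_LISTEN_PORT = 52365  # default for dashboard_agent_listen_port, per this issue


def get_unused_port(used_ports, rng, low=52360, high=52370):
    """Pick a random port in [low, high) that is not in used_ports.

    The narrow range is an assumption made only to keep the demo short.
    """
    candidates = [p for p in range(low, high) if p not in used_ports]
    return rng.choice(candidates)


def get_cached_port(name, assigned, rng, default_port=None):
    """Simplified model: a non-zero default is used as-is, with no conflict check."""
    if default_port:
        port = default_port
    else:
        port = get_unused_port(set(assigned.values()), rng)
    assigned[name] = port
    return port


def assign_ports(order, rng):
    """Assign ports in the given order and return the name -> port mapping."""
    defaults = {"dashboard_agent_listen_port": DEFAULT_LISTEN_PORT}
    assigned = {}
    for name in order:
        get_cached_port(name, assigned, rng, defaults.get(name))
    return assigned


def has_duplicates(assigned):
    """True if two components ended up with the same port."""
    return len(set(assigned.values())) != len(assigned)


# Order described in the issue: the port with no default is assigned first.
buggy_order = ["metrics_agent_port", "dashboard_agent_listen_port"]
# Proposed order: the port with a fixed default is assigned first.
fixed_order = ["dashboard_agent_listen_port", "metrics_agent_port"]

rng = random.Random(0)
buggy_collisions = sum(
    has_duplicates(assign_ports(buggy_order, rng)) for _ in range(1000)
)
fixed_collisions = sum(
    has_duplicates(assign_ports(fixed_order, rng)) for _ in range(1000)
)
print("collisions with buggy order:", buggy_collisions)
print("collisions with fixed order:", fixed_collisions)
```

In this model the buggy order collides whenever the random pick happens to be `52365`, while the fixed order cannot collide because `52365` is already in the used set by the time a random port is drawn.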
Versions / Dependencies
Ray: 2.9.1
Python: 3.10.14
OS: Whatever AWS EC2 instances use (generally, Amazon Linux)

Reproduction script
Run this enough times, and you'll eventually hit the error by chance.
Issue Severity
Low: It annoys or frustrates me.