ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0
33.3k stars 5.63k forks

[Core] Same port is being assigned to different components #45484

Closed danielezhu closed 2 weeks ago

danielezhu commented 4 months ago

What happened + What you expected to happen

The Bug

Background

When running ray start, I occasionally run into the following error:

ValueError: Ray component dashboard_agent_http is trying to use a port number 52365 that is used by other components.

Port information: {'gcs': 'random', 'object_manager': 'random', 'node_manager': 'random', 'gcs_server': 9339, 'client_server': 'random', 'dashboard': 8265, 'dashboard_agent_grpc': 52365, 'dashboard_agent_http': 52365, 'dashboard_grpc': 'random', 'runtime_env_agent': 48745, 'metrics_export': 45339, 'redis_shards': 'random', 'worker_ports': '9998 ports from 10002 to 19999'}

If you allocate ports, please make sure the same port is not used by multiple components.

This error is raised in the update_pre_selected_port method.

The logs show that dashboard_agent_grpc and dashboard_agent_http are being assigned the same port number. In theory this should never happen: before calling update_pre_selected_port in node.py, we call self._get_cached_port for several ports, including "metrics_agent_port" (which corresponds to "dashboard_agent_grpc") and "dashboard_agent_listen_port" (which corresponds to "dashboard_agent_http").
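To make the collision easy to spot in a long port dump, here is a minimal sketch of a helper that groups components by their concrete port number. The function name find_duplicate_ports is hypothetical and not part of Ray; the dictionary is copied from the error message above.

```python
from collections import defaultdict


def find_duplicate_ports(port_info):
    """Return {port: [components]} for every integer port claimed more than once.

    Hypothetical helper for reading the log output; not a Ray API.
    """
    by_port = defaultdict(list)
    for component, port in port_info.items():
        # Entries such as 'random' or a range description are not concrete
        # ports, so only integer values are compared.
        if isinstance(port, int):
            by_port[port].append(component)
    return {port: names for port, names in by_port.items() if len(names) > 1}


# Port information copied from the ValueError message in this issue.
port_info = {
    "gcs": "random",
    "object_manager": "random",
    "node_manager": "random",
    "gcs_server": 9339,
    "client_server": "random",
    "dashboard": 8265,
    "dashboard_agent_grpc": 52365,
    "dashboard_agent_http": 52365,
    "dashboard_grpc": "random",
    "runtime_env_agent": 48745,
    "metrics_export": 45339,
    "redis_shards": "random",
    "worker_ports": "9998 ports from 10002 to 19999",
}

print(find_duplicate_ports(port_info))
# → {52365: ['dashboard_agent_grpc', 'dashboard_agent_http']}
```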

self._get_cached_port should assign an unused port number via self._get_unused_port, which takes in a set of currently-used port numbers to ensure no duplicates are chosen.

The erroneous code

A duplicate value is chosen despite the call to self._get_unused_port for two reasons. First, the RayParams initializer has a default value for the dashboard_agent_listen_port parameter, and that default is used when we call self._get_cached_port. Second, we call self._get_cached_port on "metrics_agent_port" before calling it on "dashboard_agent_listen_port".

Because we call self._get_cached_port on "metrics_agent_port" first, it can be assigned 52365 (the default value for dashboard_agent_listen_port), since at that moment no other component is using this port number. When we later call self._get_cached_port on "dashboard_agent_listen_port", we provide a default_port argument that isn't 0 or None, so that value (again, 52365 by default) is used as is, and the logic for obtaining an unused port number never runs.

I believe the fix is quite simple: call self._get_cached_port on dashboard_agent_listen_port first, before the other 3 components. The other 3 don't have non-null default values for their port numbers, so they will all get non-duplicate port numbers through _get_unused_port.
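The ordering problem can be illustrated with a small, simplified model. This is only a sketch under assumptions, not Ray's actual code: assign_ports and the candidates list are hypothetical, the random picker is replaced by a fixed candidate sequence whose first pick happens to be 52365, and only the two components named above are modeled.

```python
DEFAULT_LISTEN_PORT = 52365  # default for dashboard_agent_listen_port, per this issue


def assign_ports(order, defaults, candidates):
    """Assign ports to components in the given order (simplified model).

    A non-zero, non-None default is used as is, with no conflict check.
    Otherwise the first candidate not already assigned is chosen, standing in
    for the random unused-port picker.
    """
    assigned = {}
    for name in order:
        default = defaults.get(name)
        if default:
            assigned[name] = default
        else:
            used = set(assigned.values())
            assigned[name] = next(p for p in candidates if p not in used)
    return assigned


defaults = {"dashboard_agent_listen_port": DEFAULT_LISTEN_PORT}
# Unlucky case: the "random" picker tries 52365 first.
candidates = [52365, 40001, 40002]

# Current order: the component without a default goes first and takes 52365.
buggy = assign_ports(
    ["metrics_agent_port", "dashboard_agent_listen_port"], defaults, candidates
)
# Proposed order: the component with the default goes first.
fixed = assign_ports(
    ["dashboard_agent_listen_port", "metrics_agent_port"], defaults, candidates
)

print(buggy)  # both components end up on 52365
print(fixed)  # metrics_agent_port is steered away from 52365
```

With the default-bearing component assigned first, 52365 is already in the used set by the time the picker runs, so the picker skips it.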

Expected Behavior

I should be able to call ray start without explicitly configuring the dashboard_agent_grpc and dashboard_agent_http parameters, and not run into issues with duplicate port numbers getting assigned by Ray.

Useful information

Here are the logs once again:

ValueError: Ray component dashboard_agent_http is trying to use a port number 52365 that is used by other components.

Port information: {'gcs': 'random', 'object_manager': 'random', 'node_manager': 'random', 'gcs_server': 9339, 'client_server': 'random', 'dashboard': 8265, 'dashboard_agent_grpc': 52365, 'dashboard_agent_http': 52365, 'dashboard_grpc': 'random', 'runtime_env_agent': 48745, 'metrics_export': 45339, 'redis_shards': 'random', 'worker_ports': '9998 ports from 10002 to 19999'}

If you allocate ports, please make sure the same port is not used by multiple components.

I believe that all of the information I presented in "The Bug" should suffice.

Versions / Dependencies

Ray: 2.9.1
Python: 3.10.14
OS: Whatever AWS EC2 instances use (generally, Amazon Linux)

Reproduction script

Run this enough times, and you'll eventually hit the error by chance.

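import subprocess

import ray

# Start a head node; the port collision happens intermittently at this step.
output = subprocess.run(
    [
        "ray",
        "start",
        "--head",
        "-vvv",
        "--port",
        "9339",
        # We don't need a dashboard
        "--include-dashboard",
        "false",
    ],
    stdout=subprocess.PIPE,
)

# Connect to the cluster that was just started.
ray.init(address="auto", include_dashboard=False)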

Issue Severity

Low: It annoys or frustrates me.

jjyao commented 4 months ago

@danielezhu, nice investigation. Are you willing to create a PR to fix it?

danielezhu commented 4 months ago

> @danielezhu, nice investigation. Are you willing to create a PR to fix it?

Sure!

anyscalesam commented 3 months ago

@danielezhu is there a PR out for this?

danielezhu commented 3 months ago

> @danielezhu is there a PR out for this?

Hi, sorry for the delay. I got busy and lost track of this. I haven't contributed to Ray before, so I'll have to do a fair amount of setup before raising the PR. Given that the PR basically amounts to switching a couple of lines of code, could someone else take this up? I unfortunately have very low bandwidth at the moment. If absolutely necessary, I can take it up, but I cannot guarantee any specific timelines, unfortunately.

Superskyyy commented 3 months ago

To make it more future-proof, I would also adjust the port assignment code to check for a port conflict before returning, just to be safe. The code swap is still necessary, though, so that the dashboard agent takes its default port whenever it can.
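One possible reading of that check, as a rough sketch only: get_port_checked and its parameters are hypothetical and not taken from Ray or from the merged PR. The chosen port is verified against the ports already assigned before it is returned, and the error names both components.

```python
def get_port_checked(name, default_port, assigned, pick_unused):
    """Choose a port for `name`, then verify it before returning.

    Hypothetical helper: `assigned` maps component name -> port, and
    `pick_unused(used)` stands in for the random unused-port picker.
    """
    used = set(assigned.values())
    port = default_port if default_port else pick_unused(used)
    if port in used:
        owners = [n for n, p in assigned.items() if p == port]
        raise ValueError(
            f"{name} wants port {port}, which is already assigned to {owners}"
        )
    assigned[name] = port
    return port


assigned = {}
# Unlucky random pick for the first component.
get_port_checked("metrics_agent_port", None, assigned, lambda used: 52365)
try:
    # The default now collides, and the check reports it at assignment time.
    get_port_checked("dashboard_agent_listen_port", 52365, assigned, lambda used: 40001)
except ValueError as err:
    print(err)
```

A check like this only reports the conflict earlier and more precisely; on its own it does not prevent the collision, which is why the reordering is still needed.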

Superskyyy commented 3 weeks ago

Can be closed as #47437 is merged.