ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Dashboard] [Bug] Dashboard fails to render #23788

Closed gcesars closed 1 year ago

gcesars commented 2 years ago

What happened + What you expected to happen

The dashboard does not render correctly; it shows only a blank screen and a console error.

react-dom.production.min.js:209 TypeError: Cannot read properties of undefined (reading '/')
    at NodeInfo.tsx:220:42
    at Array.filter (<anonymous>)
    at _n (NodeInfo.tsx:220:26)
    at Xi (react-dom.production.min.js:153:146)
    at Go (react-dom.production.min.js:175:309)
    at ys (react-dom.production.min.js:263:406)
    at El (react-dom.production.min.js:246:265)
    at fl (react-dom.production.min.js:246:194)
    at sl (react-dom.production.min.js:239:172)
    at react-dom.production.min.js:123:115

Versions / Dependencies

ray: 1.10.0 python: 3.9.12 node: 17.8.0 kernel: 5.10.93 os: Debian Bullseye

Reproduction script

I am not doing anything special, just running the command below in the container. Running the same command on my Mac M1 works as expected.

ray start --head --num-cpus ${NUM_CPUS} --num-gpus ${NUM_GPUS} --dashboard-host 0.0.0.0 --verbose --redis-password ${REDIS_PASSWORD} --block \
    --log-color false --log-style record
gcesars commented 2 years ago

As far as I know, only the dashboard has an issue. I can connect to the node with my other machines and submit jobs normally.

gcesars commented 2 years ago

Tested on Windows 11 and in Docker on the same machine. On Windows it works as expected; in Docker the dashboard shows the same unexpected behaviour.

skabbit commented 2 years ago

Same here, but using the Ray Cluster Launcher. I suspect the problem is in the Ray Docker image used for the Ray Operator. The image was updated recently, and the problem seems to have appeared right after one of the updates.

gcesars commented 2 years ago

I created my own image from a base Debian image. It looks like an issue with the dashboard being hosted in a virtualized environment, not with the image itself. I will check whether a simple demo React app shows similar behaviour if time allows.

gcesars commented 2 years ago

Another update. This is the stack trace in dashboard.log:

Traceback (most recent call last):
  File "/home/USERNAMEREDACTED/miniconda/lib/python3.9/site-packages/ray/dashboard/modules/node/node_head.py", line 235, in _update_node_stats
    reply = await stub.GetNodeStats(
  File "/home/USERNAMEREDACTED/miniconda/lib/python3.9/site-packages/grpc/aio/_call.py", line 290, in __await__
    raise _create_rpc_error(self._cython_call._initial_metadata,
grpc.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
        status = StatusCode.UNAVAILABLE
        details = "failed to connect to all addresses"
        debug_error_string = "{"created":"@1649434759.726616703","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3134,"referenced_errors":[{"created":"@1649434759.726613713","description":"failed to connect to all addresses","file":"src/core/lib/transport/error_utils.cc","file_line":163,"grpc_status":14}]}"
clarkzinzow commented 2 years ago

@giucesar Have you tried using one of the nightly wheels, to see if this issue still comes up on latest master?

gcesars commented 2 years ago

It does not even start.

Usage stats collection will be enabled by default in the next release. See https://github.com/ray-project/ray/issues/20857 for more details.

2022-04-08 17:00:32,442 INFO services.py:1460 -- View the Ray dashboard at http://172.17.0.2:8265

Traceback (most recent call last):
  File "/home/USERNAMEREDACTED/miniconda//bin/ray", line 8, in <module>
    sys.exit(main())
  File "/home/USERNAMEREDACTED/miniconda/lib/python3.9/site-packages/ray/scripts/scripts.py", line 2269, in main
    return cli()
  File "/home/USERNAMEREDACTED/miniconda/lib/python3.9/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/home/USERNAMEREDACTED/miniconda/lib/python3.9/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/home/USERNAMEREDACTED/miniconda/lib/python3.9/site-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/USERNAMEREDACTED/miniconda/lib/python3.9/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/USERNAMEREDACTED/miniconda/lib/python3.9/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/home/USERNAMEREDACTED/miniconda/lib/python3.9/site-packages/ray/autoscaler/_private/cli_logger.py", line 808, in wrapper
    return f(*args, **kwargs)
  File "/home/USERNAMEREDACTED/miniconda/lib/python3.9/site-packages/ray/scripts/scripts.py", line 719, in start
    node = ray.node.Node(
  File "/home/USERNAMEREDACTED/miniconda/lib/python3.9/site-packages/ray/node.py", line 301, in __init__
    self.start_ray_processes()
  File "/home/USERNAMEREDACTED/miniconda/lib/python3.9/site-packages/ray/node.py", line 1130, in start_ray_processes
    resource_spec = self.get_resource_spec()
  File "/home/USERNAMEREDACTED/miniconda/lib/python3.9/site-packages/ray/node.py", line 472, in get_resource_spec
    self._resource_spec = ResourceSpec(
  File "/home/USERNAMEREDACTED/miniconda/lib/python3.9/site-packages/ray/_private/resource_spec.py", line 197, in resolve
    system_memory = ray._private.utils.get_system_memory()
  File "/home/USERNAMEREDACTED/miniconda/lib/python3.9/site-packages/ray/_private/utils.py", line 436, in get_system_memory
    docker_limit = int(f.read())
ValueError: invalid literal for int() with base 10: 'max\n'

2022-04-08 17:00:26,348 INFO scripts.py:697 -- Local node IP: 172.17.0.2
clarkzinzow commented 2 years ago

Ah, I've seen this issue when running Ray in Colab. It appears to be a bug in our memory-limit utility for cgroups v2, which is somehow untested: on cgroups v2 the memory-limit file contains the literal string max when no limit is set, which int() cannot parse. This can be fixed by monkey-patching that utility before calling ray.init():

import psutil
import ray

# Bypass the broken cgroup memory-limit detection and report the host's total memory instead.
ray._private.utils.get_system_memory = lambda: psutil.virtual_memory().total
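
For context, the 'max\n' in the traceback comes from the cgroups v2 file /sys/fs/cgroup/memory.max, which holds the literal string max when the container has no memory limit. A minimal sketch of a cgroups-aware fallback could look like the following; this is only an illustration under those assumptions, not the actual Ray implementation, and the helper name is made up:

import psutil

def get_container_memory_limit():
    # cgroups v2 exposes a single unified file; it contains "max" when unlimited.
    # cgroups v1 exposes a per-controller file with a plain integer.
    candidate_files = [
        "/sys/fs/cgroup/memory.max",
        "/sys/fs/cgroup/memory/memory.limit_in_bytes",
    ]
    for path in candidate_files:
        try:
            with open(path) as f:
                raw = f.read().strip()
        except FileNotFoundError:
            continue
        if raw.isdigit():
            return int(raw)
        # "max" (or any other non-integer sentinel) means no container limit was set.
        break
    # Fall back to the host's total memory as reported by psutil.
    return psutil.virtual_memory().total
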
gcesars commented 2 years ago

I understand what you said; my noobness just has no idea how to do it :)

clarkzinzow commented 2 years ago

Ah, I see that you're starting your Ray cluster from the CLI; that makes this monkey-patch a bit more difficult... One option would be to use the auto-import feature of Python's site machinery: create a sitecustomize.py module in your site-packages directory (e.g. ~/.local/lib/python3.7/site-packages) that does the monkey-patching at import time; it will then be imported before your script runs.

# sitecustomize.py
import os

if "RAY_MONKEY_PATCH_MEMORY_LIMIT" in os.environ:
    import psutil
    import ray

    # Replace the broken cgroup memory-limit detection with psutil's host total.
    ray._private.utils.get_system_memory = lambda: psutil.virtual_memory().total

Then when you're starting Ray from the CLI, set that environment variable:

RAY_MONKEY_PATCH_MEMORY_LIMIT=1 ray start --head --num-cpus ${NUM_CPUS} --num-gpus ${NUM_GPUS} --dashboard-host 0.0.0.0 --verbose --redis-password ${REDIS_PASSWORD} --block \
    --log-color false --log-style record

I haven't tried this out, but I think that something like this should work.

gcesars commented 2 years ago

Implemented the patch (thanks for the tip). Now a new error happens 🤦🏻‍♂️ It looks like the ray-2.0.0.dev0-cp39-cp39-manylinux2014_x86_64.whl build will not help us identify the problem, since it has some other issues.

2022-04-08 18:42:06,270 INFO scripts.py:697 -- Local node IP: 172.17.0.2
2022-04-08 18:42:12,917 SUCC scripts.py:739 -- --------------------
2022-04-08 18:42:12,918 SUCC scripts.py:740 -- Ray runtime started.
2022-04-08 18:42:12,920 SUCC scripts.py:741 -- --------------------
2022-04-08 18:42:12,920 INFO scripts.py:743 -- Next steps
2022-04-08 18:42:12,920 INFO scripts.py:744 -- To connect to this Ray runtime from another node, run
2022-04-08 18:42:12,921 INFO scripts.py:747 --   ray start --address='172.17.0.2:6379'
2022-04-08 18:42:12,923 INFO scripts.py:752 -- Alternatively, use the following Python code:
2022-04-08 18:42:12,924 INFO scripts.py:754 -- import ray
2022-04-08 18:42:12,924 INFO scripts.py:758 -- ray.init(address='auto')
2022-04-08 18:42:12,924 INFO scripts.py:770 -- To connect to this Ray runtime from outside of the cluster, for example to
2022-04-08 18:42:12,924 INFO scripts.py:774 -- connect to a remote cluster from your laptop directly, use the following
2022-04-08 18:42:12,925 INFO scripts.py:778 -- Python code:
2022-04-08 18:42:12,925 INFO scripts.py:780 -- import ray
2022-04-08 18:42:12,928 INFO scripts.py:781 -- ray.init(address='ray://<head_node_ip_address>:10001')
2022-04-08 18:42:12,929 INFO scripts.py:790 -- If connection fails, check your firewall settings and network configuration.
2022-04-08 18:42:12,929 INFO scripts.py:798 -- To terminate the Ray runtime, run
2022-04-08 18:42:12,929 INFO scripts.py:799 --   ray stop
2022-04-08 18:42:12,929 INFO scripts.py:874 -- --block
2022-04-08 18:42:12,930 INFO scripts.py:875 -- This command will now block until terminated by a signal.
2022-04-08 18:42:12,930 INFO scripts.py:878 -- Running subprocesses are monitored and a message will be printed if any of them terminate unexpectedly.
2022-04-08 18:42:15,938 ERR scripts.py:889 -- Some Ray subprcesses exited unexpectedly:
2022-04-08 18:42:15,939 ERR scripts.py:893 -- raylet [exit code=1]
2022-04-08 18:42:15,939 ERR scripts.py:901 -- Remaining processes will be killed.
clarkzinzow commented 2 years ago

Could you check the raylet error logs? These should be located at /tmp/ray/session_latest/logs/raylet.err, so cat /tmp/ray/session_latest/logs/raylet.err should suffice.

gcesars commented 2 years ago
/home/USERNAMEREDACTED/miniconda/lib/python3.9/site-packages/ray/dashboard/agent.py:163: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
  if LooseVersion(aiohttp.__version__) < LooseVersion("4.0.0"):
/home/USERNAMEREDACTED/miniconda/lib/python3.9/site-packages/ray/dashboard/agent.py:163: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
  if LooseVersion(aiohttp.__version__) < LooseVersion("4.0.0"):
/home/USERNAMEREDACTED/miniconda/lib/python3.9/site-packages/ray/dashboard/agent.py:163: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
  if LooseVersion(aiohttp.__version__) < LooseVersion("4.0.0"):
/home/USERNAMEREDACTED/miniconda/lib/python3.9/site-packages/ray/dashboard/agent.py:163: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
  if LooseVersion(aiohttp.__version__) < LooseVersion("4.0.0"):
/home/USERNAMEREDACTED/miniconda/lib/python3.9/site-packages/ray/dashboard/agent.py:163: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
  if LooseVersion(aiohttp.__version__) < LooseVersion("4.0.0"):
/home/USERNAMEREDACTED/miniconda/lib/python3.9/site-packages/ray/dashboard/agent.py:163: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
  if LooseVersion(aiohttp.__version__) < LooseVersion("4.0.0"):
Traceback (most recent call last):
  File "/home/USERNAMEREDACTED/miniconda/lib/python3.9/site-packages/ray/dashboard/agent.py", line 391, in <module>
    loop.run_until_complete(agent.run())
  File "/home/USERNAMEREDACTED/miniconda/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
    return future.result()
  File "/home/USERNAMEREDACTED/miniconda/lib/python3.9/site-packages/ray/dashboard/agent.py", line 178, in run
    modules = self._load_modules()
  File "/home/USERNAMEREDACTED/miniconda/lib/python3.9/site-packages/ray/dashboard/agent.py", line 120, in _load_modules
    c = cls(self)
  File "/home/USERNAMEREDACTED/miniconda/lib/python3.9/site-packages/ray/dashboard/modules/reporter/reporter_agent.py", line 161, in __init__
    self._metrics_agent = MetricsAgent(
  File "/home/USERNAMEREDACTED/miniconda/lib/python3.9/site-packages/ray/_private/metrics_agent.py", line 75, in __init__
    prometheus_exporter.new_stats_exporter(
  File "/home/USERNAMEREDACTED/miniconda/lib/python3.9/site-packages/ray/_private/prometheus_exporter.py", line 332, in new_stats_exporter
    exporter = PrometheusStatsExporter(
  File "/home/USERNAMEREDACTED/miniconda/lib/python3.9/site-packages/ray/_private/prometheus_exporter.py", line 265, in __init__
    self.serve_http()
  File "/home/USERNAMEREDACTED/miniconda/lib/python3.9/site-packages/ray/_private/prometheus_exporter.py", line 319, in serve_http
    start_http_server(
  File "/home/USERNAMEREDACTED/miniconda/lib/python3.9/site-packages/prometheus_client/exposition.py", line 167, in start_wsgi_server
    TmpServer.address_family, addr = _get_best_family(addr, port)
  File "/home/USERNAMEREDACTED/miniconda/lib/python3.9/site-packages/prometheus_client/exposition.py", line 156, in _get_best_family
    infos = socket.getaddrinfo(address, port)
  File "/home/USERNAMEREDACTED/miniconda/lib/python3.9/socket.py", line 954, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -2] Name or service not known

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/USERNAMEREDACTED/miniconda/lib/python3.9/site-packages/ray/dashboard/agent.py", line 407, in <module>
    gcs_publisher = GcsPublisher(args.gcs_address)
TypeError: __init__() takes 1 positional argument but 2 were given
clarkzinzow commented 2 years ago

@giucesar Ah yeah, master was broken in the last few days by an updated unpinned dependency; you need to pip install -U prometheus_client==0.13.1.

https://github.com/ray-project/ray/issues/23799#issuecomment-1093082364

gcesars commented 2 years ago

With the pinned prometheus_client version and the monkey patch, the Ray head started, but the dashboard shows the same behaviour.

dashboard.log


2022-04-08 23:45:34,368 ERROR node_head.py:259 -- Error updating node stats of c34a1e9d2426d47584d1e8767fc381b57df6d77d668f30cabca21f07.
Traceback (most recent call last):
  File "/home/USERNAMEREDACTED/miniconda/lib/python3.9/site-packages/ray/dashboard/modules/node/node_head.py", line 250, in _update_node_stats
    reply = await stub.GetNodeStats(
  File "/home/USERNAMEREDACTED/miniconda/lib/python3.9/site-packages/grpc/aio/_call.py", line 290, in __await__
    raise _create_rpc_error(self._cython_call._initial_metadata,
grpc.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
    status = StatusCode.UNAVAILABLE
    details = "failed to connect to all addresses"
    debug_error_string = "{"created":"@1649461534.359356387","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3134,"referenced_errors":[{"created":"@1649461534.359163720","description":"failed to connect to all addresses","file":"src/core/lib/transport/error_utils.cc","file_line":163,"grpc_status":14}]}"

===========================================================

dashboard_agent.log

2022-04-08 23:45:03,449 INFO agent.py:83 -- Parent pid is 109
2022-04-08 23:45:03,464 INFO agent.py:109 -- Dashboard agent grpc address: 0.0.0.0:63137
2022-04-08 23:45:03,468 ERROR agent.py:150 -- Raylet is dead, exiting.

===========================================================

raylet.err

[2022-04-08 23:45:03,568 E 109 167] (raylet) agent_manager.cc:107: The raylet exited immediately because the Ray agent failed. The raylet fate shares with the agent. This can happen because the Ray agent was unexpectedly killed or failed. See `dashboard_agent.log` for the root cause.
gcesars commented 2 years ago

Is there any alternative way to monitor the resources? I tried ray monitor, but it requires a cluster.yaml file that I can't figure out how to create, even though I am able to start the head and workers on a few different machines.
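
Something along these lines would already cover my use case; this is just an untested sketch that relies on the standard ray.cluster_resources(), ray.available_resources(), and ray.nodes() APIs, run from a driver that can reach the head node:

import ray

# Connect to the already-running cluster from any node that can reach the head.
ray.init(address="auto")

# Cluster-wide totals and what is currently free.
print(ray.cluster_resources())
print(ray.available_resources())

# Per-node details: address, liveness, and declared resources.
for node in ray.nodes():
    print(node["NodeManagerAddress"], node["Alive"], node["Resources"])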

clarkzinzow commented 2 years ago

Have you made sure that all requisite ports are open, for the dashboard in particular? https://docs.ray.io/en/master/ray-core/configure.html#head-node
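
If it helps, here is a quick way to sanity-check reachability from the machine running the browser. This is only a rough sketch; the IP below is taken from your earlier logs as an example and should be replaced with your actual head node address:

import socket

HEAD_NODE_IP = "172.17.0.2"  # replace with your head node's address

# Common default ports: GCS/Redis, Ray Client server, dashboard.
for port in (6379, 10001, 8265):
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(2)
        result = sock.connect_ex((HEAD_NODE_IP, port))
        print(port, "open" if result == 0 else "unreachable")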

gcesars commented 2 years ago

Yes: 6379, 10001, 8265. I've deployed a remote container so many times that those ports should be in my epitaph. It connects to the dashboard and fetches the webpack file. Only when it tries to fetch the jobs/nodes/etc. does it generate the console error I mentioned in the first message, and then everything inside disappears.

clarkzinzow commented 2 years ago

Hmm, I'm not sure what the issue could be without digging further. At this point we should loop in the Experience team (owners of the dashboard) so they can triage this issue and help out.

cc @alanwguo

gcesars commented 2 years ago

It is interesting that this problem only happens when the head node is in a container. If I run it on bare metal, everything works as expected.

gcesars commented 2 years ago

Quick update:

skabbit commented 2 years ago

Yes: 6379, 10001, 8265. I've deployed a remote container so many times that those ports should be in my epitaph. It connects to the dashboard and fetches the webpack file. Only when it tries to fetch the jobs/nodes/etc. does it generate the console error I mentioned in the first message, and then everything inside disappears.

I've got exactly the same. And yes, the head/workers are containerized, and I have already tried the original image for that.

skabbit commented 2 years ago

Fixed in 1.12.0 (or in the latest Ray Docker image?). Now it just works fine.

scottsun94 commented 2 years ago

@skabbit @giucesar In Ray 2.0, we made the previously experimental dashboard the default dashboard. Not sure if you have upgraded to 2.0, but are you still running into this issue in the version you use?

skabbit commented 2 years ago

We upgraded to 2.0 and the new dashboard is just great, thanks 🥇