As far as I know, only the dashboard has an issue. I can connect to the node with my other machines and submit jobs normally.
Tested on Windows 11 and in Docker on the same machine. On Windows it works as expected; in Docker the dashboard shows the same unexpected behaviour.
Same here, but using the Ray Cluster Launcher. I suppose the problem is in the Ray Docker image used for the Ray Operator. The image was updated recently, and the problem seems to have appeared right after one of the updates.
I created my own image from base Debian. It looks like an issue with the dashboard being hosted in a virtualized environment, not with the image itself. I will see if a simple demo React app shows similar behaviour, if time allows.
Another update. This is the stack trace in dashboard.log:
Traceback (most recent call last):
File "/home/USERNAMEREDACTED/miniconda/lib/python3.9/site-packages/ray/dashboard/modules/node/node_head.py", line 235, in _update_node_stats
reply = await stub.GetNodeStats(
File "/home/USERNAMEREDACTED/miniconda/lib/python3.9/site-packages/grpc/aio/_call.py", line 290, in __await__
raise _create_rpc_error(self._cython_call._initial_metadata,
grpc.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "failed to connect to all addresses"
debug_error_string = "{"created":"@1649434759.726616703","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3134,"referenced_errors":[{"created":"@1649434759.726613713","description":"failed to connect to all addresses","file":"src/core/lib/transport/error_utils.cc","file_line":163,"grpc_status":14}]}"
@giucesar Have you tried using one of the nightly wheels, to see if this issue still comes up on latest master?
It does not even start.
Usage stats collection will be enabled by default in the next release. See https://github.com/ray-project/ray/issues/20857 for more details.
2022-04-08 17:00:32,442 INFO services.py:1460 -- View the Ray dashboard at http://172.17.0.2:8265
Traceback (most recent call last):
File "/home/USERNAMEREDACTED/miniconda//bin/ray", line 8, in <module>
sys.exit(main())
File "/home/USERNAMEREDACTED/miniconda/lib/python3.9/site-packages/ray/scripts/scripts.py", line 2269, in main
return cli()
File "/home/USERNAMEREDACTED/miniconda/lib/python3.9/site-packages/click/core.py", line 1128, in __call__
return self.main(*args, **kwargs)
File "/home/USERNAMEREDACTED/miniconda/lib/python3.9/site-packages/click/core.py", line 1053, in main
rv = self.invoke(ctx)
File "/home/USERNAMEREDACTED/miniconda/lib/python3.9/site-packages/click/core.py", line 1659, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/USERNAMEREDACTED/miniconda/lib/python3.9/site-packages/click/core.py", line 1395, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/USERNAMEREDACTED/miniconda/lib/python3.9/site-packages/click/core.py", line 754, in invoke
return __callback(*args, **kwargs)
File "/home/USERNAMEREDACTED/miniconda/lib/python3.9/site-packages/ray/autoscaler/_private/cli_logger.py", line 808, in wrapper
return f(*args, **kwargs)
File "/home/USERNAMEREDACTED/miniconda/lib/python3.9/site-packages/ray/scripts/scripts.py", line 719, in start
node = ray.node.Node(
File "/home/USERNAMEREDACTED/miniconda/lib/python3.9/site-packages/ray/node.py", line 301, in __init__
self.start_ray_processes()
File "/home/USERNAMEREDACTED/miniconda/lib/python3.9/site-packages/ray/node.py", line 1130, in start_ray_processes
resource_spec = self.get_resource_spec()
File "/home/USERNAMEREDACTED/miniconda/lib/python3.9/site-packages/ray/node.py", line 472, in get_resource_spec
self._resource_spec = ResourceSpec(
File "/home/USERNAMEREDACTED/miniconda/lib/python3.9/site-packages/ray/_private/resource_spec.py", line 197, in resolve
system_memory = ray._private.utils.get_system_memory()
File "/home/USERNAMEREDACTED/miniconda/lib/python3.9/site-packages/ray/_private/utils.py", line 436, in get_system_memory
docker_limit = int(f.read())
ValueError: invalid literal for int() with base 10: 'max\n'
2022-04-08 17:00:26,348 INFO scripts.py:697 -- Local node IP: 172.17.0.2
Ah, I've seen this issue when running Ray in Colab; it appears to be a bug in our memory limit utility for cgroups v2, which is somehow untested. This can be fixed by monkey-patching that utility before calling ray.init():
import psutil
import ray

ray._private.utils.get_system_memory = lambda: psutil.virtual_memory().total
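For context, the ValueError above is the cgroups v2 quirk in miniature: when no memory limit is set, the limit file contains the literal string max rather than a byte count. A minimal sketch of the failing read (the path shown is the standard cgroups v2 location and may differ from the exact file Ray opens):

# Sketch, not Ray's actual code: /sys/fs/cgroup/memory.max holds "max"
# when the container has no memory limit, so int() on it raises ValueError.
with open("/sys/fs/cgroup/memory.max") as f:
    raw = f.read().strip()

docker_limit = None if raw == "max" else int(raw)  # tolerant parse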
I understand what you said; my noobness just means I have no idea how to do it :)
Ah, I see that you're starting your Ray cluster from the CLI; that makes this monkey-patch a bit more difficult... One option would be using the auto-import feature of Python's site machinery: create a sitecustomize.py module in your site-packages directory (e.g. ~/.local/lib/python3.7/site-packages) that does this monkey-patching at import time; it will then be imported before your script is run.
# sitecustomize.py
import os

if "RAY_MONKEY_PATCH_MEMORY_LIMIT" in os.environ:
    import psutil
    import ray

    # Report the host's total memory instead of reading the cgroup limit file.
    ray._private.utils.get_system_memory = lambda: psutil.virtual_memory().total
Then when you're starting Ray from the CLI, set that environment variable:
RAY_MONKEY_PATCH_MEMORY_LIMIT=1 ray start --head --num-cpus ${NUM_CPUS} --num-gpus ${NUM_GPUS} --dashboard-host 0.0.0.0 --verbose --redis-password ${REDIS_PASSWORD} --block \
--log-color false --log-style record
I haven't tried this out, but I think that something like this should work.
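One hypothetical way to verify the hook took effect, before starting the cluster (assuming RAY_MONKEY_PATCH_MEMORY_LIMIT=1 is exported in the container's environment):

# If sitecustomize.py ran at interpreter startup, this prints the host's
# total memory; unpatched, it raises the ValueError from the cgroup file.
import ray

print(ray._private.utils.get_system_memory())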
Implemented the patch (thanks for the tip). Now a new error happens 🤦🏻♂️ It looks like the ray-2.0.0.dev0-cp39-cp39-manylinux2014_x86_64.whl version will not help us identify the problem, since it has some other issues.
2022-04-08 18:42:06,270 INFO scripts.py:697 -- Local node IP: 172.17.0.2
2022-04-08 18:42:12,917 SUCC scripts.py:739 -- --------------------
2022-04-08 18:42:12,918 SUCC scripts.py:740 -- Ray runtime started.
2022-04-08 18:42:12,920 SUCC scripts.py:741 -- --------------------
2022-04-08 18:42:12,920 INFO scripts.py:743 -- Next steps
2022-04-08 18:42:12,920 INFO scripts.py:744 -- To connect to this Ray runtime from another node, run
2022-04-08 18:42:12,921 INFO scripts.py:747 -- ray start --address='172.17.0.2:6379'
2022-04-08 18:42:12,923 INFO scripts.py:752 -- Alternatively, use the following Python code:
2022-04-08 18:42:12,924 INFO scripts.py:754 -- import ray
2022-04-08 18:42:12,924 INFO scripts.py:758 -- ray.init(address='auto')
2022-04-08 18:42:12,924 INFO scripts.py:770 -- To connect to this Ray runtime from outside of the cluster, for example to
2022-04-08 18:42:12,924 INFO scripts.py:774 -- connect to a remote cluster from your laptop directly, use the following
2022-04-08 18:42:12,925 INFO scripts.py:778 -- Python code:
2022-04-08 18:42:12,925 INFO scripts.py:780 -- import ray
2022-04-08 18:42:12,928 INFO scripts.py:781 -- ray.init(address='ray://<head_node_ip_address>:10001')
2022-04-08 18:42:12,929 INFO scripts.py:790 -- If connection fails, check your firewall settings and network configuration.
2022-04-08 18:42:12,929 INFO scripts.py:798 -- To terminate the Ray runtime, run
2022-04-08 18:42:12,929 INFO scripts.py:799 -- ray stop
2022-04-08 18:42:12,929 INFO scripts.py:874 -- --block
2022-04-08 18:42:12,930 INFO scripts.py:875 -- This command will now block until terminated by a signal.
2022-04-08 18:42:12,930 INFO scripts.py:878 -- Running subprocesses are monitored and a message will be printed if any of them terminate unexpectedly.
2022-04-08 18:42:15,938 ERR scripts.py:889 -- Some Ray subprcesses exited unexpectedly:
2022-04-08 18:42:15,939 ERR scripts.py:893 -- raylet [exit code=1]
2022-04-08 18:42:15,939 ERR scripts.py:901 -- Remaining processes will be killed.
Could you check the raylet error logs? These should be located at /tmp/ray/session_latest/logs/raylet.err, so cat /tmp/ray/session_latest/logs/raylet.err should suffice.
/home/USERNAMEREDACTED/miniconda/lib/python3.9/site-packages/ray/dashboard/agent.py:163: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
if LooseVersion(aiohttp.__version__) < LooseVersion("4.0.0"):
Traceback (most recent call last):
File "/home/USERNAMEREDACTED/miniconda/lib/python3.9/site-packages/ray/dashboard/agent.py", line 391, in <module>
loop.run_until_complete(agent.run())
File "/home/USERNAMEREDACTED/miniconda/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
return future.result()
File "/home/USERNAMEREDACTED/miniconda/lib/python3.9/site-packages/ray/dashboard/agent.py", line 178, in run
modules = self._load_modules()
File "/home/USERNAMEREDACTED/miniconda/lib/python3.9/site-packages/ray/dashboard/agent.py", line 120, in _load_modules
c = cls(self)
File "/home/USERNAMEREDACTED/miniconda/lib/python3.9/site-packages/ray/dashboard/modules/reporter/reporter_agent.py", line 161, in __init__
self._metrics_agent = MetricsAgent(
File "/home/USERNAMEREDACTED/miniconda/lib/python3.9/site-packages/ray/_private/metrics_agent.py", line 75, in __init__
prometheus_exporter.new_stats_exporter(
File "/home/USERNAMEREDACTED/miniconda/lib/python3.9/site-packages/ray/_private/prometheus_exporter.py", line 332, in new_stats_exporter
exporter = PrometheusStatsExporter(
File "/home/USERNAMEREDACTED/miniconda/lib/python3.9/site-packages/ray/_private/prometheus_exporter.py", line 265, in __init__
self.serve_http()
File "/home/USERNAMEREDACTED/miniconda/lib/python3.9/site-packages/ray/_private/prometheus_exporter.py", line 319, in serve_http
start_http_server(
File "/home/USERNAMEREDACTED/miniconda/lib/python3.9/site-packages/prometheus_client/exposition.py", line 167, in start_wsgi_server
TmpServer.address_family, addr = _get_best_family(addr, port)
File "/home/USERNAMEREDACTED/miniconda/lib/python3.9/site-packages/prometheus_client/exposition.py", line 156, in _get_best_family
infos = socket.getaddrinfo(address, port)
File "/home/USERNAMEREDACTED/miniconda/lib/python3.9/socket.py", line 954, in getaddrinfo
for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -2] Name or service not known
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/USERNAMEREDACTED/miniconda/lib/python3.9/site-packages/ray/dashboard/agent.py", line 407, in <module>
gcs_publisher = GcsPublisher(args.gcs_address)
TypeError: __init__() takes 1 positional argument but 2 were given
@giucesar Ah yeah, master was broken in the last few days by an updated unpinned dependency; you need to pip install -U prometheus_client==0.13.1.
https://github.com/ray-project/ray/issues/23799#issuecomment-1093082364
With the pinned prometheus_client version and the monkey patch, the Ray head started. It shows the same behaviour as before.
dashboard.log
2022-04-08 23:45:34,368 ERROR node_head.py:259 -- Error updating node stats of c34a1e9d2426d47584d1e8767fc381b57df6d77d668f30cabca21f07.
Traceback (most recent call last):
File "/home/USERNAMEREDACTED/miniconda/lib/python3.9/site-packages/ray/dashboard/modules/node/node_head.py", line 250, in _update_node_stats
reply = await stub.GetNodeStats(
File "/home/USERNAMEREDACTED/miniconda/lib/python3.9/site-packages/grpc/aio/_call.py", line 290, in __await__
raise _create_rpc_error(self._cython_call._initial_metadata,
grpc.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "failed to connect to all addresses"
debug_error_string = "{"created":"@1649461534.359356387","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3134,"referenced_errors":[{"created":"@1649461534.359163720","description":"failed to connect to all addresses","file":"src/core/lib/transport/error_utils.cc","file_line":163,"grpc_status":14}]}"
===========================================================
dashboard_agent.log
2022-04-08 23:45:03,449 INFO agent.py:83 -- Parent pid is 109
2022-04-08 23:45:03,464 INFO agent.py:109 -- Dashboard agent grpc address: 0.0.0.0:63137
2022-04-08 23:45:03,468 ERROR agent.py:150 -- Raylet is dead, exiting.
===========================================================
raylet.err
[2022-04-08 23:45:03,568 E 109 167] (raylet) agent_manager.cc:107: The raylet exited immediately because the Ray agent failed. The raylet fate shares with the agent. This can happen because the Ray agent was unexpectedly killed or failed. See `dashboard_agent.log` for the root cause.
Is there any alternative way to monitor the resources? I tried ray monitor, but it requires a cluster.yaml file that I can't figure out how to create, even though I am able to start the head and workers on a few different machines.
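As an aside, resources can also be inspected from the Python API without any cluster.yaml; a minimal sketch using the public ray.nodes() and ray.cluster_resources() calls (address="auto" assumes this runs on a node of the cluster started with ray start):

import ray

# Attach to the already-running cluster rather than starting a new one.
ray.init(address="auto")

# Per-node liveness and resources, plus the aggregate view, no dashboard needed.
for node in ray.nodes():
    print(node["NodeManagerAddress"], "alive:", node["Alive"], node["Resources"])
print("total:", ray.cluster_resources())
print("available:", ray.available_resources())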
Have you made sure that all requisite ports are open, for the dashboard in particular? https://docs.ray.io/en/master/ray-core/configure.html#head-node
Yes: 6379, 10001, 8265. I've deployed a remote container so many times that those ports should be in my epitaph. It connects to the dashboard and fetches the webpack file. Only when it tries to fetch the jobs/nodes/etc. does it generate the console log error I mentioned in the first message, and everything inside disappears.
Hmm I'm not sure what the issue could be without digging further. At this point, we should loop in the Experience team (owner of the dashboard) so they can triage this issue and help out.
cc @alanwguo
It is interesting that this problem only happens if the head node is in a container. If I run it on bare metal, everything works as expected.
Quick update:
> Yes: 6379, 10001, 8265. I've deployed a remote container so many times that those ports should be in my epitaph. It connects to the dashboard and fetches the webpack file. Only when it tries to fetch the jobs/nodes/etc. does it generate the console log error I mentioned in the first message, and everything inside disappears.
I've got exactly the same. And yes, the head/workers are containerized, and I already tried the original image.
Fixed in 1.12.0 (or in the latest Ray Docker image?). It just works fine now.
@skabbit @giucesar In ray 2.0, we made the previous experimental dashboard the default dashboard. Not sure if you have upgraded to 2.0, but are you still running into this issue in the version you use?
We upgraded to 2.0 and the new dashboard is just great, thanks 🥇
What happened + What you expected to happen
The dashboard does not render correctly; it shows only a blank screen with a console error.
Versions / Dependencies
ray: 1.10.0
python: 3.9.12
node: 17.8.0
kernel: 5.10.93
os: Debian Bullseye
Reproduction script
There is nothing special I am doing, just running the command below in the container. Running the same command on my Mac M1, it works as expected.