ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Core] Dashboard internal server error from many actor many nodes release test #42794

Open rkooo567 opened 9 months ago

rkooo567 commented 9 months ago

What happened + What you expected to happen

(DashboardTester pid=5919) 500 Server Error: Internal Server Error for url: http://10.0.27.228:8265/logical/actors
(DashboardTester pid=5919) Traceback (most recent call last):
(DashboardTester pid=5919)   File "/tmp/ray/session_2024-01-26_02-16-10_676676_77/runtime_resources/working_dir_files/s3_ray-release-automation-results_working_dirs_many_nodes_actor_test_on_v2_aws_djolpzidnk__anyscale_pkg_195bc952b6d3759e93852aac6a1e846f/distributed/dashboard_test.py", line 86, in ping
(DashboardTester pid=5919)     resp.raise_for_status()
(DashboardTester pid=5919)   File "/home/ray/anaconda3/lib/python3.9/site-packages/requests/models.py", line 1021, in raise_for_status
(DashboardTester pid=5919)     raise HTTPError(http_error_msg, response=self)
(DashboardTester pid=5919) requests.exceptions.HTTPError: 500 Server Error: Internal Server Error for url: http://10.0.27.228:8265/logical/actors
(raylet) WARNING: 8 PYTHON worker processes have been started on node: 9653ebd1108c6047b9d5a71013f90a3c592e8cf9f4cb230e645ccbc7 with address: 10.0.41.209. This could be a result of using a large number of actors, or due to tasks blocked in ray.get() calls (see https://github.com/ray-project/ray/issues/3644 for some discussion of workarounds). [repeated 135x across cluster]

https://console.anyscale-staging.com/o/anyscale-internal/jobs/prodjob_v1ydmgsb1ucvh3xgkb65c1jdyk
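
For context, the failing call is the release test's DashboardTester pinging the dashboard's `/logical/actors` endpoint and raising on a non-2xx response. Below is a minimal, hypothetical sketch of that kind of check (the real logic lives in `distributed/dashboard_test.py`; the endpoint URL is taken from the logs above, and the timeout value is an assumption):

```python
# Hypothetical sketch of the DashboardTester-style health check; not the
# actual release-test code. The dashboard address comes from the logs above.
import requests

DASHBOARD_URL = "http://10.0.27.228:8265"

def ping_actors_endpoint(timeout_s: float = 30.0) -> dict:
    """Hit the dashboard's /logical/actors endpoint and fail on non-2xx."""
    resp = requests.get(f"{DASHBOARD_URL}/logical/actors", timeout=timeout_s)
    # raise_for_status() is what turns the dashboard's 500 response into the
    # requests.exceptions.HTTPError seen in the traceback above.
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    print(ping_actors_endpoint())
```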

Versions / Dependencies

n/a

Reproduction script

master

Issue Severity

None

hongchaodeng commented 8 months ago

After digging into the cluster's logs, I found many Deadline Exceeded errors in the dashboard logs, but no obvious errors or exceptions from the GCS server or the raylet. This might be a transient network issue.
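
For reference, here is a rough sketch of how such errors can be counted on a node, assuming the usual Ray session log layout (`dashboard*.log` under `/tmp/ray/session_latest/logs`); the path and glob pattern are assumptions, not taken from this cluster:

```python
# Rough sketch: count "Deadline Exceeded" occurrences in the dashboard logs.
# Assumes the default Ray session log directory layout; adjust paths as needed.
from pathlib import Path

LOG_DIR = Path("/tmp/ray/session_latest/logs")

def count_deadline_exceeded(pattern: str = "Deadline Exceeded") -> int:
    hits = 0
    for log_file in LOG_DIR.glob("dashboard*.log"):
        with log_file.open(errors="replace") as f:
            hits += sum(1 for line in f if pattern in line)
    return hits

if __name__ == "__main__":
    print(f"Deadline Exceeded occurrences: {count_deadline_exceeded()}")
```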

The failure is also transient in release tests; there are no such failures in recent test results.

My suggestion is to downgrade this issue to P1/P2.