Open jhasm-ck opened 1 year ago
cc @richardliaw
This is more closely related to the persistent dashboard: exporting the info from GCS that powers that dashboard.
There has been related discussion: @alanwguo @gvspraveen
cc @akshay-anyscale
Any update on this feature? We spin up and tear down Ray clusters frequently and would like to be able to persist job/actor history.
cc: @anyscalesam
This seems like a valid use case. Has any thought been given to decoupling the dashboard from Ray clusters? On Kubernetes deployments, a single cluster-wide dashboard that can display information from multiple RayCluster resources would make sense.
Any update to this feature?
cc: @nikitavemuri @alanwguo
Any update to this feature?
Hi all, we've been actively developing a push-based solution for exporting state and metadata for various Ray resources. We have an alpha version of this feature for some core Ray resources, but we've observed performance regressions in task scheduling latency and throughput. We are currently investigating these issues and exploring alternative architectures that may offer better performance. We expect to provide an update on this work in Q1 next year.
Description
The above two factors together mean the Ray dashboard information and job history are of very limited use, and all the work put into improving the dashboard has limited payoff. To reach their full potential, the history and other dashboard information need to outlive the cluster itself. We do support exporting logs and metrics through popular integrations, but there is no support for exporting job, actor, and task history along with all the rich debugging information we have on the dashboard.
I suggest a mechanism for listening to events triggered by the Ray cluster, so that the listener can store the history in durable storage of its choice. Pulling information from the cluster is not reliable because clusters can go up and down. A push mechanism is more reliable and resource-efficient in this situation.
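To make the proposal concrete, here is a minimal sketch of the push pattern described above. It does not use any existing Ray API: `EventBus`, `SqliteHistoryStore`, and the event fields (`kind`, `entity_id`, `state`) are all hypothetical names chosen for illustration. The bus stands in for whatever the cluster would expose, and the store stands in for a user-chosen durable backend.

```python
import json
import sqlite3
from typing import Callable, Dict, List

Event = Dict[str, object]  # hypothetical event shape: kind, entity_id, plus free-form fields


class EventBus:
    """Hypothetical stand-in for the cluster side: pushes each event to every listener."""

    def __init__(self) -> None:
        self._listeners: List[Callable[[Event], None]] = []

    def subscribe(self, listener: Callable[[Event], None]) -> None:
        self._listeners.append(listener)

    def publish(self, event: Event) -> None:
        # Push model: the cluster calls out to listeners as state changes,
        # so nothing has to poll a cluster that may already be gone.
        for listener in self._listeners:
            listener(event)


class SqliteHistoryStore:
    """Example listener that persists events so history outlives the cluster."""

    def __init__(self, path: str = ":memory:") -> None:
        # Pass a file path instead of ":memory:" for storage that survives the process.
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS events (kind TEXT, entity_id TEXT, payload TEXT)"
        )

    def __call__(self, event: Event) -> None:
        self._conn.execute(
            "INSERT INTO events VALUES (?, ?, ?)",
            (event["kind"], event["entity_id"], json.dumps(event)),
        )
        self._conn.commit()

    def history(self, kind: str) -> List[Event]:
        # Return events of one kind in the order they were received.
        rows = self._conn.execute(
            "SELECT payload FROM events WHERE kind = ? ORDER BY rowid", (kind,)
        )
        return [json.loads(row[0]) for row in rows]


if __name__ == "__main__":
    bus = EventBus()
    store = SqliteHistoryStore()
    bus.subscribe(store)
    bus.publish({"kind": "job", "entity_id": "job-1", "state": "RUNNING"})
    bus.publish({"kind": "job", "entity_id": "job-1", "state": "SUCCEEDED"})
    print(store.history("job"))
```

In a real deployment the listener would presumably run outside the cluster and receive events over the network, but the shape stays the same: the cluster emits, and the user decides where the history is stored.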
Use case