ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0
34.06k stars 5.78k forks source link

[jobs] Support export of ray jobs and actors history to external storage #39503

Open jhasm-ck opened 1 year ago

jhasm-ck commented 1 year ago

Description

  1. The ray jobs, actors, tasks history and all the information showed on the ray dashboard is currently limited to a running ray cluster and is lost when the cluster is deleted.
  2. The ray clusters are designed to be multi-tenant or long running, but designed to be created to run a few jobs and then destroyed.

The above two factors together mean the ray dashboard information and job history is very limited in use. All the work put in to improve the dashboard has limited usefulness. To use the full potential of the history and other dashboard information, they need to have a longer life than the cluster itself. We do support exporting logs and metrics through popular integrations, but there is no support for exporting job, actor and task history along with all the rich debugging information we have on the dashboard.

I suggest a mechanism to listen to events triggered from the ray cluster, so that the listener can store the history in a more durable storage of their choice. Pulling information from the cluster is not reliable because the clusters can go up and down. A push mechanism is more reliable and resource efficient in this situation.

Use case

  1. To be able to see the history of jobs run by me or my team, independent of which ray cluster was used.
  2. Being able to persist the job information without keeping the ray cluster running.
  3. Being able to see the history of all jobs in one place, across all ray clusters.
  4. Being able to treat the kuberay clusters as lightweight resources, without loosing the history of jobs and debugging information.
  5. Being able to record the ray job information along with the rest of the model logs and artifacts.
jhasm-ck commented 1 year ago

cc @richardliaw

scottsun94 commented 1 year ago

This is more related to persistent dashboard: exporting the info from GCS that powers that dashboard.

There has been related discussion: @alanwguo @gvspraveen

shrekris-anyscale commented 1 year ago

cc @akshay-anyscale

azai91 commented 7 months ago

Any update to this feature? We spin up/teardown Ray clusters frequently and would to be able to persist jobs/actor history.

scottsun94 commented 7 months ago

cc: @anyscalesam

askulkarni2 commented 7 months ago

This seems like a valid use-case. Has any thought been given to de-coupling the dashboard from Ray clusters? On kubernetes deployments a single cluster-wide dashboard which can display information from multiple RayCluster resources would make sense.

kadisi commented 3 weeks ago

Any update to this feature?

scottsun94 commented 3 weeks ago

cc: @nikitavemuri @alanwguo

xiaoming12306 commented 2 weeks ago

Any update to this feature?

nikitavemuri commented 1 week ago

Hi all, we've been actively developing a push-based solution for exporting state and metadata for various Ray resources. We have an alpha version of this feature for some core Ray resources, but we've observed some performance regressions in task scheduling latency and throughput. We are currently investigating these issues and exploring alternative architectures that may offer better performance. Expecting to provide an update on this work in Q1 next year.