ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0
34.21k stars, 5.81k forks

[Dashboard] Decoupling dashboard and dashboard lifetime from Ray Cluster #46444

Open Superskyyy opened 4 months ago

Superskyyy commented 4 months ago

Description

With Ray starting to support the virtual cluster (vCluster) concept, and with advanced multi-cluster-per-user setups becoming more common, the Ray dashboard components should no longer be bound to a single Ray cluster's lifetime. That coupling makes multi-tenant sharing and telemetry data persistence complex to implement. In addition, the dashboard goes down together with the head node (fate-sharing), which makes it difficult to trace back what happened (and what was executing) during a major incident. @liuxsh9 @Bye-legumes @nemo9cby

Use case

Doing so would bring the following benefits:

  1. The dashboard can optionally read from a persistent history server (observability database) instead of pulling directly from a running GCS. (GCS/HA Redis writes to the persistence store.)
  2. Dashboard-side overhead can no longer accidentally bring down the head node.
  3. Users can attach their own external monitoring platforms, the same way as the job dashboard, to manage a large number of clusters.
  4. Each user gets their own dashboard, which can span multiple physical clusters or vClusters.
  5. The dashboard can still be checked even after a cluster has been preempted or shut down.
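
To make benefits 1 and 5 concrete, here is a minimal sketch of a dashboard that prefers a live cluster source but falls back to a persisted snapshot once the cluster is gone. Every name in it (`ClusterSnapshot`, `HistoryStore`, `StandaloneDashboard`, the `fetch` callables) is hypothetical and chosen for illustration only; none of it is an existing Ray API, and the real design may look quite different.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class ClusterSnapshot:
    """Hypothetical, simplified view of a cluster's state at one point in time."""
    cluster_id: str
    timestamp: float
    jobs: List[str] = field(default_factory=list)


class HistoryStore:
    """Hypothetical persistence store that outlives any single cluster.

    In the proposal the cluster side (GCS / HA Redis) would write here;
    the dashboard only reads.
    """

    def __init__(self) -> None:
        self._latest: Dict[str, ClusterSnapshot] = {}

    def write(self, snapshot: ClusterSnapshot) -> None:
        # Keep only the newest snapshot per cluster.
        current = self._latest.get(snapshot.cluster_id)
        if current is None or snapshot.timestamp >= current.timestamp:
            self._latest[snapshot.cluster_id] = snapshot

    def read(self, cluster_id: str) -> Optional[ClusterSnapshot]:
        return self._latest.get(cluster_id)


class StandaloneDashboard:
    """Dashboard process whose lifetime is independent of any head node."""

    def __init__(
        self,
        store: HistoryStore,
        live_sources: Optional[Dict[str, Callable[[], ClusterSnapshot]]] = None,
    ) -> None:
        self._store = store
        self._live_sources = live_sources or {}

    def view(self, cluster_id: str) -> Optional[ClusterSnapshot]:
        fetch = self._live_sources.get(cluster_id)
        if fetch is not None:
            try:
                return fetch()  # cluster is up: show live data
            except ConnectionError:
                pass  # head node unreachable: fall back to history
        return self._store.read(cluster_id)
```

With this split, a preempted cluster simply becomes a `ConnectionError` on the live path, and the dashboard keeps serving whatever the cluster last persisted.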

Bye-legumes commented 4 months ago

Similar issue: https://github.com/ray-project/ray/issues/45940. Maybe we can decouple it this way, so that we get persistent storage and the dashboard can still control the tasks. (attached image)

yucai commented 4 months ago

Shall we have an abstraction layer in front of the database, so that different DB solutions can be used? @anyscalesam, @Bye-legumes I heard ByteDance already has a solution; kindly share it with me if you do. Thanks a lot!
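
As a rough illustration of what such an abstraction layer could look like, here is a sketch with one small interface and two interchangeable backends. The interface name `EventStore`, its `append`/`query` methods, and the table layout are all assumptions made up for this example; they are not taken from Ray or from any existing solution mentioned in this thread.

```python
import json
import sqlite3
from abc import ABC, abstractmethod
from typing import Any, Dict, List


class EventStore(ABC):
    """Hypothetical interface the dashboard would code against."""

    @abstractmethod
    def append(self, cluster_id: str, event: Dict[str, Any]) -> None:
        """Persist one event for the given cluster."""

    @abstractmethod
    def query(self, cluster_id: str) -> List[Dict[str, Any]]:
        """Return all events for the cluster in insertion order."""


class InMemoryEventStore(EventStore):
    """Backend for tests or single-process setups."""

    def __init__(self) -> None:
        self._events: Dict[str, List[Dict[str, Any]]] = {}

    def append(self, cluster_id: str, event: Dict[str, Any]) -> None:
        self._events.setdefault(cluster_id, []).append(dict(event))

    def query(self, cluster_id: str) -> List[Dict[str, Any]]:
        return list(self._events.get(cluster_id, []))


class SqliteEventStore(EventStore):
    """Backend using the standard-library sqlite3 module."""

    def __init__(self, path: str = ":memory:") -> None:
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS events ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, "
            "cluster_id TEXT NOT NULL, "
            "payload TEXT NOT NULL)"
        )
        self._conn.commit()

    def append(self, cluster_id: str, event: Dict[str, Any]) -> None:
        # Events are stored as JSON text so the schema stays backend-neutral.
        self._conn.execute(
            "INSERT INTO events (cluster_id, payload) VALUES (?, ?)",
            (cluster_id, json.dumps(event)),
        )
        self._conn.commit()

    def query(self, cluster_id: str) -> List[Dict[str, Any]]:
        rows = self._conn.execute(
            "SELECT payload FROM events WHERE cluster_id = ? ORDER BY id",
            (cluster_id,),
        ).fetchall()
        return [json.loads(row[0]) for row in rows]
```

The dashboard would only depend on `EventStore`, so swapping SQLite for another database would mean adding one more subclass rather than touching dashboard code.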

anyscalesam commented 4 months ago

Let's grab time to chat more about this. cc @alanwguo

UPDATE: the focus is on getting the Export API working first, which is the natural prerequisite to this. A REP is in progress with @MissiontoMars @nikitavemuri.