ray-project / kuberay

A toolkit to run Ray applications on Kubernetes
Apache License 2.0

[Feature] [API Server] [RFC] Add persistence for job history using a SQL database #2114

Open han-steve opened 2 months ago

han-steve commented 2 months ago

Description

Hello KubeRay community, thanks for developing the API Server component! I'm new here, and I'd like to collect some thoughts on implementing persistent storage for the API Server. According to the API Server design doc:

we want to leave some flexibility to use database to store history data in the near future (for example, pagination, list options etc)

Right now, past Ray jobs are stored as CRDs in the Kubernetes etcd database, and the API Server queries the CRDs directly. This doesn't seem to be as scalable as a solution backed by a SQL database. I can think of two ways to implement this:

  1. Use a "persistence agent" to watch for changes in the CRDs and sync them with a database. Delete the CRDs when they reach a terminal state. Clients can directly list the jobs from the database instead of querying the CRDs. This is what Kubeflow Pipelines does (architecture diagram, codebase).
  2. Keep using the CRDs when the job is running, but as soon as the job finishes, we snapshot it and store it in the database before cleaning up the CRD.
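
To make the two options more concrete, here is a rough sketch of the database side, using SQLite from the Python standard library as a stand-in for a real SQL backend. Everything in it is hypothetical and not part of KubeRay: the `job_history` table, the `open_store`, `sync_job`, and `list_jobs` helpers, and the assumption that a job's state is read from `status.jobStatus` with `SUCCEEDED`, `FAILED`, and `STOPPED` as terminal values. The Kubernetes watch itself is left out; the sketch only shows what would happen with each job object once it is received.

```python
import json
import sqlite3

# Assumption: terminal values of a RayJob's `status.jobStatus` field.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED"}

# Hypothetical schema: one row per job, keyed by namespace and name.
SCHEMA = """
CREATE TABLE IF NOT EXISTS job_history (
    namespace TEXT NOT NULL,
    name      TEXT NOT NULL,
    status    TEXT NOT NULL,
    finished  INTEGER NOT NULL,  -- 1 once the job reached a terminal state
    manifest  TEXT NOT NULL,     -- full resource serialized as JSON
    PRIMARY KEY (namespace, name)
)
"""


def open_store(path=":memory:"):
    """Open (or create) the history database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def sync_job(conn, job):
    """Upsert one job snapshot.

    Returns True when the job is in a terminal state, meaning the caller
    could now delete the corresponding resource from Kubernetes.
    """
    meta = job["metadata"]
    status = job.get("status", {}).get("jobStatus", "PENDING")
    finished = status in TERMINAL_STATES
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO job_history VALUES (?, ?, ?, ?, ?)",
            (meta["namespace"], meta["name"], status, int(finished), json.dumps(job)),
        )
    return finished


def list_jobs(conn, namespace, limit=10, offset=0):
    """Return one page of (name, status) rows for a namespace, ordered by name."""
    rows = conn.execute(
        "SELECT name, status FROM job_history "
        "WHERE namespace = ? ORDER BY name LIMIT ? OFFSET ?",
        (namespace, limit, offset),
    )
    return rows.fetchall()
```

Under option 1, the agent would call `sync_job` on every watch event, so the database always mirrors the cluster. Under option 2, it would call `sync_job` only once, when the job finishes. In both cases the resource is deleted only after `sync_job` returns True, and `list_jobs` gives the API Server pagination without touching etcd.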

Do people think that having a database is important? If so, do we have a plan to implement this?

Use case

Support keeping track of job history without leaving a lot of CRDs in Kubernetes.

Related issues

https://github.com/ray-project/kuberay/issues/312
