ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0
34k stars 5.78k forks

[Ray Core] cannot get history job log and task summaries after head node re-create #44561

Open shadowdsp opened 7 months ago

shadowdsp commented 7 months ago

What happened + What you expected to happen

I'm using RayCluster with KubeRay, and I have configured GCS persistence with Redis, following the KubeRay GCS fault tolerance (FT) guide. I found that once the RayCluster head node is re-created, the historical job logs and task summaries in the dashboard are lost, even though GCS persistence is configured. Is this expected? And how can I view the historical job logs and task summaries in the dashboard after the head node is re-created?

Cannot get the task summaries for jobs that ran on the old head node (screenshot: Jobs View).

Cannot get the job logs (screenshot: Job Logs View).

Versions / Dependencies

ray 2.9.0, kuberay-operator 1.0.0, python 3.8

Reproduction script

  1. Create a RayCluster and submit a job.
  2. Delete the head node pod with kubectl delete. (The job logs/tasks in the dashboard are now lost.)
  3. Submit a new job. (The new job's logs/tasks exist.)
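The steps above could be scripted roughly as follows. This is only a sketch under assumptions: the dashboard URL, cluster name, and namespace are placeholders, the function names are hypothetical, and the label selector assumes KubeRay's usual `ray.io/cluster` and `ray.io/node-type` pod labels. It uses the Ray Job Submission SDK (`JobSubmissionClient`), which may not be how the job was submitted in the original report.

```python
import subprocess
from typing import List


def head_pod_delete_cmd(cluster_name: str, namespace: str = "default") -> List[str]:
    """Build the kubectl command that deletes the head pod of a RayCluster."""
    # Assumption: KubeRay labels its pods with ray.io/cluster and ray.io/node-type.
    selector = f"ray.io/cluster={cluster_name},ray.io/node-type=head"
    return ["kubectl", "delete", "pod", "-n", namespace, "-l", selector]


def reproduce(dashboard_url: str, cluster_name: str, namespace: str = "default") -> str:
    """Submit a job, then delete the head pod; returns the submission ID."""
    # Imported lazily so the helper above can be used without Ray installed.
    from ray.job_submission import JobSubmissionClient

    # Step 1: submit a trivial job through the dashboard's job API.
    client = JobSubmissionClient(dashboard_url)
    submission_id = client.submit_job(entrypoint="python -c \"print('hello')\"")
    print("logs before deletion:", client.get_job_logs(submission_id))

    # Step 2: delete the head pod; KubeRay is expected to re-create it.
    subprocess.run(head_pod_delete_cmd(cluster_name, namespace), check=True)

    # Step 3 (manual): once the new head pod is ready and the dashboard is
    # reachable again, call client.get_job_logs(submission_id) for the old job
    # and submit a new job to compare.
    return submission_id
```

For example, `reproduce("http://127.0.0.1:8265", "raycluster-sample")` would run steps 1 and 2 against a port-forwarded dashboard; both argument values here are illustrative.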

Issue Severity

High: It blocks me from completing my task.

jjyao commented 6 months ago

Hi @shadowdsp, this is expected for now: this information is not persisted to Redis, so it is lost after the head node is re-created.

shadowdsp commented 5 months ago

Thank you @jjyao. Is there any plan to persist this information in the future? For now, we need additional components to maintain historical task data.
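As a rough illustration of what such an additional component might look like, the sketch below copies job logs and basic metadata out of the cluster before the head node goes away. It is an assumption-laden example, not a Ray feature: `archive_job_logs` and the output layout are hypothetical, the `client` is expected to behave like `ray.job_submission.JobSubmissionClient` (`list_jobs()` and `get_job_logs()`), and it covers only job logs, not task summaries.

```python
import json
from pathlib import Path
from typing import Any, List


def archive_job_logs(client: Any, out_dir: str) -> List[str]:
    """Write each job's logs and basic metadata to out_dir.

    `client` is assumed to expose list_jobs() and get_job_logs(), as
    ray.job_submission.JobSubmissionClient does. Returns the archived IDs.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    archived = []
    for job in client.list_jobs():
        submission_id = getattr(job, "submission_id", None)
        if not submission_id:
            # Jobs not started through the job submission API have no
            # submission ID, so their logs cannot be fetched this way.
            continue
        (out / f"{submission_id}.log").write_text(client.get_job_logs(submission_id))
        meta = {
            "submission_id": submission_id,
            "status": str(getattr(job, "status", "")),
            "entrypoint": getattr(job, "entrypoint", ""),
        }
        (out / f"{submission_id}.json").write_text(json.dumps(meta))
        archived.append(submission_id)
    return archived
```

Run periodically (for example from a CronJob) with `out_dir` on a persistent volume or synced to object storage, this would keep the logs readable after the head pod is replaced, although they would not reappear in the dashboard.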