[Core] GCS FT lost all the old job history after head node recovery

Hi, this is my observation:

1. Create a Ray cluster on an EKS cluster with an external Redis for GCS.
2. Crash the head node by sending a potentially bad job to it. For example, give the head node very few resources and run heavy work on it (without ray.remote).
3. The head node crashes, and a new head node joins and tries to reconnect to GCS.

The new head node shows no data in the job list, even after a job is sent to it. However, if you flush the Redis and then send a new job, you will be able to see that job in the list.
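To illustrate the "heavy work without ray.remote" example, here is a minimal sketch of a job entrypoint script. It is not the exact job I used; the function name `hog_memory` and the default size are made up for illustration. It assumes the job's entrypoint runs on the head node, so plain Python that allocates more memory than the head node has should get the head node killed.

```python
"""Hypothetical job entrypoint that does heavy work without ray.remote."""
import sys


def hog_memory(total_mb: int, chunk_mb: int = 64) -> list:
    """Allocate roughly total_mb MiB in this process and keep it referenced."""
    chunks = []
    allocated = 0
    while allocated < total_mb:
        size = min(chunk_mb, total_mb - allocated)
        # Non-zero fill so the buffer is actually written, not just reserved.
        chunks.append(b"x" * (size * 1024 * 1024))
        allocated += size
    return chunks


if __name__ == "__main__":
    # Assumed default of 8 GiB; pick something larger than the head node's memory.
    total_mb = int(sys.argv[1]) if len(sys.argv) > 1 else 8192
    held = hog_memory(total_mb)
    print(f"holding {sum(len(c) for c in held) // (1024 * 1024)} MiB")
```

Submitting this as a job (for example with `ray job submit -- python hog.py 8192`, where `hog.py` is a placeholder file name) and then checking `ray job list` after the new head node comes up is roughly the flow described above.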

I haven't found a very simple way to reproduce this yet.

[Core] GCS FT lost all the old job history after head node recovery #44218
