ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Core] GCS FT lost all the old job history after head node recovery #44218

Open jjyao opened 7 months ago

jjyao commented 7 months ago

What happened + What you expected to happen

With GCS fault tolerance (FT) enabled, all of the old job history is lost after the head node recovers.

Versions / Dependencies

master

Reproduction script

N/A

Issue Severity

None

brucebismarck commented 7 months ago

Hi, this is my observation: Create a Ray cluster on an EKS cluster with external Redis for GCS.
Crash the head node by sending a potentially bad job to it. For example, give the head node very few resources and run heavy work directly on it (without ray.remote). The head node crashes, and a new head node joins and tries to reconnect to GCS. The new head node shows no data in the job list, even after a job is sent to it. However, if you flush Redis and then send a new job, the job does show up in the list.
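To make the "no data in the job list" observation easier to check, here is a minimal sketch that compares the job list before and after the head node is replaced. It assumes the Ray Jobs API is reachable on the dashboard address; the helper names `fetch_job_ids` and `lost_jobs` are hypothetical and not part of Ray, and this does not by itself reproduce the crash.

```python
from typing import Iterable, Set


def fetch_job_ids(dashboard_address: str) -> Set[str]:
    """Return the IDs of all jobs the head node currently reports.

    Ray is imported lazily so the comparison helper below can be used
    on saved ID lists without a Ray installation.
    """
    from ray.job_submission import JobSubmissionClient

    client = JobSubmissionClient(dashboard_address)
    ids = set()
    for job in client.list_jobs():
        # Jobs submitted through the Jobs API carry a submission_id;
        # driver-only jobs may only have a job_id.
        job_id = job.submission_id or job.job_id
        if job_id:
            ids.add(job_id)
    return ids


def lost_jobs(before: Iterable[str], after: Iterable[str]) -> Set[str]:
    """Jobs listed before the head node crash that are missing after recovery."""
    return set(before) - set(after)
```

One way to use it: call `fetch_job_ids("http://<head-node>:8265")` before crashing the head node, call it again once the new head node is up, and pass both results to `lost_jobs`. A non-empty result would match the behavior described in this issue. The address here is a placeholder for your own dashboard URL.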

I haven't found a very simple way to reproduce this yet.