Open jjyao opened 7 months ago
Hi
This is my observation:
Create a Ray cluster on an EKS cluster with external Redis for GCS.
Crash the head node by sending it a potentially bad job. For example, give the head node very few resources and run a heavy workload on it (without ray.remote).
The head node crashes, and a new head node joins and tries to reconnect to GCS.
The new head node shows no data in the job list, even after a job is submitted to it. However, if you flush Redis and submit a new job, that job does appear in the list.
I haven't found a very simple way to reproduce this yet.
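The job list check in the last step could be scripted roughly as follows. This is only a sketch, not a confirmed reproduction: it assumes jobs are submitted through the Ray Jobs API and that the dashboard is reachable at an address you pass in. The helper names `list_job_ids` and `missing_after_recovery` are hypothetical and not part of Ray.

```python
from typing import Iterable, List, Set


def list_job_ids(dashboard_url: str) -> Set[str]:
    """Return the submission IDs the head node currently reports."""
    # Imported lazily so the pure helper below works without Ray installed.
    from ray.job_submission import JobSubmissionClient

    client = JobSubmissionClient(dashboard_url)
    # list_jobs() returns JobDetails objects; jobs started directly by a
    # driver script have no submission_id, so they are skipped here.
    return {job.submission_id for job in client.list_jobs() if job.submission_id}


def missing_after_recovery(before: Iterable[str], after: Iterable[str]) -> List[str]:
    """Job IDs seen before the head node crash but absent afterwards."""
    return sorted(set(before) - set(after))
```

To use it, call `list_job_ids("http://127.0.0.1:8265")` (an assumed dashboard address, for example through a port-forward) once before crashing the head node and once after the replacement head node is up, then pass both results to `missing_after_recovery`. A non-empty result would match the lost job history described above.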
What happened + What you expected to happen
GCS fault tolerance (FT) loses all the old job history after head node recovery.
Versions / Dependencies
master
Reproduction script
N/A
Issue Severity
None