Open rickyyx opened 1 year ago
We will try reproducing it
@rickyyx can you provide me a repro script?
cc @rickyyx any follow-up on the repro? We will tag this as P2 until a repro is found.
Sorry - I will try to work on a repro ASAP
We have started to observe this too. I've not tried to build a repro yet, but we see it on large production clusters where there is a lot of worker churn due to spot availability.
We are running Ray 2.6.3 conda packages.
Hmm @rickyyx do you think you will have time in Ray 2.9 to start making a repro script?
What happened + What you expected to happen
Originally reported at https://discuss.ray.io/t/very-rare-error-that-occurs-when-nodes-disconnect-and-then-reconnect/10256/3
Versions / Dependencies
master
Reproduction script
Start a Ray cluster and launch a job. While the job is running, log in to one worker node and run ray stop. After the job completes, reconnect that worker node to the cluster and relaunch the job.
TODO
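Until a proper repro exists, the manual steps above could be scripted roughly as follows. This is only a sketch, not a confirmed repro: the head address, the worker hostname, the job command, and the helper names (worker_commands, run_repro) are all placeholders, and it assumes passwordless ssh to the worker and that the ray CLI is on the worker's PATH. By default it only prints the planned commands and touches nothing.

```python
import shlex
import subprocess
import time


def worker_commands(head_address, worker_host):
    """Return the (stop, reconnect) commands to run on the worker over ssh."""
    ssh = ["ssh", worker_host]
    stop_cmd = ssh + ["ray", "stop"]
    start_cmd = ssh + ["ray", "start", f"--address={head_address}"]
    return stop_cmd, start_cmd


def run_repro(head_address, worker_host, job_cmd, dry_run=True, settle_seconds=10):
    """Launch job, stop the worker mid-job, reconnect it, relaunch the job.

    All arguments are hypothetical placeholders for your own cluster.
    With dry_run=True, nothing is executed and the plan is returned as strings.
    """
    stop_cmd, start_cmd = worker_commands(head_address, worker_host)
    plan = [job_cmd, stop_cmd, start_cmd, job_cmd]
    if dry_run:
        return [shlex.join(cmd) for cmd in plan]

    # Start the first job in the background so the worker can be stopped mid-run.
    first = subprocess.Popen(job_cmd)
    # Assumption: give the job some time to schedule work on the worker.
    time.sleep(settle_seconds)
    subprocess.run(stop_cmd, check=True)

    # Wait for the first job to complete before reconnecting the worker.
    first.wait()
    subprocess.run(start_cmd, check=True)

    # Relaunch the job; the reported error would be expected around here.
    second = subprocess.run(job_cmd)
    return [first.returncode, second.returncode]


if __name__ == "__main__":
    for line in run_repro("10.0.0.1:6379", "worker-1", ["python", "job.py"]):
        print(line)
```

Pass dry_run=False to actually execute the steps. Since the original report calls the error very rare, the sequence likely needs to be looped many times, and settle_seconds tuned to the job, before anything shows up.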
Issue Severity
None