Open: clarng opened this issue 1 year ago
First step: Add logging or check failure
TODO: Add a check in the release test to get a repro.
This seems to reproduce very consistently with the PR when the flag is on.
What happened + What you expected to happen
Per PR https://github.com/ray-project/ray/pull/32217
Returning a worker can fail, and that failure could trigger a potential bug. Look into improving the protocol, for example by relying on heartbeats to track worker ownership.
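
To make the heartbeat idea concrete, here is a minimal sketch of lease-style worker ownership. It is not Ray's actual protocol or API: `WorkerLeaseTable`, `Lease`, and every method name are hypothetical, and the timeout value is arbitrary. The point is only that if the explicit return message is lost, the worker is still reclaimed once the owner's heartbeats stop.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Lease:
    """Hypothetical record of who currently owns a worker."""
    owner_id: str
    last_heartbeat: float


class WorkerLeaseTable:
    """Hypothetical ownership table; not part of Ray."""

    def __init__(self, timeout_s: float,
                 clock: Callable[[], float] = time.monotonic):
        self._timeout_s = timeout_s
        self._clock = clock  # injectable so tests can control time
        self._leases: Dict[str, Lease] = {}

    def grant(self, worker_id: str, owner_id: str) -> None:
        # Lease the worker to an owner and start its heartbeat timer.
        self._leases[worker_id] = Lease(owner_id, self._clock())

    def heartbeat(self, worker_id: str, owner_id: str) -> bool:
        # Renew the lease only if the caller is the current owner.
        lease = self._leases.get(worker_id)
        if lease is None or lease.owner_id != owner_id:
            return False
        lease.last_heartbeat = self._clock()
        return True

    def return_worker(self, worker_id: str, owner_id: str) -> bool:
        # Explicit return path. If this message is lost, reap_expired()
        # still frees the worker after the timeout.
        lease = self._leases.get(worker_id)
        if lease is None or lease.owner_id != owner_id:
            return False
        del self._leases[worker_id]
        return True

    def reap_expired(self) -> List[str]:
        # Reclaim workers whose owners have stopped sending heartbeats.
        now = self._clock()
        expired = [w for w, lease in self._leases.items()
                   if now - lease.last_heartbeat > self._timeout_s]
        for worker_id in expired:
            del self._leases[worker_id]
        return expired
```

With this shape, a failed return is no longer fatal to the bookkeeping: the explicit return becomes an optimization, and the heartbeat timeout is the source of truth for ownership. The trade-off is that a worker can sit idle for up to one timeout after a lost return.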