ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[core] potential issues when workers are not properly returned to worker pool #32271

Open clarng opened 1 year ago

clarng commented 1 year ago

What happened + What you expected to happen

Per PR https://github.com/ray-project/ray/pull/32217

Returning a worker to the worker pool can fail, and that failure could trigger a latent bug. We should look into improving the protocol, for example by relying on heartbeats to track worker ownership.
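To make the heartbeat idea concrete, here is a minimal sketch of lease-style ownership, not Ray's actual raylet protocol. Every name in it (`WorkerPool`, `Lease`, `lease_worker`, `heartbeat`, `return_worker`, `reclaim_expired`, `lease_timeout_s`) is hypothetical and chosen only for illustration. The assumption is that the pool keeps a worker leased only while its owner keeps sending heartbeats, so a lost "return worker" message no longer leaks the worker.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Lease:
    owner_id: str
    last_heartbeat: float


@dataclass
class WorkerPool:
    """Hypothetical pool that tracks worker ownership through heartbeats."""

    lease_timeout_s: float = 5.0
    clock: Callable[[], float] = time.monotonic  # injectable for tests
    idle: List[str] = field(default_factory=list)
    leased: Dict[str, Lease] = field(default_factory=dict)

    def lease_worker(self, owner_id: str) -> Optional[str]:
        """Hand an idle worker to owner_id, or return None if none is idle."""
        if not self.idle:
            return None
        worker_id = self.idle.pop()
        self.leased[worker_id] = Lease(owner_id, self.clock())
        return worker_id

    def heartbeat(self, owner_id: str, worker_id: str) -> bool:
        """Refresh the lease; False means the caller no longer owns the worker."""
        lease = self.leased.get(worker_id)
        if lease is None or lease.owner_id != owner_id:
            return False
        lease.last_heartbeat = self.clock()
        return True

    def return_worker(self, owner_id: str, worker_id: str) -> bool:
        """Explicit return path; this is the message that may get lost."""
        lease = self.leased.get(worker_id)
        if lease is None or lease.owner_id != owner_id:
            return False
        del self.leased[worker_id]
        self.idle.append(worker_id)
        return True

    def reclaim_expired(self) -> List[str]:
        """Fallback path: reclaim workers whose owners stopped heartbeating."""
        now = self.clock()
        expired = [
            worker_id
            for worker_id, lease in self.leased.items()
            if now - lease.last_heartbeat > self.lease_timeout_s
        ]
        for worker_id in expired:
            del self.leased[worker_id]
            self.idle.append(worker_id)
        return expired
```

In this sketch the explicit return is only an optimization. Correctness rests on `reclaim_expired`, which the pool would call periodically, so a failed return delays reuse of the worker by at most the lease timeout instead of leaking it.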

rkooo567 commented 1 year ago

First step: add logging or a check failure on the path where returning a worker fails.
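As a rough illustration of that first step, the sketch below wraps a return call so that a failure is either logged or escalated to a hard failure. It is written in Python for brevity and is not Ray's code: `return_worker_checked`, `ReturnWorkerError`, the `send_return` callable, and the `fatal` flag are all hypothetical names.

```python
import logging
from typing import Callable

logger = logging.getLogger("worker_pool_sketch")


class ReturnWorkerError(RuntimeError):
    """Raised when a failed worker return should be treated as fatal."""


def return_worker_checked(
    send_return: Callable[[str], bool],
    worker_id: str,
    *,
    fatal: bool = False,
) -> bool:
    """Call send_return(worker_id) and surface any failure.

    send_return is a stand-in for whatever RPC returns the worker to the
    pool. It is assumed to return False or raise an exception on failure.
    """
    try:
        ok = send_return(worker_id)
        error = None
    except Exception as exc:  # the hypothetical RPC may raise on failure
        ok = False
        error = exc

    if ok:
        return True

    message = f"Failed to return worker {worker_id} to the worker pool: {error}"
    if fatal:
        # "Check failure" mode: stop immediately so the bug is easy to spot.
        raise ReturnWorkerError(message)
    # Logging mode: record the failure and let the caller decide what to do.
    logger.error(message)
    return False
```

Running with `fatal=True` in a release test would turn a silent failure into an immediate, visible one, which is one way to get a repro.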

scv119 commented 1 year ago

TODO: Add a check to a release test to get a repro.

clarng commented 1 year ago

This seems to reproduce very consistently with the PR when the flag is on.

stale[bot] commented 11 months ago

Hi, I'm a bot from the Ray team :)

To help human contributors focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months.

If there is no further activity in the next 14 days, the issue will be closed!

You can always ask for help on our discussion forum or Ray's public Slack channel.