ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0
33.93k stars 5.77k forks source link

[Core] Support deserializing remote function handles across clusters #24152

Closed clarkzinzow closed 2 years ago

clarkzinzow commented 2 years ago

TL;DR: When deserializing a remote function handle on a different cluster, we should trigger re-exporting of the remote function. Currently, this will sometimes not happen, and importing the function will hang forever since the remote function is never exported on the new cluster.

Context

Tune needs to support resuming a stopped experiment on a different cluster; in order to achieve this, everything that's serialized when pausing the experiment, such as currently set hyperparameters, needs to be able to be deserialized on a new cluster. In order to support tuning datasets (e.g. k-fold cross-validation), we need to be able to support cross-cluster deserializing of Dataset objects. We're currently purging object refs when serializing for such experiment stopping, since these can't be used across clusters, but the current architecture of Datasets makes it difficult to purge remote function handles, so it would be great if these could be deserialized across clusters.

Problem

Currently, when serializing remote function handles for cross-cluster use, we're having to unset the _last_export_session_and_job field to force all remote functions to be re-exported when the handle is deserialized on a new cluster, since otherwise Ray will hang indefinitely trying to import a function that's never been exported. This is because _last_export_session_and_job is not unique across clusters, so the logic that triggers re-exporting of the remote function when the current session + job is different than that on the handle fails to do so when deserializing across Ray clusters, even though the function has not yet been exported on this cluster.

Possible Solutions/Improvements

If possible, we should try to either:

  1. Support cross-cluster deserialization of remote functions by adding a unique-per-cluster prefix to _last_export_session_and_job.
  2. Detect deserialization of a remote function on a new (different) cluster and error eagerly.
  3. In the import loop, make this fail loudly after some amount of time instead of hanging indefinitely.
stale[bot] commented 2 years ago

Hi, I'm a bot from the Ray team :)

To help human contributors to focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months.

If there is no further activity in the 14 days, the issue will be closed!

You can always ask for help on our discussion forum or Ray's public slack channel.

stale[bot] commented 2 years ago

Hi again! The issue will be closed because there has been no more activity in the 14 days since the last message.

Please feel free to reopen or open a new issue if you'd still like it to be addressed.

Again, you can always ask for help on our discussion forum or Ray's public slack channel.

Thanks again for opening the issue!