Closed: clarkzinzow closed this issue 2 years ago
Hi, I'm a bot from the Ray team :)
To help human contributors to focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months.
If there is no further activity in the 14 days, the issue will be closed!
You can always ask for help on our discussion forum or Ray's public slack channel.
Hi again! The issue will be closed because there has been no more activity in the 14 days since the last message.
Please feel free to reopen or open a new issue if you'd still like it to be addressed.
Again, you can always ask for help on our discussion forum or Ray's public slack channel.
Thanks again for opening the issue!
TL;DR: When deserializing a remote function handle on a different cluster, we should trigger re-exporting of the remote function. Currently, this will sometimes not happen, and importing the function will hang forever since the remote function is never exported on the new cluster.
Context
Tune needs to support resuming a stopped experiment on a different cluster. To achieve this, everything that's serialized when pausing the experiment, such as the currently set hyperparameters, must be deserializable on a new cluster. To support tuning datasets (e.g. k-fold cross-validation), we need to support cross-cluster deserialization of `Dataset` objects. We're currently purging object refs when serializing for such experiment stopping, since these can't be used across clusters, but the current architecture of Datasets makes it difficult to purge remote function handles, so it would be great if these could be deserialized across clusters.
Problem
Currently, when serializing remote function handles for cross-cluster use, we have to unset the `_last_export_session_and_job` field to force all remote functions to be re-exported when the handle is deserialized on a new cluster; otherwise Ray will hang indefinitely trying to import a function that has never been exported. This is because `_last_export_session_and_job` is not unique across clusters, so the logic that triggers re-exporting of the remote function when the current session + job differs from the one recorded on the handle fails to do so when deserializing across Ray clusters, even though the function has not yet been exported on the new cluster.
Possible Solutions/Improvements
If possible, we should try to either:
_last_export_session_and_job
.
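
To make the failure mode and the current workaround from the Problem section concrete, here is a minimal toy model in plain Python. It is not Ray's implementation: `ToyCluster`, `ToyRemoteFunction`, `ensure_exported`, `can_import`, and `strip_export_marker` are hypothetical names invented for illustration, and the assumption that two clusters can report the exact same session + job value is a simplification of "not unique across clusters". Only the field name `_last_export_session_and_job` comes from the issue. `copy.copy` stands in for a serialize/deserialize round trip.

```python
import copy


class ToyCluster:
    """Hypothetical stand-in for a cluster; tracks which functions were exported to it."""

    def __init__(self, session_and_job):
        self.session_and_job = session_and_job
        self.exported = set()


class ToyRemoteFunction:
    """Hypothetical stand-in for a remote function handle."""

    def __init__(self, name):
        self.name = name
        self._last_export_session_and_job = None

    def ensure_exported(self, cluster):
        # Re-export only when the recorded session + job differs from the current one.
        if self._last_export_session_and_job != cluster.session_and_job:
            cluster.exported.add(self.name)
            self._last_export_session_and_job = cluster.session_and_job

    def can_import(self, cluster):
        # In this toy model, an import succeeds only if the function was exported here.
        return self.name in cluster.exported


def strip_export_marker(fn):
    """Sketch of the workaround: clear the marker on a copy before shipping the handle."""
    clone = copy.copy(fn)
    clone._last_export_session_and_job = None
    return clone


# Assumption for illustration: both clusters report the same session + job value.
cluster_a = ToyCluster(("session", 1))
cluster_b = ToyCluster(("session", 1))

fn = ToyRemoteFunction("f")
fn.ensure_exported(cluster_a)

# Naive round trip: the stale marker matches cluster_b, so no re-export happens.
naive = copy.copy(fn)
naive.ensure_exported(cluster_b)
naive_ok = naive.can_import(cluster_b)

# Workaround: the marker is cleared, so the function is re-exported on cluster_b.
fixed = strip_export_marker(fn)
fixed.ensure_exported(cluster_b)
fixed_ok = fixed.can_import(cluster_b)
```

In this sketch `naive_ok` ends up `False`, which models the import that would never complete, while `fixed_ok` ends up `True`. A marker that is unique per cluster would make the comparison in `ensure_exported` fail on the new cluster without any manual unsetting.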