pdames closed this issue 1 year ago
This work is required as part of https://github.com/ray-project/ray_beam_runner/issues/2
> The state manager should also support efficient, atomic checkpointing and restoration of all state persisted in Ray's in-memory object store to durable storage (e.g. on disk or in a durable cloud storage service).
This is interesting. We might have to do some work on improving the semantics of ObjectRef serialization (in particular its interaction with ref-counting), since right now ObjectRefs are pinned in memory forever once they are exported. Hence, checkpointing may cause these objects to be leaked in the object store. cc @jjyao
@pabloem prototype: https://github.com/ray-project/ray_beam_runner/pull/6
@iasoon implemented something like this
The Ray Pipeline State Manager is a central service that consolidates the execution state of all scheduled pipeline work items in Ray's object store. It should be based on the current single-process implementation used in Beam's FnApiRunner.
At a high level, this should be a Ray Actor that worker tasks use for (1) durable persistence of any `ObjectRef` that they have persisted in Ray's object store via `ref = ray.put(obj)` and (2) on-demand retrieval of any persisted `ObjectRef`, which they can materialize via `obj = ray.get(ref)`.
The state manager should also support efficient, atomic checkpointing and restoration of all state persisted in Ray's in-memory object store to durable storage (e.g. on disk or in a durable cloud storage service).
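
To make the checkpointing requirement concrete, here is a minimal sketch of the atomic checkpoint/restore part only. It is not the design from the linked prototype: the class name `PipelineStateManager`, its methods, and the pickle-to-local-file format are all assumptions made for illustration. A plain dict stands in for state held in Ray's object store so that the sketch runs without a cluster. Atomicity comes from writing to a temporary file and renaming it over the previous checkpoint.

```python
import os
import pickle
import tempfile
from typing import Any, Dict


class PipelineStateManager:
    """Hypothetical sketch of a state manager with atomic checkpoints.

    In a Ray deployment this class could be turned into an actor with
    ray.remote(PipelineStateManager); that wrapping is omitted here.
    """

    def __init__(self, checkpoint_path: str) -> None:
        self._checkpoint_path = checkpoint_path
        # Stand-in for state held in Ray's object store. A Ray-based version
        # might map keys to ObjectRefs and resolve them with ray.get(ref)
        # at checkpoint time (assumption, not shown).
        self._state: Dict[str, Any] = {}

    def put(self, key: str, value: Any) -> None:
        self._state[key] = value

    def get(self, key: str) -> Any:
        return self._state[key]

    def checkpoint(self) -> None:
        """Write all state to a temp file, then atomically rename it."""
        directory = os.path.dirname(os.path.abspath(self._checkpoint_path))
        # The temp file lives in the target directory so the rename stays
        # on one filesystem.
        fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
        try:
            with os.fdopen(fd, "wb") as f:
                pickle.dump(self._state, f)
                f.flush()
                os.fsync(f.fileno())
            # os.replace swaps in the new file in a single step, so readers
            # see either the old checkpoint or the new one, never a partial.
            os.replace(tmp_path, self._checkpoint_path)
        except BaseException:
            if os.path.exists(tmp_path):
                os.remove(tmp_path)
            raise

    def restore(self) -> None:
        """Replace the in-memory state with the last completed checkpoint."""
        with open(self._checkpoint_path, "rb") as f:
            self._state = pickle.load(f)
```

A checkpoint to a cloud storage service would need a different commit step than a local rename, and this sketch does not address the ObjectRef pinning concern raised earlier in the thread.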