restatedev / restate

Restate is the platform for building resilient applications that tolerate all infrastructure faults without the need for a PhD.
https://docs.restate.dev

Snapshot partition processor state to object store #1807

Open tillrohrmann opened 3 months ago

tillrohrmann commented 3 months ago

To support partition processor bootstrap, catching up stale processors after downtime (i.e. handle trim gaps), and to safely trim the log, we need snapshotting support.

Scope and features

How will snapshots be triggered? What is their frequency?

Where do snapshots go?

How are snapshots organized in the target bucket?

How will trimming be driven?

How will the Cluster Controller learn about what valid snapshots exist (and in which locations in the future)?

How will PPs be bootstrapped from a snapshot?

How will we handle trim gaps?

Who manages the lifecycle of snapshots?

Additional considerations:

Consider but don't implement

### Tasks
- [ ] https://github.com/restatedev/restate/issues/1892
- [ ] https://github.com/restatedev/restate/issues/1894
- [ ] https://github.com/restatedev/restate/issues/2246
- [ ] https://github.com/restatedev/restate/issues/2197
- [ ] https://github.com/restatedev/restate/issues/2000
- [ ] https://github.com/restatedev/restate/issues/2247
- [ ] https://github.com/restatedev/restate/issues/1812
pcholakov commented 1 week ago

Rough notes from chatting with @tillrohrmann:

Out of scope for now:

Open questions:


Some thoughts on the open questions:

> How will we handle multi-region / geo-replicated support?

I think we can leave this out of scope for now and only manage it in the object store config; S3 and Azure Blob Storage both support async cross-region replication. For something like snapshots, where picking a slightly older one to bootstrap from is ok, this is completely acceptable. In the worst case, new PPs won't be able to start up in a region whose snapshot bucket replication is running well behind the log tail. And a region in such a condition will likely be experiencing other difficulties beyond just snapshot staleness.
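The "slightly older snapshot is ok" argument reduces to a simple invariant: a processor can bootstrap from a snapshot as long as the snapshot covers the log's trim point, because everything after the snapshot can be replayed from the log. A hypothetical sketch of that check (names are illustrative, not Restate APIs):

```python
# Hypothetical check: a stale replicated snapshot is usable as long as it
# covers the log trim point, i.e. no records between the snapshot's
# applied LSN and the trim point have been trimmed away.

def can_bootstrap_from(snapshot_lsn: int, log_trim_lsn: int) -> bool:
    # The snapshot must include all state up to at least the trim point;
    # everything after snapshot_lsn is then replayed from the log.
    return snapshot_lsn >= log_trim_lsn

# A snapshot lagging the log tail is fine if the log still holds the delta:
assert can_bootstrap_from(snapshot_lsn=900, log_trim_lsn=800)
# But if bucket replication lags behind trimming, the PP cannot start up:
assert not can_bootstrap_from(snapshot_lsn=700, log_trim_lsn=800)
```

This is also why trimming must be driven conservatively: the log should only be trimmed up to LSNs already covered by a durable snapshot.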

> How will the snapshot lifecycle be managed?

My 2c: we should upload snapshots and never modify them afterwards, leaving lifecycle management to object store policies. For example, S3 supports rich lifecycle policies to migrate objects to cheaper storage classes, or delete them after a while. The one exception is local directory snapshots. Assuming those are used only for short-lived test clusters, we shouldn't have long-term disk usage problems with them.
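For illustration, this is the kind of S3 lifecycle configuration that could handle snapshot aging without any involvement from Restate; the prefix, storage class, and retention periods are example values, not a recommendation:

```json
{
  "Rules": [
    {
      "ID": "snapshot-tiering",
      "Filter": { "Prefix": "snapshots/" },
      "Status": "Enabled",
      "Transitions": [
        { "Days": 30, "StorageClass": "GLACIER" }
      ],
      "Expiration": { "Days": 180 }
    }
  ]
}
```

Such a policy can be attached to the bucket with `aws s3api put-bucket-lifecycle-configuration`, keeping snapshot retention entirely out of Restate's code path.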

pcholakov commented 1 week ago

Updated issue description based on our internal discussion yesterday.