restatedev / restate

Restate is the platform for building resilient applications that tolerate all infrastructure faults without the need for a PhD.
https://docs.restate.dev

Snapshot partition processor state to object store #1807

Open tillrohrmann opened 3 months ago

tillrohrmann commented 3 months ago

To support partition processor bootstrap, catching up stale processors after downtime (i.e. handle trim gaps), and to safely trim the log, we need snapshotting support.

Scope and features

How will snapshots be triggered? What is their frequency?

Where do snapshots go?

How are snapshots organized in the target bucket?

How will trimming be driven?

How will the Cluster Controller learn about what valid snapshots exist (and in which locations in the future)?

How will PPs be bootstrapped from a snapshot?

How will we handle trim gaps?

Who manages the lifecycle of snapshots?

Additional considerations:

Consider but don't implement

### Tasks
- [ ] https://github.com/restatedev/restate/issues/1892
- [ ] https://github.com/restatedev/restate/issues/1894
- [ ] https://github.com/restatedev/restate/issues/2246
- [ ] https://github.com/restatedev/restate/issues/2197
- [ ] https://github.com/restatedev/restate/issues/2000
- [ ] https://github.com/restatedev/restate/issues/2247
- [ ] https://github.com/restatedev/restate/issues/1812
pcholakov commented 1 week ago

Rough notes from chatting with @tillrohrmann:

Out of scope for now:

Open questions:


Some thoughts on the open questions:

> How will we handle multi-region / geo-replicated support?

I think we can leave this out of scope for now and only manage it in the object store config; S3 and Azure Blob Storage both support async cross-region replication. For something like snapshots, where picking a slightly older one to bootstrap from is ok, this is completely acceptable. In the worst case, new PPs won't be able to start up in a region whose snapshot bucket replication is running well behind the log tail. And a region in such a condition will likely be experiencing other difficulties beyond just snapshot staleness.
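The "slightly older snapshot is ok" argument reduces to a simple invariant: a processor can bootstrap from a snapshot as long as the snapshot covers the log's trim point, because everything after the snapshot can be replayed from the log. A hypothetical sketch of that check (names are illustrative, not Restate APIs):

```python
# Hypothetical check: a stale replicated snapshot is usable as long as it
# covers the log trim point, i.e. no records between the snapshot's
# applied LSN and the trim point have been trimmed away.

def can_bootstrap_from(snapshot_lsn: int, log_trim_lsn: int) -> bool:
    # The snapshot must include all state up to at least the trim point;
    # everything after snapshot_lsn is then replayed from the log.
    return snapshot_lsn >= log_trim_lsn

# A snapshot lagging the log tail is fine if the log still holds the delta:
assert can_bootstrap_from(snapshot_lsn=900, log_trim_lsn=800)
# But if bucket replication lags behind trimming, the PP cannot start up:
assert not can_bootstrap_from(snapshot_lsn=700, log_trim_lsn=800)
```

This is also why trimming must be driven conservatively: the log should only be trimmed up to LSNs already covered by a durable snapshot.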

> How will the snapshot lifecycle be managed?

My 2c: we should upload snapshots and never modify them afterwards, leaving lifecycle management to object store policies. For example, S3 supports rich lifecycle policies to migrate objects to cheaper storage classes, or delete them after a while. The one exception is local directory snapshots. Assuming those are used only for short-lived test clusters, we shouldn't have long-term disk usage problems with them.
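For illustration, this is the kind of S3 lifecycle configuration that could handle snapshot aging without any involvement from Restate; the prefix, storage class, and retention periods are example values, not a recommendation:

```json
{
  "Rules": [
    {
      "ID": "snapshot-tiering",
      "Filter": { "Prefix": "snapshots/" },
      "Status": "Enabled",
      "Transitions": [
        { "Days": 30, "StorageClass": "GLACIER" }
      ],
      "Expiration": { "Days": 180 }
    }
  ]
}
```

Such a policy can be attached to the bucket with `aws s3api put-bucket-lifecycle-configuration`, keeping snapshot retention entirely out of Restate's code path.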

pcholakov commented 1 week ago

Updated issue description based on our internal discussion yesterday.