[RFC] Simplified Snapshot v2 with Timestamp Pinning in Remote Store

Goal

Today, for cluster with remote backed storage feature, we use a variant of snapshot, called as shallow snapshot. Shallow snapshots refer data that is already uploaded as part of remote store. In order to prevent deletion of data in remote store that is referred by shallow snapshots, we need a locking mechanism that is used by remote store garbage collection. In this RFC, we discuss current locking mechanism and its shortcomings and propose a new mechanism that scales independent of number of shards/indices/nodes in the cluster. We also discuss how this new approach can be evolved into PITR (point-in-time restore).

This RFC only covers remote store side changes to keep the concern separated and RFC scope limited. Snapshot flow changes will be covered in another RFC.

Current Locking Mechanism

Segments are uploaded to remote store post each refresh. After segments are uploaded successfully, we upload a checkpoint (referred as segment metadata) which refers to all the active segments for the given shard at the point of refresh.
In order to lock segment files, we just need to lock the corresponding segment metadata file in the remote store.
In current locking mechanism, lock is acquired on latest metadata file by creating <metadata_filename>__<snapshot_id> file under lock directory in remote store.
Segment garbage collection skips deletion of locked metadata (and corresponding segment) files. Example is provided in the Appendix: Remote Store Garbage Collection Example
Translog garbage collection is not lock aware. This works as snapshot triggers explicit flush to ensure that all the writes are part of segments.

Sequence Diagram

Snapshot_Creation_Current_GH

Issues with Current Locking Mechanism

With shallow snapshots, we lose incremental nature of snapshot. Even if shard does not receive any traffic since last snapshot, new snapshot will acquire new lock on the same metadata file that has lock from the earlier snapshot.
Due to this, if there are 100K shards in a cluster, each snapshot will create 100K lock files even if only 100 shards were actively receiving indexing throughput.
This also results in too many calls to remote store during snapshot creation as well as deletion and supposedly lightweight shallow snapshots becomes bulky .

Requirements

Remote store calls per shard on snapshot creation to take a lock is not scalable. Locking mechanism should scale independent of number of shards.
With right combination of granularity and retention, reducing/increasing snapshot granularity should not impact performance or stability of the cluster.
Should align with long term vision of PITR (point in time restore)

Timestamp Based Implicit Locking

In this approach, we will move away from explicit lock file creation for a given metadata file. Instead, we will use timestamp in metadata filename to acquire implicit lock (Refer Metadata Filename Format section in Appendix for more details on metadata filename). We call it Timestamp Pinning.

Proposed Pinned Timestamp Format

Timestamp     Pinning Entity
1722415819000 Snapshot1
1722425818000 Snapshot2, Snapsot3
1722435817000 Snapshot4
1722445816000 Snapshot5
1722455815000 Snapshot6

Approach

We maintain a list of pinned timestamps at a cluster level. For each timestamp in this list, garbage collection for segment as well as translog will skip deletion of metadata file that matches (Appendix: Metadata file matching a timestamp) the pinned timestamp. To avoid triggering flush/refresh on each shard and handling potential failures, in this approach, we make translog garbage collector aware of snapshot locks.

Steps

Each data node keeps an in-memory data structure remote_store_pinned_timestamps
1. Exact approach will be covered in a separate GH issue.
Garbage collection of segment as well as translog will be modified as below:
1. If the last update time of remote_store_pinned_timestamps is > X mins, skip garbage collection.
2. Fetch metadata files in sorted order (which generally is reverse-chronological order but not always)
3. If timestamp of a metadata file md1 is > pinned_timestamp_a and the timestamp of next metadata file md2 <= pinned_timestamp_a add md2 to pinned_metadata_files
4. Skip deletion of metadata files in pinned_metadata_files and corresponding data files
5. Segment garbage collection will continue to skip locked metadata files for backward compatibility
Remote Store restore flow needs to be modified that will accept a timestamp to restore data to.

Sequence Diagram

Snapshot_Creation_Pinned_Timestamp_GH

Pros

Snapshot creation time is deterministic and does not increase with number of shards.
Snapshot creation can be centralised at cluster manager node resulting in reduced number of calls between cluster manager and data nodes.
As remote store guarantees durability of last successful write operation, snapshot for a shard would never result in failure (even if cluster is red). Failures can still happen while uploading snapshot metadata but snapshot status would either be successful or failed. We will not have partial snapshot state.

Cons

Need defensive approach while communicating changes to pinned timestamp list to all the nodes in the cluster. If garbage collector finds the timestamp list to be stale, it would skip the collection.
Timestamp is pinned at cluster level so would be common to all the indices in the cluster. Supporting snapshot for subset of indices will not be straightforward.

Extending the approach to PITR

As this approach uses timestamp based pinning, it can be extended to point-in-time restore. As pinning timestamp does not involve multiple remote store or node-node calls, we can support timestamp pinning at lower granularity. To avoid the synchronisation delay between pinning and communicating it to data node, in PITR, we can provide capability of fixed intervals. With this, we can support PITR granularity as low as 1 minute (we need to control retention based on granularity). Pinning the timestamp can still be supported for on-demand cases.

Appendix

Metadata file matching a timestamp

In this section, timestamps are provided in yyyy_MM_dd_HH_mm_ss format for readability purpose whereas actual timestamps would be epoch.

We call a metadata file matches timestamp T, if it has the max timestamp among all metadata files with timestamp at most T.
Example:
We want to find matching metadata file for timestamp 2024/07/05 17:00:00 and we have following options:
- metadata_2024_07_05_16_05_51
- metadata_2024_07_05_16_25_34
- metadata_2024_07_05_16_56_47
- metadata_2024_07_05_16_58_21
- metadata_2024_07_05_16_59_35
- metadata_2024_07_05_17_00_09
- metadata_2024_07_05_17_45_12
Last 2 are eliminated as they are above the timestamp.
Out of the remaining top 5 files, we take the one with highest timestamp.
metadata_2024_07_05_16_59_35 is considered as the metadata file that matches given timestamp T.

Metadata Filename Format

Remote Segment Store

Format - metadata__<Inverted Primary Term>__<Inverted Commit Generation>__<Inverted Translog Generation>__<Inverted Refresh Counter>__<Node ID>__<Inverted EPOCH>__<Metadata Version>
Example - metadata__9223372036854775806__9223372036854775796__9223372036854775647__9223372036854775883__-396831118__9223370334830299234__1

Remote Translog

Format - metadata__<Inverted Primary Term>__<Inverted Translog Generation>__<Inverted EPOCH>__<Node ID>__<Metadata Version>
Example - metadata__9223372036854775806__9223372036854775648__9223370334830643807__-396831118__1

Existing Remote Store Garbage Collection Example

// Metadata files in reverse chronological order
md01 -> S15
md02 -> S10, S13, S14 - Locked by snapshot S4
md03 -> S10, S13
md04 -> S10, S11, S12
md05 -> S10, S11
md06 -> s10
md07 -> s7, s8, s9 - Locked by snapshot S3
md08 -> s7, s8, s9
md09 -> s7, s8, s9
md10 -> s6, s7, s8
md11 -> s4, s5, s6
md12 -> s2, s3, s4 - Locked by snapshot S2
md13 -> s2, s3, s4
md14 -> s1, s2, s3 - Locked by snapshot S1

// Garbage Collection Steps
- Latest 10 metadata files are not considered in garbage collection
- Out of the remaining 4 metadata files, 2 are locked by snapshot, so they are filtered out.
- Metadata files eligible to be deleted: md11 and md13
- Segments that are referenced only by md11 or md13 - s5
- So, we delete segment s5 followed by deletion of metadata files md11 and md13

Remote Store Garbage Collection Example with Pinned Timestamps

// Metadata files in reverse chronological order
md01 -> S15
md02 -> S10, S13, S14 - Metadata created at 6PM - Shallow snapshot SS2
md03 -> S10, S13
md04 -> S10, S11, S12
md05 -> S10, S11
md06 -> s10
md07 -> s7, s8, s9 - Locked by snapshot S2
md08 -> s7, s8, s9
md09 -> s7, s8, s9
md10 -> s6, s7, s8
md11 -> s4, s5, s6
md12 -> s2, s3, s4 - Metadata created at 5PM - Shallow snapshot SS1
md13 -> s2, s3, s4
md14 -> s1, s2, s3 - Locked by snapshot S1

// Garbage Collection Steps
- Latest 10 metadata files are not considered in garbage collection
- Out of the remaining 4 metadata files, 2 md files are filtered out
  - md14 is locked by shallow snapshot that use explicit locking
  - md12 is used by shallow snapshot that use implicit locking
- Metadata files eligible to be deleted: md11 and md13
- Segments that are referenced only by md11 or md13 - s5
- So, we delete segment s5 followed by deletion of metadata files md11 and md13

Nice proposal @sachinpkale , the timestamp pinning approach sounds much better than locking mechanism we have today for shallow snapshots. Couple of questions:

How does index level snapshot/restore work? Do we handle it later during restore time or when acquiring snapshot we associate a list of indices with a timestamp? This might affect how garbage collection happens for each index.
Since we are implicitly relying on timestamps here, do users need to be mindful of clock synchronization b/w the nodes now?
Since we are making translog garbage collection aware of pinned timestamps, this means we would be holding up extra translog data in remote store?

@sachinpkale thanks for the RFC, I think I got the idea but have a question (my apologies if I missing something): where the timestamps (or epochs as you refer to them) are coming from?

UPD: really sorry for timing but this is the same question @linuxpi is asking (one of)

Thanks for the review @linuxpi and @reta

How does index level snapshot/restore work? Do we handle it later during restore time or when acquiring snapshot we associate a list of indices with a timestamp? This might affect how garbage collection happens for each index.

Initially, index level snapshots will not be supported for snapshot that use pinned timestamps. Index level restore will be supported in the same way it works today. I haven't given a lot thoughts around how to support index level snapshot but the format of pinned timestamps need to be changes in the way you suggested.

Since we are implicitly relying on timestamps here, do users need to be mindful of clock synchronization b/w the nodes now?

Good question. Timestamp on different servers in a cluster need not be exactly same (explained in next para) but yes, users need to make sure that diff is not very high. If we use existing cloud services, they promise microsecond level accuracy (Example: https://aws.amazon.com/blogs/compute/its-about-time-microsecond-accurate-clocks-on-amazon-ec2-instances/)

Why don't we need timestamps to be synchronised on different nodes in the cluster? Currently, when we take snapshot, it is not guaranteed that each shard will trigger flush at the same time and upload the data. Based on number of shards, the difference between first and last shard getting snapshotted can be in minutes. With pinned timestamps, we will actually be minimising the difference as each node will maintain the same state of pinned timestamp. Only difference would be timestamp diff between the nodes and we expect this to be few seconds.

Since we are making translog garbage collection aware of pinned timestamps, this means we would be holding up extra translog data in remote store?

Yes. In remote backed storage, we purge remote translog on refresh. This means, we will be holding translog data since last refresh in the remote store.

where the timestamps (or epochs as you refer to them) are coming from?

Timestamp Pinning would be owned by remote backed storage. Snapshot would be one of the users of it. Initially, only snapshot would be pinning the timestamp but we plan to expose an API if required.

where the timestamps (or epochs as you refer to them) are coming from?

Timestamp Pinning would be owned by remote backed storage. Snapshot would be one of the users of it. Initially, only snapshot would be pinning the timestamp but we plan to expose an API if required.

Also, how snapshot will pin the timestamp will be covered in another RFC.

@sachinpkale Thanks for the RFC. Looking forward to lower level information regarding the garbage cleanup for pinned timestamp information during failure scenarios.

Thanks for the RFC @sachinpkale. Couple of questions.

How long the translog has to be retained in the new approach? What would be the cost implications if any?
Are you considering to support both timestamp based snapshots and lock based snapshots? or only timestamp based snapshots will be supported going forward?

Thanks for the review @backslasht

How long the translog has to be retained in the new approach?

With remote store, we retain remote translog since last refresh. So, in this case, if we pin timestamp at 07:00:00 and segment metadata matching the timestamp is at 06:55:00, the translog metadata matching the timestamp would be 06:59:00, then remote translog will have data since 06:55:00.

What would be the cost implications if any?

We will be retaining translog data since last refresh for a given snapshot.

Are you considering to support both timestamp based snapshots and lock based snapshots? or only timestamp based snapshots will be supported going forward?

We will be supporting lock based snapshots at least in 2.x to retain backwards compatibility. We can think of deprecating it as part of 3.x

Thanks @sachinkale for this detailed RFC. Very excited to see this. with this feature, we will be very close to supporting PITR. Have following comments/queries:

Failures can still happen while uploading snapshot metadata but snapshot status would either be successful or failed. We will not have partial snapshot state.

why is this the case? Are we saying we will mark a snapshot as failed if snapshot metadata of any index of that snapshot fails?

To avoid triggering flush/refresh on each shard and handling potential failures, in this approach, we make translog garbage collector aware of snapshot locks.

Looks like we support this pinning for segment data and translog data. since we support capturing cluster state snapshot as well. Do we plan to do something similar for remote cluster state as well in future?

not totally related (or maybe will be discussed as part of design), but one another issue we had with shallow snapshots was. once a index is deleted, snapshot layer had to take care of remote store cleanup. With this approach, i can see that we do not need any direct communication between snapshot layer and remote store. So are we planning to introduce some other cluster level garbage collector or something as well that would take care of pinned md cleanup after the index is deleted?

Looks like we support this pinning for segment data and translog data. since we support capturing cluster state snapshot as well. Do we plan to do something similar for remote cluster state as well in future?

Yes, we plan to support pinning of cluster state as well.

With this approach, i can see that we do not need any direct communication between snapshot layer and remote store.

We still need the same cleanup approach as of today.

opensearch-project / OpenSearch