Nice proposal @sachinpkale, the timestamp pinning approach sounds much better than the locking mechanism we have today for shallow snapshots. Couple of questions:
How does index level snapshot/restore work? Do we handle it later at restore time, or do we associate a list of indices with a timestamp when acquiring the snapshot? This might affect how garbage collection happens for each index.
Since we are implicitly relying on timestamps here, do users need to be mindful of clock synchronization between the nodes now?
Since we are making translog garbage collection aware of pinned timestamps, does this mean we would be holding extra translog data in the remote store?
@sachinpkale thanks for the RFC, I think I got the idea but have a question (my apologies if I am missing something): where are the timestamps (or epochs, as you refer to them) coming from?
UPD: really sorry for the timing, but this is the same question @linuxpi is asking (one of them).
Thanks for the review @linuxpi and @reta
> How does index level snapshot/restore work? Do we handle it later at restore time, or do we associate a list of indices with a timestamp when acquiring the snapshot? This might affect how garbage collection happens for each index.
Initially, index level snapshots will not be supported for snapshots that use pinned timestamps. Index level restore will be supported in the same way it works today. I haven't given a lot of thought to how to support index level snapshots, but the format of pinned timestamps would need to change in the way you suggested.
> Since we are implicitly relying on timestamps here, do users need to be mindful of clock synchronization between the nodes now?
Good question. Timestamps on different servers in a cluster need not be exactly the same (explained in the next para), but yes, users need to make sure that the difference is not very high. If we use existing cloud services, they promise microsecond-level accuracy (Example: https://aws.amazon.com/blogs/compute/its-about-time-microsecond-accurate-clocks-on-amazon-ec2-instances/)
Why don't we need timestamps to be synchronised across nodes in the cluster? Currently, when we take a snapshot, it is not guaranteed that each shard will trigger a flush at the same time and upload the data. Based on the number of shards, the difference between the first and last shard getting snapshotted can be in minutes. With pinned timestamps, we will actually be minimising this difference, as each node will maintain the same state of pinned timestamps. The only difference would be the timestamp skew between nodes, which we expect to be a few seconds.
> Since we are making translog garbage collection aware of pinned timestamps, does this mean we would be holding extra translog data in the remote store?
Yes. In remote backed storage, we purge the remote translog on refresh. This means we will be holding translog data since the last refresh in the remote store.
> where are the timestamps (or epochs, as you refer to them) coming from?
Timestamp Pinning would be owned by remote backed storage. Snapshots would be one of its users. Initially, only snapshots would be pinning timestamps, but we plan to expose an API if required.
Also, how snapshots will pin timestamps will be covered in another RFC.
@sachinpkale Thanks for the RFC. Looking forward to lower-level information regarding garbage cleanup of pinned timestamp information during failure scenarios.
Thanks for the RFC @sachinpkale. Couple of questions.
Thanks for the review @backslasht
> How long does the translog have to be retained in the new approach?
With remote store, we retain remote translog since the last refresh. So, in this case, if we pin a timestamp at `07:00:00`, the segment metadata matching the timestamp is at `06:55:00`, and the translog metadata matching the timestamp is at `06:59:00`, then the remote translog will hold data since `06:55:00`.
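A tiny sketch of that retention window, using the illustrative timestamps above (names are made up for illustration, this is not OpenSearch code):

```java
import java.time.LocalTime;

// Illustrative only: translog retention window implied by the example above.
public class TranslogRetentionExample {
    public static void main(String[] args) {
        LocalTime pinnedTimestamp = LocalTime.parse("07:00:00");
        LocalTime matchingSegmentMetadata = LocalTime.parse("06:55:00");
        LocalTime matchingTranslogMetadata = LocalTime.parse("06:59:00");

        // Operations after the matching segment metadata are only recoverable from
        // the translog, so remote translog data is retained from that point onwards.
        LocalTime retainTranslogSince = matchingSegmentMetadata;
        System.out.println("pin=" + pinnedTimestamp + " -> retain remote translog since "
            + retainTranslogSince + " (matching translog metadata: " + matchingTranslogMetadata + ")");
    }
}
```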
> What would be the cost implications if any?
We will be retaining translog data since last refresh for a given snapshot.
> Are you considering supporting both timestamp based snapshots and lock based snapshots, or will only timestamp based snapshots be supported going forward?
We will be supporting lock based snapshots at least in 2.x to retain backwards compatibility. We can think of deprecating them as part of 3.x.
Thanks @sachinpkale for this detailed RFC. Very excited to see this. With this feature, we will be very close to supporting PITR. I have the following comments/queries:
> Failures can still happen while uploading snapshot metadata but snapshot status would either be successful or failed. We will not have partial snapshot state.
Why is this the case? Are we saying we will mark a snapshot as failed if uploading snapshot metadata for any index of that snapshot fails?
> To avoid triggering flush/refresh on each shard and handling potential failures, in this approach, we make translog garbage collector aware of snapshot locks.
Looks like we support this pinning for segment data and translog data. Since we support capturing cluster state in a snapshot as well, do we plan to do something similar for remote cluster state in the future?
Not totally related (or maybe this will be discussed as part of the design), but another issue we had with shallow snapshots was: once an index is deleted, the snapshot layer had to take care of remote store cleanup. With this approach, I can see that we do not need any direct communication between the snapshot layer and the remote store. So are we planning to introduce some other cluster-level garbage collector that would take care of pinned metadata cleanup after the index is deleted?
> Looks like we support this pinning for segment data and translog data. Since we support capturing cluster state in a snapshot as well, do we plan to do something similar for remote cluster state in the future?
Yes, we plan to support pinning of cluster state as well.
> With this approach, I can see that we do not need any direct communication between the snapshot layer and the remote store.
We still need the same cleanup approach as of today.
Goal
Today, for clusters with the remote backed storage feature, we use a variant of snapshots called shallow snapshots. Shallow snapshots refer to data that is already uploaded as part of remote store. In order to prevent deletion of remote store data that is referred to by shallow snapshots, we need a locking mechanism that is used by remote store garbage collection. In this RFC, we discuss the current locking mechanism and its shortcomings, and propose a new mechanism that scales independently of the number of shards/indices/nodes in the cluster. We also discuss how this new approach can be evolved into PITR (point-in-time restore).
Current Locking Mechanism
As part of a shallow snapshot, a `<metadata_filename>__<snapshot_id>` file is created under the `lock` directory in the remote store.
Sequence Diagram
Issues with Current Locking Mechanism
Instead of being lightweight, shallow snapshots become bulky.
Requirements
Timestamp Based Implicit Locking
In this approach, we will move away from explicit lock file creation for a given metadata file. Instead, we will use the timestamp in the metadata filename to acquire an implicit lock (refer to the Metadata Filename Format section in the Appendix for more details on metadata filenames). We call this Timestamp Pinning.
Proposed Pinned Timestamp Format
Approach
We maintain a list of pinned timestamps at the cluster level. For each timestamp in this list, garbage collection for segments as well as translog will skip deletion of the metadata file that matches (Appendix: Metadata file matching a timestamp) the pinned timestamp. To avoid triggering flush/refresh on each shard and handling potential failures, in this approach, we make translog garbage collector aware of snapshot locks.
Steps
- Pinned timestamps for the cluster are stored in `remote_store_pinned_timestamps`, and garbage collection reads `remote_store_pinned_timestamps` before each run.
- If the last read of `remote_store_pinned_timestamps` is > X mins old, skip garbage collection.
- If the timestamp of metadata file `md1` is > `pinned_timestamp_a` and the timestamp of the next metadata file `md2` is <= `pinned_timestamp_a`, add `md2` to `pinned_metadata_files`.
- Garbage collection skips deletion of `pinned_metadata_files` and corresponding data files.
- Restore takes a pinned `timestamp` to restore data to.
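A minimal sketch of the garbage-collection side of these steps. All class, method, and field names here are illustrative, the metadata list is assumed to be sorted newest first, and this is not the actual OpenSearch implementation:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch only, not OpenSearch code. MetadataFile stands in for a remote segment
// or translog metadata file together with the timestamp encoded in its name.
public class PinnedTimestampGcSketch {

    record MetadataFile(String name, long timestampMillis) {}

    // The "X mins" freshness bound on the locally cached pinned timestamps.
    static final long MAX_PINNED_TIMESTAMP_STALENESS_MILLIS = 5 * 60 * 1000L;

    // Returns the metadata files that garbage collection must skip.
    // `metadataFiles` is assumed to be sorted newest first.
    static Set<String> pinnedMetadataFiles(List<MetadataFile> metadataFiles,
                                           List<Long> pinnedTimestamps,
                                           long pinnedTimestampsFetchedAtMillis,
                                           long nowMillis) {
        // Stale view of remote_store_pinned_timestamps: skip garbage collection
        // entirely rather than risk deleting data pinned by a newer timestamp.
        if (nowMillis - pinnedTimestampsFetchedAtMillis > MAX_PINNED_TIMESTAMP_STALENESS_MILLIS) {
            throw new IllegalStateException("pinned timestamps are stale, skip garbage collection");
        }
        Set<String> pinned = new HashSet<>();
        for (long pinnedTimestamp : pinnedTimestamps) {
            // Walk newest to oldest: when md1 is newer than the pin and the next
            // (older) file md2 is at or before the pin, md2 matches the pin.
            for (int i = 0; i < metadataFiles.size() - 1; i++) {
                MetadataFile md1 = metadataFiles.get(i);
                MetadataFile md2 = metadataFiles.get(i + 1);
                if (md1.timestampMillis() > pinnedTimestamp && md2.timestampMillis() <= pinnedTimestamp) {
                    pinned.add(md2.name());
                    break;
                }
            }
        }
        return pinned;
    }

    public static void main(String[] args) {
        List<MetadataFile> files = List.of(
            new MetadataFile("md4", 1745L),
            new MetadataFile("md3", 1705L),
            new MetadataFile("md2", 1659L),
            new MetadataFile("md1", 1605L));
        // Pin at 1700: md3 (1705) is after the pin, md2 (1659) is the newest file
        // at or before it, so md2 and its data files survive garbage collection.
        System.out.println(pinnedMetadataFiles(files, List.of(1700L), 0L, 0L));
    }
}
```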
Sequence Diagram
Pros
Cons
Extending the approach to PITR
As this approach uses timestamp based pinning, it can be extended to point-in-time restore. As pinning a timestamp does not involve multiple remote store or node-to-node calls, we can support timestamp pinning at lower granularity. To avoid the synchronisation delay between pinning a timestamp and communicating it to data nodes, in PITR we can provide the capability of fixed intervals. With this, we can support PITR granularity as low as 1 minute (we need to control retention based on granularity). Pinning the timestamp can still be supported for on-demand cases.
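A minimal sketch of the fixed-interval idea, assuming a pin is simply the enclosing interval boundary; the interval length and the class/method names are illustrative, not part of the proposal:

```java
import java.time.Duration;
import java.time.Instant;

// Illustrative only: with fixed-interval pinning, every node can derive the same
// pinned timestamps locally, so only the interval and retention need to be agreed upon.
public class FixedIntervalPinningSketch {

    // Round an arbitrary instant down to the enclosing interval boundary.
    static Instant intervalPin(Instant now, Duration interval) {
        long intervalMillis = interval.toMillis();
        long floored = (now.toEpochMilli() / intervalMillis) * intervalMillis;
        return Instant.ofEpochMilli(floored);
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2024-07-05T17:00:42Z");
        // With a 1-minute interval, the effective pin for this instant is 17:00:00.
        System.out.println(intervalPin(now, Duration.ofMinutes(1))); // 2024-07-05T17:00:00Z
    }
}
```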
Appendix
Metadata file matching a timestamp
A metadata file matches a given timestamp `T` if it has the max timestamp among all metadata files with timestamp at most `T`.

For example, say the pinned timestamp is `2024/07/05 17:00:00` and we have the following options:

- `metadata_2024_07_05_16_05_51`
- `metadata_2024_07_05_16_25_34`
- `metadata_2024_07_05_16_56_47`
- `metadata_2024_07_05_16_58_21`
- `metadata_2024_07_05_16_59_35`
- `metadata_2024_07_05_17_00_09`
- `metadata_2024_07_05_17_45_12`

Here, `metadata_2024_07_05_16_59_35` is considered the metadata file that matches the given timestamp `T`.
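A minimal sketch of this matching rule against the filenames above. The timestamp-in-the-name parsing is simplified for illustration; actual metadata filenames encode an inverted epoch, as described in the next section:

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Sketch only: picks the metadata file with the max timestamp that is at most T.
public class MetadataMatchingSketch {

    static final DateTimeFormatter FORMAT = DateTimeFormatter.ofPattern("yyyy_MM_dd_HH_mm_ss");

    // Illustrative parsing for names of the form metadata_yyyy_MM_dd_HH_mm_ss.
    static LocalDateTime timestampOf(String metadataFile) {
        return LocalDateTime.parse(metadataFile.substring("metadata_".length()), FORMAT);
    }

    static Optional<String> matching(List<String> metadataFiles, LocalDateTime t) {
        return metadataFiles.stream()
            .filter(f -> !timestampOf(f).isAfter(t))                            // timestamp at most T
            .max(Comparator.comparing(MetadataMatchingSketch::timestampOf));    // max among those
    }

    public static void main(String[] args) {
        List<String> files = List.of(
            "metadata_2024_07_05_16_05_51",
            "metadata_2024_07_05_16_25_34",
            "metadata_2024_07_05_16_56_47",
            "metadata_2024_07_05_16_58_21",
            "metadata_2024_07_05_16_59_35",
            "metadata_2024_07_05_17_00_09",
            "metadata_2024_07_05_17_45_12");
        LocalDateTime t = LocalDateTime.of(2024, 7, 5, 17, 0, 0);
        System.out.println(matching(files, t)); // Optional[metadata_2024_07_05_16_59_35]
    }
}
```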
Metadata Filename Format
Remote Segment Store
metadata__<Inverted Primary Term>__<Inverted Commit Generation>__<Inverted Translog Generation>__<Inverted Refresh Counter>__<Node ID>__<Inverted EPOCH>__<Metadata Version>
metadata__9223372036854775806__9223372036854775796__9223372036854775647__9223372036854775883__-396831118__9223370334830299234__1
Remote Translog
metadata__<Inverted Primary Term>__<Inverted Translog Generation>__<Inverted EPOCH>__<Node ID>__<Metadata Version>
metadata__9223372036854775806__9223372036854775648__9223370334830643807__-396831118__1
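A minimal sketch of recovering the fields from such a name, assuming the inversion is `Long.MAX_VALUE - value` (consistent with the example above, where the inverted primary term `9223372036854775806` corresponds to primary term 1); the class and method names are illustrative, not OpenSearch APIs:

```java
// Sketch only: split a remote translog metadata filename into its fields and
// undo the inversion, assuming inversion = Long.MAX_VALUE - value.
public class InvertedEpochSketch {

    static long invert(long value) {
        return Long.MAX_VALUE - value;
    }

    public static void main(String[] args) {
        String name = "metadata__9223372036854775806__9223372036854775648__9223370334830643807__-396831118__1";
        String[] parts = name.split("__");
        long primaryTerm = invert(Long.parseLong(parts[1]));        // 1
        long translogGeneration = invert(Long.parseLong(parts[2])); // 159
        long epoch = invert(Long.parseLong(parts[3]));              // 1702024132000, the timestamp that pinning matches against
        String nodeId = parts[4];                                   // -396831118
        System.out.println(primaryTerm + " " + translogGeneration + " " + epoch + " " + nodeId);
    }
}
```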
Existing Remote Store Garbage Collection Example
Remote Store Garbage Collection Example with Pinned Timestamps