Thanks @ashking94 for the proposal.
In the above example the base path is empty or not set. If there were a base path, it would appear between the first and second elements of the path.
Currently, the entire snapshot contents (data + metadata) are stored within the provided base path. With the new approach, this is no longer true (correct me if I am wrong).
This may change the way users organise snapshots (for example, I could be running 5 clusters, each using the repository-s3 plugin for snapshots against the bucket s3://snapshot_bucket with base paths cluster1/2/3/4/5, and clean up the corresponding base path once a particular cluster is no longer needed and is deleted).
To keep the existing behaviour consistent, does it make sense to introduce the proposed changes as a new type of snapshot?
> Currently, the entire snapshot contents (data + metadata) are stored within the provided base path. With the new approach, this is no longer true (correct me if I am wrong).
That's right.
> This may change the way users organise snapshots (for example, I could be running 5 clusters, each using the repository-s3 plugin for snapshots against the bucket s3://snapshot_bucket with base paths cluster1/2/3/4/5, and clean up the corresponding base path once a particular cluster is no longer needed and is deleted).
We will have a fixed path where we upload the list of all the different paths under which data for a cluster is stored. Also, this will be disabled by default on a cluster so as not to break backward compatibility.
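To make the idea concrete, here is a minimal sketch of such a fixed-path manifest. The prefix name `cluster-paths`, the helper names, and the one-path-per-line format are hypothetical and only for illustration, not part of the proposal:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

/**
 * Minimal sketch (not the actual implementation) of the idea above: keep a
 * manifest at a fixed, well-known prefix that records every hashed path used by
 * a cluster, so tooling can still discover and clean up all of a cluster's data.
 */
public class PathManifestSketch {

    // Hypothetical fixed prefix under the repository base path.
    private static final String FIXED_MANIFEST_PREFIX = "cluster-paths";

    /** Builds the manifest key for a cluster, e.g. "cluster-paths/{cluster-uuid}". */
    static String manifestKey(String clusterUUID) {
        return FIXED_MANIFEST_PREFIX + "/" + clusterUUID;
    }

    /** Writes one data path per line; a real implementation would upload this blob to the repository. */
    static void writePathManifest(Path localFile, List<String> dataPaths) throws IOException {
        Files.write(localFile, String.join("\n", dataPaths).getBytes(StandardCharsets.UTF_8));
    }
}
```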
> To keep the existing behaviour consistent, does it make sense to introduce the proposed changes as a new type of snapshot?
I am still debating how to support the new mode alongside data that has already been uploaded in the fixed path. Apart from that, we would also need the capability to enable the new mode only for repository r1 but not repository r2. Let me cover these in the PRs.
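As an illustration of the per-repository gating mentioned above, a repository-level flag could look roughly like this, assuming OpenSearch's Settings API; the setting key `prefix_mode.enabled` is a hypothetical name for this sketch:

```java
import org.opensearch.common.settings.Settings;

/**
 * Minimal sketch of gating the new path mode per repository. The setting key is
 * hypothetical; it defaults to false to preserve the existing behaviour.
 */
public class PrefixModeRepositorySetting {

    // Hypothetical repository-level setting key.
    static final String PREFIX_MODE_SETTING = "prefix_mode.enabled";

    /** Reads the flag from a repository's settings; repositories that never set it keep the old layout. */
    static boolean isPrefixModeEnabled(Settings repositorySettings) {
        return repositorySettings.getAsBoolean(PREFIX_MODE_SETTING, false);
    }
}
```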
Thanks @ashking94 for the proposal. Along similar lines to what Sachin mentioned: today an admin-level user can grant path-level access to different users in a bucket, and those users can provide those paths as base paths and use them across multiple clusters. It looks like this feature would not work for those use cases; to use it, a user must have root-level access on the bucket.
> Thanks @ashking94 for the proposal. Along similar lines to what Sachin mentioned: today an admin-level user can grant path-level access to different users in a bucket, and those users can provide those paths as base paths and use them across multiple clusters. It looks like this feature would not work for those use cases; to use it, a user must have root-level access on the bucket.
Thanks for your comment, @harishbhakuni. The access would need to be provided to the cluster at the bucket level. Having more clusters would allow autoscaling to work even better. We definitely need to build some mechanism to segregate access to the domain-level paths based on the base-path substring in the key path.
Problem statement
Snapshots are backups of a cluster's indexes and state, including cluster settings, node information, index metadata, and shard allocation information. They are used to recover from failures, such as a red cluster, or to move data between clusters without loss. Snapshots are stored in a repository in a hierarchical manner that represents the composition of shards, indexes, and the cluster. However, this structure poses a scaling challenge when there are numerous shards due to limitations on concurrent operations over a fixed prefix in a remote store. In this RFC, we discuss various aspects to achieve a solution that scales well with a high number of shards.
Current repository structure
Below is the current structure of the snapshot - Image used from https://opensearch.org/blog/snapshot-operations/.
The files created once per snapshot, or once per index per snapshot, are somewhat immune to throttling due to their smaller number; they are uploaded by the active cluster manager using only five snapshot threads. These files include:
Files susceptible to throttling are created on data nodes, generally per primary shard per snapshot:
This is for an index with snapshot UUID Xy1234-z_x; similarly, there will be more such folders for other indexes.
Issue with current structure
The existing structure leads to throttling in clusters with a high shard count, resulting in longer snapshot creation and deletion times due to retries. In worst-case scenarios, this can lead to partial or failed snapshots.
Requirements
Proposed solution
Introduce a prefix pattern accepted by multiple repository providers (e.g., AWS S3, GCP Storage, Azure Blob Storage) that maximises the spread of data across as many prefixes as possible, which is the general recommendation from these providers because it allows them to scale better. I propose to apply this prefix strategy to shard-level files. A similar strategy has already been introduced in https://github.com/opensearch-project/OpenSearch/issues/12567 for remote store shard-level data and metadata files.
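As an illustration only (not the exact hash function or key format adopted in the issue above), a hashed-prefix key for shard-level files could be derived as in the sketch below; SHA-256 and the 6-byte truncation are assumptions made for this sketch:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.Base64;

/**
 * Simplified sketch of a hashed-prefix key layout for shard-level snapshot files.
 * Not the exact algorithm used by OpenSearch; hash and truncation are illustrative.
 */
public class HashedPrefixPathSketch {

    /** e.g. shardPath("my/base", "Xy1234-z_x", 0) -> "{hash}/my/base/indices/Xy1234-z_x/0/" */
    static String shardPath(String basePath, String indexUUID, int shardId) throws NoSuchAlgorithmException {
        byte[] digest = MessageDigest.getInstance("SHA-256")
            .digest((indexUUID + "/" + shardId).getBytes(StandardCharsets.UTF_8));
        // A few bytes of the digest, URL-safe encoded, is enough to fan keys out across many prefixes.
        String hash = Base64.getUrlEncoder().withoutPadding().encodeToString(Arrays.copyOf(digest, 6));
        String base = basePath.isEmpty() ? "" : basePath + "/";
        return hash + "/" + base + "indices/" + indexUUID + "/" + shardId + "/";
    }
}
```

With such a layout the base path appears after the hash component rather than at the start of the key, which is the behaviour change discussed in the comments above.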
Key changes in the proposed structure:
High level approach
Store the path type in customData within IndexMetadata, which is already stored during snapshot creation.
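A minimal sketch of what this could look like, assuming IndexMetadata's custom-data API; the key `snapshot_path` and the field `path_type` are hypothetical names used only for illustration:

```java
import java.util.Map;
import org.opensearch.cluster.metadata.IndexMetadata;

/**
 * Minimal sketch of recording the path type in the index metadata stored with a
 * snapshot, and reading it back later. Key and value names are hypothetical.
 */
public class SnapshotPathTypeCustomData {

    static final String CUSTOM_KEY = "snapshot_path";

    /** Attach the path type used while writing this index's shard data. */
    static IndexMetadata withPathType(IndexMetadata indexMetadata, String pathType) {
        return IndexMetadata.builder(indexMetadata)
            .putCustom(CUSTOM_KEY, Map.of("path_type", pathType))
            .build();
    }

    /** Read it back during restore/delete; snapshots taken before the change fall back to the fixed path. */
    static String pathTypeOrDefault(IndexMetadata indexMetadata, String defaultType) {
        Map<String, String> custom = indexMetadata.getCustomData(CUSTOM_KEY);
        return custom == null ? defaultType : custom.getOrDefault("path_type", defaultType);
    }
}
```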
Cloud neutral solution
This approach is supported by multiple cloud providers:
Proposed repository structure
Appendix
Sample current repository structure