
[RFC] Optimized Prefix Pattern for Shard-Level Files for Efficient Snapshots #15146

Closed. ashking94 closed this issue 2 months ago.

ashking94 commented 3 months ago

TL;DR - This RFC, inspired by https://github.com/opensearch-project/OpenSearch/issues/12567, proposes an optimized prefix pattern for shard-level files in a snapshot.

Problem statement

Snapshots are backups of a cluster's indexes and state, including cluster settings, node information, index metadata, and shard allocation information. They are used to recover from failures, such as a red cluster, or to move data between clusters without loss. Snapshots are stored in a repository in a hierarchical manner that represents the composition of shards, indexes, and the cluster. However, this structure poses a scaling challenge when there are numerous shards due to limitations on concurrent operations over a fixed prefix in a remote store. In this RFC, we discuss various aspects to achieve a solution that scales well with a high number of shards.

Current repository structure

Below is the current structure of the snapshot. (Image used from https://opensearch.org/blog/snapshot-operations/.)

The files created once per snapshot, or once per index per snapshot, are relatively immune to throttling because they are few in number and are uploaded by the active cluster manager using only five snapshot threads. These files include:

  1. index-N (RepositoryData)
  2. index.latest (points to the latest generation of RepositoryData)
  3. incompatible-snapshots (list of snapshot IDs that are no longer compatible with the current OpenSearch version)
  4. snap-uuid.dat (SnapshotInfo for the snapshot)
  5. meta-uuid.dat (Metadata for the snapshot)
  6. indices/Xy1234-z_x/meta-uuid.dat (IndexMetadata for the index)

Files susceptible to throttling are created on data nodes, generally per primary shard per snapshot:

  1. indices/Xy1234-z_x/0/__VP05oDMVT & more (Files mapped to real segment files)
  2. indices/Xy1234-z_x/0/snap-uuid.dat (BlobStoreIndexShardSnapshot)
  3. indices/Xy1234-z_x/0/index-uuid (BlobStoreIndexShardSnapshots)
  4. Similar folders for different shards

These paths are for the index whose UUID in the snapshot repository is Xy1234-z_x; there will be similar folders for the other indexes.
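To make the layout concrete, below is a minimal sketch of how a shard-level key is composed under the current structure; the class, helper name, and parameters are illustrative only, not the actual repository API.

```java
// Illustrative only: every shard's files hang off the single fixed "<BASE_PATH>/indices/" prefix.
// The class and helper names here are hypothetical, not the actual OpenSearch repository code.
public final class FixedShardPathSketch {

    static String shardContainerPath(String basePath, String indexUUID, int shardId) {
        // e.g. "<BASE_PATH>/indices/Xy1234-z_x/0"
        return String.join("/", basePath, "indices", indexUUID, String.valueOf(shardId));
    }

    public static void main(String[] args) {
        System.out.println(shardContainerPath("<BASE_PATH>", "Xy1234-z_x", 0) + "/snap-uuid.dat");
        System.out.println(shardContainerPath("<BASE_PATH>", "Xy1234-z_x", 0) + "/index-uuid");
    }
}
```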

Issue with current structure

The existing structure leads to throttling in clusters with a high shard count, resulting in longer snapshot creation and deletion times due to retries. In worst-case scenarios, this can lead to partial or failed snapshots.

Requirements

  1. Ensure smooth snapshot operations (create, delete) without throttling for high shard counts.
  2. Implement a scalable prefix strategy to handle increasing concurrency on the remote store as shard count grows.

Proposed solution

Introduce a prefix pattern, accepted by multiple repository providers (e.g., AWS S3, GCP Storage, Azure Blob Storage), that maximizes the spread of data across prefixes. The general recommendation from these providers is to spread data across as many prefixes as possible, which allows them to scale better. I propose to introduce this prefix strategy for shard-level files. A similar strategy has already been introduced in https://github.com/opensearch-project/OpenSearch/issues/12567 for remote store shard-level data and metadata files.
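As a rough illustration of the idea (and not the exact algorithm adopted in https://github.com/opensearch-project/OpenSearch/issues/12567), a hash of the shard's logical path can be promoted to the front of the key so that shard-level files fan out across many prefixes. A minimal sketch, assuming an FNV-1a hash and a binary rendering of its low bits; all names here are hypothetical:

```java
// Illustrative sketch of a hashed-prefix key layout: hash the shard's logical path
// (index UUID + shard id) and place the hash at the front of the key so different
// shards land under different prefixes. Not the exact algorithm used by OpenSearch.
public final class HashedPrefixSketch {

    // Simple 64-bit FNV-1a hash; assumed here purely for illustration.
    static long fnv1a64(String input) {
        long hash = 0xcbf29ce484222325L;
        for (byte b : input.getBytes(java.nio.charset.StandardCharsets.UTF_8)) {
            hash ^= (b & 0xffL);
            hash *= 0x100000001b3L;
        }
        return hash;
    }

    // Renders the low bits of the hash as a fixed-width binary string used as the leading prefix.
    static String hashedPrefix(String indexUUID, int shardId) {
        long h = fnv1a64(indexUUID + "/" + shardId);
        String bits = Long.toBinaryString(h & 0x7FFF);          // 15 low bits, e.g. "10110111010101"
        return String.format("%15s", bits).replace(' ', '0');   // left-pad to a fixed width
    }

    public static void main(String[] args) {
        String indexUUID = "U-DBzYuhToOfmgahZonF0w";
        int shardId = 1;
        // e.g. <ROOT>/<hashed-prefix>/indices/<shardId>/<indexUUID>/<shard files>
        System.out.println(hashedPrefix(indexUUID, shardId) + "/indices/" + shardId + "/" + indexUUID + "/");
    }
}
```

Because each index UUID and shard id pair hashes to a different leading prefix, concurrent uploads spread across the store's request-rate partitions instead of piling onto a single fixed prefix.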

Key changes in the proposed structure:

High level approach

Store the path type in customData within IndexMetadata, which is already stored during snapshot creation.
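A minimal sketch of how that could look, assuming the IndexMetadata.Builder#putCustom and IndexMetadata#getCustomData APIs; the custom key "snapshot_shard_path" and field "path_type" are illustrative names, not the final ones:

```java
import org.opensearch.cluster.metadata.IndexMetadata;

import java.util.Map;

// Illustrative sketch: record the path type used for this index's shard-level snapshot
// files in IndexMetadata customData, so restores and deletes can resolve the same paths later.
// The key "snapshot_shard_path" and field "path_type" are hypothetical names.
public final class ShardPathCustomData {

    static IndexMetadata withShardPathType(IndexMetadata current, String pathType) {
        return IndexMetadata.builder(current)
            .putCustom("snapshot_shard_path", Map.of("path_type", pathType))
            .build();
    }

    static String readShardPathType(IndexMetadata metadata) {
        Map<String, String> custom = metadata.getCustomData("snapshot_shard_path");
        // Fall back to the fixed layout when the custom entry is absent (older snapshots).
        return custom != null ? custom.getOrDefault("path_type", "FIXED") : "FIXED";
    }
}
```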

Cloud neutral solution

This approach is supported by multiple cloud providers:

  1. GCP Storage - https://cloud.google.com/storage/docs/request-rate#ramp-up
  2. AWS S3 - https://repost.aws/knowledge-center/http-5xx-errors-s3
  3. Azure Blob Storage - https://learn.microsoft.com/en-us/azure/storage/blobs/storage-performance-checklist

Proposed repository structure

<ROOT>
β”œβ”€β”€ 210101010101101
β”‚   └── indices
β”‚       └── 1
β”‚           └── U-DBzYuhToOfmgahZonF0w
β”‚               β”œβ”€β”€ __C-Fa1S-NSbqEWJU3h1tdVQ
β”‚               β”œβ”€β”€ __EeKqbgKCRHuHn-EF5GzlJA
β”‚               β”œβ”€β”€ __JS_z4nUURmSsyFVsmgmBcA
β”‚               β”œβ”€β”€ __a2cczAV8Sg2hcqs1uQBryA
β”‚               β”œβ”€β”€ __d0CF9LqhTN6REWmuLzO96g
β”‚               β”œβ”€β”€ __gGvHzcveQwiXknAWxor2IQ
β”‚               β”œβ”€β”€ __jlv0IqxNRGqbmXZNQ3cr4A
β”‚               β”œβ”€β”€ __oEfbUhM3QvOh7EDg1lzlnQ
β”‚               β”œβ”€β”€ index-Y99a4XQjRwmVM5s5uIrFRw
β”‚               β”œβ”€β”€ snap-L7HfWiUYSj6as4jSi3yKeg.dat
β”‚               └── snap-QajL9GX7Tj2Itdbm5GzAqQ.dat
β”œβ”€β”€ 910010111101110
β”‚   └── indices
β”‚       └── 0
β”‚           └── LsKqUfn2ROmjAkW6Rec1lQ
β”‚               β”œβ”€β”€ __8FggEzgeRXaXUQY3U6Xlmg
β”‚               β”œβ”€β”€ __F1iiaK-uTcmUD608gQ54Ng
β”‚               β”œβ”€β”€ __JW0cFZiVTzaAOdzK03bsEw
β”‚               β”œβ”€β”€ __N19wYbk0TVOqkU8NJw6ptA
β”‚               β”œβ”€β”€ __dE6opnQ6QLmChNwQ4i3L6w
β”‚               β”œβ”€β”€ __hylfOCALSPiLwRNfR27wbA
β”‚               β”œβ”€β”€ __rNFtqVvuRnOomXxeF0hLfg
β”‚               β”œβ”€β”€ __t8fnoBiATiqR8N9T2Qiejw
β”‚               β”œβ”€β”€ index-cEnUDNHXTveF6OR779cKSA
β”‚               β”œβ”€β”€ snap-L7HfWiUYSj6as4jSi3yKeg.dat
β”‚               └── snap-QajL9GX7Tj2Itdbm5GzAqQ.dat
β”œβ”€β”€ M11010111001101
β”‚   └── indices
β”‚       └── 0
β”‚           └── U-DBzYuhToOfmgahZonF0w
β”‚               β”œβ”€β”€ __EKjl8lZlTCuUADJKlCM1GQ
β”‚               β”œβ”€β”€ __Gd_-r-qlTve8B-VwxKicWA
β”‚               β”œβ”€β”€ __f9qc9DWARWu6cw41q0pbOw
β”‚               β”œβ”€β”€ __g_dEX3n3TISxMmO_SDmjYg
β”‚               β”œβ”€β”€ __jjBTwJCrQKuT6gPxfNRbag
β”‚               β”œβ”€β”€ __neVNkcaOSjuNAW_J9XiPag
β”‚               β”œβ”€β”€ __ubsXbHN5QMi6dX0ragf-TQ
β”‚               β”œβ”€β”€ __yzFDy62NTuildlyJc0DknA
β”‚               β”œβ”€β”€ index-5RTtDql2Q9CBrBVcKgfI5Q
β”‚               β”œβ”€β”€ snap-L7HfWiUYSj6as4jSi3yKeg.dat
β”‚               └── snap-QajL9GX7Tj2Itdbm5GzAqQ.dat
β”œβ”€β”€ index-3
β”œβ”€β”€ index.latest
β”œβ”€β”€ indices
β”‚   β”œβ”€β”€ LsKqUfn2ROmjAkW6Rec1lQ
β”‚   β”‚   └── meta-eNNJLJEBNa6hCAn8tcSn.dat
β”‚   └── U-DBzYuhToOfmgahZonF0w
β”‚       └── meta-d9NJLJEBNa6hCAn8tcSn.dat
β”œβ”€β”€ meta-L7HfWiUYSj6as4jSi3yKeg.dat
β”œβ”€β”€ meta-QajL9GX7Tj2Itdbm5GzAqQ.dat
β”œβ”€β”€ snap-L7HfWiUYSj6as4jSi3yKeg.dat
β”œβ”€β”€ snap-QajL9GX7Tj2Itdbm5GzAqQ.dat
└── w10110111010101
    └── indices
        └── 1
            └── LsKqUfn2ROmjAkW6Rec1lQ
                β”œβ”€β”€ __ABzT7uaYQq2_WpF22ewuJg
                β”œβ”€β”€ __PTGQRYT_QyeAZ7llHCHP_Q
                β”œβ”€β”€ __ROWGAvOQRWarW9K6c_8BuQ
                β”œβ”€β”€ __RYbp_hw2Qb2T9rEECYZM5w
                β”œβ”€β”€ __YFbMl9ZNRMqm9i7ItDkz4Q
                β”œβ”€β”€ __bGaQqDsPR--08ifzAOA23A
                β”œβ”€β”€ __dTPVxsjCTfqeUg37B8_zUg
                β”œβ”€β”€ __z1N6Uz2MR26GoNjLuU_LQA
                β”œβ”€β”€ index-Mv_2sClsS1Wf0ynx8c697g
                β”œβ”€β”€ snap-L7HfWiUYSj6as4jSi3yKeg.dat
                └── snap-QajL9GX7Tj2Itdbm5GzAqQ.dat
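Reads then branch on the stored path type: older snapshots keep resolving to the fixed <BASE_PATH>/indices/... layout, while new ones resolve to the hashed prefix shown above. A rough sketch of that branching, reusing the hypothetical helpers from the earlier snippets; this is not the actual resolution code:

```java
// Illustrative sketch: resolve the shard-level blob path from the path type stored in
// IndexMetadata customData. Enum values and helpers are hypothetical, reused from the
// earlier snippets.
enum ShardPathType { FIXED, HASHED_PREFIX }

final class ShardPathResolverSketch {

    static String resolve(ShardPathType type, String basePath, String indexUUID, int shardId) {
        String base = basePath.isEmpty() ? "" : basePath + "/";
        switch (type) {
            case HASHED_PREFIX: {
                // New layout: <ROOT>/<hashed-prefix>[/<basePath>]/indices/<shardId>/<indexUUID>
                // (a non-empty base path is shown after the hashed prefix, per the comments below)
                String prefix = HashedPrefixSketch.hashedPrefix(indexUUID, shardId);
                return prefix + "/" + base + "indices/" + shardId + "/" + indexUUID;
            }
            case FIXED:
            default:
                // Existing layout: <BASE_PATH>/indices/<indexUUID>/<shardId>
                return base + "indices/" + indexUUID + "/" + shardId;
        }
    }
}
```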

Appendix

Sample current repository structure

<BASE_PATH>
β”œβ”€β”€ index-1
β”œβ”€β”€ index.latest
β”œβ”€β”€ indices
β”‚   β”œβ”€β”€ OQb-77UPRLmek3IzfRR3Fg
β”‚   β”‚   β”œβ”€β”€ 0
β”‚   β”‚   β”‚   β”œβ”€β”€ __3lX2qNtkShSMD2zhbW2hww
β”‚   β”‚   β”‚   β”œβ”€β”€ __5ILryCGbSd-ojWvIxAK7Mw
β”‚   β”‚   β”‚   β”œβ”€β”€ __UHpuUxuoTyqrc5kzCmSAHA
β”‚   β”‚   β”‚   β”œβ”€β”€ __c7Tuu2zRSROfiy7UCq7KOQ
β”‚   β”‚   β”‚   β”œβ”€β”€ __d1jt099nQSycUnO4mx9XXQ
β”‚   β”‚   β”‚   β”œβ”€β”€ __oUVAcviJRdu0kWqYQyLrnA
β”‚   β”‚   β”‚   β”œβ”€β”€ index-GoobOcTzQmqG5LI67Y7dvg
β”‚   β”‚   β”‚   β”œβ”€β”€ snap-4JlPnU7nT5O1dzW7SP318Q.dat
β”‚   β”‚   β”‚   └── snap-I0esTsK9Qiy114LI1xFD4Q.dat
β”‚   β”‚   β”œβ”€β”€ 1
β”‚   β”‚   β”‚   β”œβ”€β”€ __DRjMuHHKRa-FluuZl94qLQ
β”‚   β”‚   β”‚   β”œβ”€β”€ __EDoFw1gqQQastQjG8bat1Q
β”‚   β”‚   β”‚   β”œβ”€β”€ __GypZv7z0SUu6LNiML4sefQ
β”‚   β”‚   β”‚   β”œβ”€β”€ __ZP1jsyQnTQOMIO9Vx2n5Eg
β”‚   β”‚   β”‚   β”œβ”€β”€ __nrm3d_o9Skqw3tAyyPnr7A
β”‚   β”‚   β”‚   β”œβ”€β”€ __qI-jax0aTaycxgZkzsKmRw
β”‚   β”‚   β”‚   β”œβ”€β”€ index-WsEwZHu6QZ-_Ub5h8p0AGA
β”‚   β”‚   β”‚   β”œβ”€β”€ snap-4JlPnU7nT5O1dzW7SP318Q.dat
β”‚   β”‚   β”‚   └── snap-I0esTsK9Qiy114LI1xFD4Q.dat
β”‚   β”‚   └── meta-UV9P7pABMtEAgBnTfhV6.dat
β”‚   └── gYpI7EgrQlqGqiKBWnp1bw
β”‚       β”œβ”€β”€ .DS_Store
β”‚       β”œβ”€β”€ 0
β”‚       β”‚   β”œβ”€β”€ __ATpweBS4RV2e7-iWmOItfg
β”‚       β”‚   β”œβ”€β”€ __Kv_a0dSNQfyzSlkiONETnw
β”‚       β”‚   β”œβ”€β”€ __Q3EuuawlQQq2Cm2rHZpYRw
β”‚       β”‚   β”œβ”€β”€ __kHrmxfa5RAuHhjEEeP8czA
β”‚       β”‚   β”œβ”€β”€ index--0B07yZUQr63TPhdStkDzQ
β”‚       β”‚   β”œβ”€β”€ snap-4JlPnU7nT5O1dzW7SP318Q.dat
β”‚       β”‚   └── snap-I0esTsK9Qiy114LI1xFD4Q.dat
β”‚       β”œβ”€β”€ 1
β”‚       β”‚   β”œβ”€β”€ __N2Imh37BSQGwFVmg5GQAsg
β”‚       β”‚   β”œβ”€β”€ __OVbkh2p-QxGjFfMM-XuQ-g
β”‚       β”‚   β”œβ”€β”€ __r2989-oaR_S--xOC-1Mlgw
β”‚       β”‚   β”œβ”€β”€ __r5yJ8QY-TdOK9Uy5cmozHw
β”‚       β”‚   β”œβ”€β”€ index-oqu0C2_PTXamVfS7BCpH1A
β”‚       β”‚   β”œβ”€β”€ snap-4JlPnU7nT5O1dzW7SP318Q.dat
β”‚       β”‚   └── snap-I0esTsK9Qiy114LI1xFD4Q.dat
β”‚       └── meta-UF9P7pABMtEAgBnTfhV6.dat
β”œβ”€β”€ meta-4JlPnU7nT5O1dzW7SP318Q.dat
β”œβ”€β”€ meta-I0esTsK9Qiy114LI1xFD4Q.dat
β”œβ”€β”€ snap-4JlPnU7nT5O1dzW7SP318Q.dat
└── snap-I0esTsK9Qiy114LI1xFD4Q.dat
sachinpkale commented 3 months ago

Thanks @ashking94 for the proposal.

In the above example, the base path is empty or not present. If there were a base path, it would show up between the first and second path elements.

Currently, the entire snapshot contents (data + metadata) are stored within the provided base path. With the new approach, this is not true anymore (correct me if I am wrong).

This may change the way users organise snapshots. For example, I might run 5 clusters, each using the repository-s3 plugin for snapshots against the bucket s3://snapshot_bucket with base paths cluster1 through cluster5, and clean up the corresponding base path once a particular cluster is no longer needed and has been deleted.

To keep the existing behaviour consistent, does it make sense to introduce the proposed changes as a new type of snapshot?

ashking94 commented 3 months ago

> Currently, the entire snapshot contents (data + metadata) are stored within the provided base path. With the new approach, this is not true anymore (correct me if I am wrong).

That's right.

> This may change the way users organise snapshots. For example, I might run 5 clusters, each using the repository-s3 plugin for snapshots against the bucket s3://snapshot_bucket with base paths cluster1 through cluster5, and clean up the corresponding base path once a particular cluster is no longer needed and has been deleted.

We will have a fixed path where we will upload a record of all the different paths under which a cluster's data is stored. Also, this will be disabled by default on a cluster so as not to break backward compatibility.

> To keep the existing behaviour consistent, does it make sense to introduce the proposed changes as a new type of snapshot?

I am still debating how to support the new mode alongside data that has already been uploaded under the fixed path. Apart from that, we would also need the capability to enable the new mode for repository r1 but not for repository r2. Let me cover these in the PRs.

harishbhakuni commented 2 months ago

Thanks @ashking94 for the proposal. Along similar lines to what Sachin mentioned: today, an admin-level user can grant path-level access to different users in a bucket, and those users can provide those paths as base paths and use them across multiple clusters. It looks like this feature would not work for those use cases; to use it, a user must have root-level access to the bucket.

ashking94 commented 2 months ago

> Thanks @ashking94 for the proposal. Along similar lines to what Sachin mentioned: today, an admin-level user can grant path-level access to different users in a bucket, and those users can provide those paths as base paths and use them across multiple clusters. It looks like this feature would not work for those use cases; to use it, a user must have root-level access to the bucket.

Thanks for your comment, @harishbhakuni. The access would need to be provided to the cluster at the bucket level. Having more clusters would allow the autoscaling to work even better. We definitely need to build some mechanism to segregate access to the domain-level paths based on the base-path substring in the key path.