Open soosinha opened 1 year ago
Tagging @dblock @andrross @itiyamas @sohami @Bukhtawar for feedback
@soosinha @shwetathareja We've got an issue discussing a safety/correctness checker: #3866 Have you looked at creating (or updating the AL2.0-licensed models from pre-fork Elasticsearch) any kind of formal model? This is critical stuff and easy to get wrong in subtle ways and a formal model could give confidence that we're not introducing correctness issues.
This would eventually pave the way to make OpenSearch stateless where cluster manager nodes would only use remote state as the source of truth.
I could be wrong, but I suspect we'll have a hard time building a stateless solution on top of the object store-based repository interface given the object stores aren't really designed for doing things like a strongly consistent check-and-set operation. Maybe we don't need to address this just yet, but I'm thinking we may eventually need another abstraction over a consistent data store for something like this.
Thanks @andrross for the feedback.
Yes, we plan to run the existing TLA+ model for correctness. For Phase 1, it shouldn't require any changes since the cluster is running in remote backup mode. For the subsequent phases, when it runs in remote-only mode, we will update the model accordingly. Thanks for the callout. @soosinha this should be called out explicitly.
I could be wrong, but I suspect we'll have a hard time building a stateless solution on top of the object store-based repository interface given the object stores aren't really designed for doing things like a strongly consistent check-and-set operation. Maybe we don't need to address this just yet, but I'm thinking we may eventually need another abstraction over a consistent data store for something like this.
Yes, you are right, we will need a strongly consistent store with conditional updates once we want to achieve a stateless solution where we don't need 3 or more cluster manager nodes for quorum. The intention is to first provide durability using the same object store (with minimal friction, instead of adding a new repository type) that is used for shard (segment/translog) storage. The cluster manager still depends on the existing coordination code for quorum and correctness, while using remote storage instead of local storage for metadata management for better durability. Eventually, it should move away from any local coordination and work only with the remote store. At that point, it will not need 3 cluster manager nodes and can work with a leader + backup (for high availability). We will also need to handle consistency of in-memory data structures like shard routing information, in-progress snapshots, etc., which are not persisted on disk today.
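To make the "strongly consistent store with conditional updates" idea concrete, here is a hypothetical Java sketch of the kind of abstraction such a store would expose. This interface is purely illustrative and not part of the proposal; the names and shape are assumptions.

```java
import java.util.Optional;

/**
 * Hypothetical sketch (not part of the proposal): an abstraction over a strongly
 * consistent store supporting conditional (compare-and-set) updates, the kind of
 * primitive a fully stateless cluster manager would likely need.
 */
public interface ConsistentMetadataStore {

    /** A stored value together with the version used for optimistic concurrency. */
    final class VersionedValue {
        public final long version;
        public final byte[] value;

        public VersionedValue(long version, byte[] value) {
            this.version = version;
            this.value = value;
        }
    }

    /** Returns the latest committed value for the given key, if any. */
    Optional<VersionedValue> get(String key);

    /**
     * Writes newValue only if the stored version still equals expectedVersion.
     * Returns true if the conditional write succeeded, false if another writer won the race.
     */
    boolean compareAndSet(String key, long expectedVersion, byte[] newValue);
}
```

Plain object stores generally do not offer this compare-and-set semantic, which is why a separate abstraction (or a different backing service) may eventually be needed.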
We are going ahead with the initial implementation for 2.10 release. Meta issue for task list: https://github.com/opensearch-project/OpenSearch/issues/9344
A new mandatory param cluster-uuid will be added to the _remotestore/_restore API. When cluster-uuid is passed, the entire cluster state will be restored. In the first phase, we will be backing up only index metadata, so only index metadata will be restored. Internally, the cluster state with the highest term+version for the cluster UUID provided in the request will be fetched for restoration, and the entire index metadata will be restored, not just a few selected indices.
I don't think that asking the users to maintain a cluster-uuid is reasonable; we are effectively punting the metadata durability semantics to the user rather than maintaining them in the remote store and ensuring it is the sole custodian. This will lead to races in the system which might lead to bugs/data loss if the user creates or modifies metadata between resurrecting the cluster and restoring with a cluster UUID. I would suggest avoiding taking the cluster UUID as an input from the user.
This will lead to races in the system which might lead to bugs/data loss if the user creates or modifies metadata between resurrecting the cluster and restoring with a cluster UUID. I would suggest avoiding taking the cluster UUID as an input from the user.
This is not the long term proposal. This is only the Phase 1 plan, to align with 2.10 timelines, where index metadata is being backed up to remote for durability but local state is still the authority for cluster coordination. Today, quorum loss (QL) recovery is completely manual. During the QL recovery process the cluster UUID changes, even though it is otherwise immutable once the cluster is bootstrapped. The next phase proposal (mostly in 2.11) is to have an auto restore flow where cluster manager nodes will first pull the cluster state from remote and only then proceed with cluster bootstrap or QL recovery. There will only be one cluster UUID to ensure consistency.
Today, the index metadata restore will not proceed if an index with the same name already exists, which ensures there is no data loss. The user would have to delete existing indices with the same name before the indices can be restored from the remote store, which would be an atomic operation. The 2.10 release does not back up or restore other metadata like data streams and aliases, which can anyway impact search/ingestion for users relying on data streams or aliases as an abstraction.
Basically the flow would be:
I thought more about how to avoid taking the ClusterUUID as input in the restore API for the 2.10 launch; it should anyway go away with auto restore, etc.
To give a bit of background on ClusterUUID: the ClusterUUID is the unique identity of the cluster. It is generated when the cluster is bootstrapped successfully for the first time. Also, every time the quorum of cluster manager nodes is changed forcefully (not automatically), a new ClusterUUID is generated to indicate this. This is done to ensure correctness. e.g. n1, n2, n3 are 3 nodes in the cluster. Due to hardware failure, n2 and n3 are replaced and n1 is the only node left behind; this cluster has lost its quorum. Now n4 and n5 come up (in place of n2 and n3), and n1, n4 and n5 form the cluster forcefully with manual intervention, generating a new ClusterUUID. If the ClusterUUID were kept the same, it would break the invariant that in order to change the quorum automatically, you need a quorum of nodes to approve the change; i.e. in the previous example, in order to remove n2 and n3 from the quorum, those nodes would have to be alive, which is not possible when nodes go down in an unplanned way. Hence, the ClusterUUID is changed during quorum loss (QL) recovery to indicate that the quorum was changed forcefully.
Now, when a cluster has gone through multiple QL recoveries, multiple ClusterUUIDs would end up in the remote store. During automatic restore, the system doesn't know which ClusterUUID to use. One option is to use simple timestamps, but clocks can be skewed across nodes and hence cannot alone decide which is the latest ClusterUUID.
Each ClusterUUID can also record the previous ClusterUUID in its marker/manifest files. But this is not enough to find the latest UUID, as there can be race conditions where multiple UUIDs end up in remote with the same previous ClusterUUID. In that case, we need a deterministic algorithm to resolve the race condition. e.g.
The proposed algorithm to find the latest clusterUUID during automatic restore of metadata from remote state is:
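As a rough illustration only (not the exact proposed steps), the sketch below shows one way a previous-ClusterUUID chain could be resolved: the latest UUID is the one that no other manifest points to as its predecessor, and anything else indicates a race that needs a deterministic tie-breaker. The class and method names are hypothetical.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

/**
 * Rough sketch only (not the exact algorithm in the proposal): resolve the latest
 * ClusterUUID from manifests that record their previous ClusterUUID.
 */
public final class ClusterUuidChainResolver {

    public static String resolveLatest(Map<String, String> previousUuidByUuid) {
        // UUIDs that appear as someone's "previous" cannot be the latest.
        Set<String> referencedAsPrevious = new HashSet<>(previousUuidByUuid.values());
        List<String> heads = previousUuidByUuid.keySet().stream()
            .filter(uuid -> !referencedAsPrevious.contains(uuid))
            .collect(Collectors.toList());
        if (heads.size() != 1) {
            // Multiple UUIDs sharing the same previous ClusterUUID (or a broken chain):
            // the deterministic resolution of this race is not covered by this sketch.
            throw new IllegalStateException("Cannot determine latest ClusterUUID, candidates: " + heads);
        }
        return heads.get(0);
    }
}
```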
Any limits to how large a cluster state can get? Is there a future with infinite shards in a cluster, since remote storage can theoretically support that?
@sathishbaskar As the first step, we are using remote storage for providing durability to OpenSearch clusters. There will be no cluster state or data loss with remote storage enabled. It applies to all cluster configurations including single node clusters.
Yes, this will evolve into scaling benefits: for example, index metadata will be lazily downloaded on the data nodes on demand when an index is assigned to them. This will help in scaling metadata across a large number of nodes. Scaling to infinite shards requires more optimizations around how shards are handled via the RoutingTable in cluster state, beyond metadata. e.g. https://github.com/opensearch-project/OpenSearch/issues/8098 is one such improvement, and more are in the pipeline.
Overview
OpenSearch currently stores data locally on the storage attached to the data nodes. With remote storage coming into the picture (#1968), the indexed data will be published to a blob store configured by the user, such as Amazon S3, OCI Object Storage, Azure Blob Storage or GCP Cloud Storage. In case of data node failures, the data can be restored from the remote store, thereby guaranteeing durability of data. But if a majority of cluster-manager nodes fail, the cluster state is lost; it contains the metadata of all indices along with information about the latest copy of each shard. A stale cluster state can in turn also cause data loss. So, in order to guarantee complete durability of the cluster, the cluster state should also be published to the remote store, which provides higher durability. During disaster recovery, the cluster metadata will be recovered first, followed by the indexed data from the remote store.
Terminology
Quorum - The majority of cluster manager nodes, defined as n/2+1, which participate in election and cluster state updates.
Cluster UUID - When the cluster is bootstrapped for the first time, a new UUID is generated. It remains the identity of the cluster until there is a permanent loss of quorum (due to hardware issues etc.), in which case a new cluster UUID is generated during recovery.
Term - A monotonically increasing number starting from 1. Every new election in the cluster increases the term by 1.
Version - Every cluster state change is accounted for by a change in version. It is also a monotonically increasing number starting from 1. Every state change is executed sequentially.
Publish and Commit - OpenSearch uses a 2-phase commit to update the state. The active cluster manager broadcasts the new state to all the nodes in the cluster; this is the "publish" phase. Once it has received acknowledgements from a quorum of cluster manager nodes, it starts the "commit" phase and again broadcasts to all the nodes to apply the new state. It is the latest applied state which is used on the nodes during indexing/search, or any other code flow.
Voting Configuration - Defines which cluster manager nodes can participate in quorum. Any change to the voting configuration also requires a quorum of nodes. It is persisted on disk. There is lastAcceptedConfiguration, the configuration for which the last state was published, and lastCommittedConfiguration, the configuration for which the last state was committed.
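As a small illustration of the term/version semantics above (my sketch, not project code): a state with a higher term is more recent, and for equal terms the higher version wins. This is the ordering later used to pick "the cluster state with the highest term+version".

```java
import java.util.Comparator;

/**
 * Illustrative only (not an OpenSearch class): orders cluster states by (term, version),
 * matching the monotonic semantics described above.
 */
public final class StateId {
    public final long term;
    public final long version;

    public StateId(long term, long version) {
        this.term = term;
        this.version = version;
    }

    /** Higher term wins; on equal terms, higher version wins. */
    public static final Comparator<StateId> RECENCY =
        Comparator.<StateId>comparingLong(s -> s.term).thenComparingLong(s -> s.version);
}
```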
Vision
The remote cluster state would provide higher durability guarantees. This would eventually pave the way to make OpenSearch stateless where cluster manager nodes would only use remote state as the source of truth.
Proposed Solution
Today, cluster state is treated as a single atomic unit. Any changes to the state are processed in sequential order. But we want to break it into individual metadata resource types/entities to make it scalable. e.g. index mappings and settings are stored in IndexMetadata, which would be treated as a separate metadata type. Similarly, there is other information like cluster settings, templates, data-stream metadata etc. which can be treated as separate entities.
We are proposing the storage of metadata as SMILE format files in remote blob storage. The cluster metadata can be broken into smaller entities, for example one object for each index metadata. Since the cluster metadata is now broken into smaller entities, there should be a consolidated view which contains a hash table of the unique names of the entities (like index names) and the blob store key for the entity metadata (like index metadata). We can call this a marker file. This marker file should be uploaded last, once all the granular metadata is uploaded, in order to ensure atomicity. A cluster state would be considered published only when the marker file is present. The marker file will have a unique identifier which should be created using the cluster term, cluster state version and cluster UUID. We would add the timestamp to this identifier to prevent conflicts. When a new cluster state is published, we will only upload the new/updated component files, which means that if an index's metadata has not changed, its file will not be uploaded again. The marker file will be recreated using all entity metadata file names, whether they were changed or not. The upload to blob storage would be done before sending the cluster state publish request to all nodes over the transport layer. With this proposed solution, it will be easier to evolve in future, for example to separate out the index metadata or the template metadata from cluster state. Also, it will allow nodes to selectively download the relevant component metadata instead of downloading the entire cluster state.
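The sketch below illustrates the "upload changed entities first, marker file last" ordering described above. The BlobStore interface, path conventions and serialization here are hypothetical stand-ins, not the actual implementation.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

/**
 * Illustrative sketch of the proposed upload ordering: entity files first,
 * marker/manifest file last. All names and path formats are assumptions.
 */
public final class RemoteStateWriterSketch {

    /** Minimal stand-in for a blob store client. */
    interface BlobStore {
        void putObject(String key, byte[] bytes);
    }

    public void publish(BlobStore blobStore, String clusterUUID, long term, long version,
                        Map<String, byte[]> changedIndexMetadata,   // index name -> serialized metadata (e.g. SMILE)
                        Map<String, String> unchangedIndexPaths) {  // index name -> already-uploaded blob key
        Map<String, String> indexPathsForManifest = new HashMap<>(unchangedIndexPaths);

        // 1. Upload only the index metadata entities that actually changed.
        for (Map.Entry<String, byte[]> entry : changedIndexMetadata.entrySet()) {
            String key = clusterUUID + "/index/" + entry.getKey() + "/" + term + "-" + version;
            blobStore.putObject(key, entry.getValue());
            indexPathsForManifest.put(entry.getKey(), key);
        }

        // 2. Upload the marker/manifest last, listing every entity (changed or not).
        //    The state is considered published remotely only once this object exists.
        String manifestKey = clusterUUID + "/manifest/" + term + "-" + version + "-" + System.currentTimeMillis();
        blobStore.putObject(manifestKey, serializeManifest(indexPathsForManifest));
    }

    private byte[] serializeManifest(Map<String, String> indexPaths) {
        // Placeholder: the real manifest would be a structured (e.g. SMILE) document.
        return indexPaths.toString().getBytes(StandardCharsets.UTF_8);
    }
}
```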
Disaster Recovery/Restore
Today, the state resides on the local disk of cluster manager nodes. Any time we lose a quorum (majority) of cluster manager nodes, it can cause data loss in the cluster and there is no way to restore the exact same state deterministically. The viable options are best-effort recovery from the current state of a surviving node, or restore from snapshot. With remote cluster state, quorum loss shouldn't result in data loss at all. The cluster manager writes the state directly to remote during any updates, and each node downloads the state from remote during bootstrapping and on every election. This ensures a durable and consistent cluster state. During disaster recovery, when cluster manager nodes come up they directly read the latest state from remote and elect an active cluster-manager amongst them. Eventually, we can move to a primary-backup mode via remote leasing where there are only two cluster-manager nodes (primary and backup). If the primary node goes down, the backup node should automatically become the primary cluster-manager. However, this would introduce a dependency on the remote store, and with it the related failure scenarios, so we need to ensure the remote service is highly available.
Considerations
Atomicity and Strict Ordering
We need to ensure that the full cluster metadata is published only when all the entity metadata is persisted. This will be done by using the marker file as per the solution proposed above. The changes should be persisted in a strict order. There should be no out-of-order updates.
Durability
Durability of cluster metadata depends on the durability of the configured remote store. The user can choose a remote store based on their durability requirements.
Consistency
The active cluster-manager node will write to the remote store first and then apply the state in memory when all the nodes have responded to the apply commit request. So the in-memory cluster state in all the nodes will be eventually consistent with the remote state. Any metadata which is written to remote store should not be lost.
Performance
We need to upload the cluster metadata in a synchronous call in order to ensure that the metadata is completely persisted to the remote store before it is sent to other nodes over the wire. If there is any problem in publishing to the remote store, we would fail the cluster state publication. Since the metadata will be uploaded synchronously, it will increase the latency of the cluster state publish process. We will try to keep this increase in latency in check by uploading the files in parallel.
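A minimal sketch of this idea, assuming plain JDK concurrency primitives (the actual implementation details are not specified here): blobs are uploaded in parallel, but publication only proceeds once every upload has succeeded, and any failure or timeout fails the publication.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/**
 * Illustrative only: upload all metadata blobs in parallel, but block until every
 * upload has finished before the cluster state publication proceeds.
 */
public final class ParallelUploadSketch {

    public static void uploadAllOrFail(List<Runnable> uploadTasks, long timeoutMillis) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, Math.min(uploadTasks.size(), 8)));
        try {
            CompletableFuture<?>[] futures = uploadTasks.stream()
                .map(task -> CompletableFuture.runAsync(task, pool))
                .toArray(CompletableFuture[]::new);
            // Block until all uploads finish; propagate the first failure or time out.
            CompletableFuture.allOf(futures).get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            throw new IllegalStateException("Remote upload timed out; failing cluster state publication", e);
        } finally {
            pool.shutdownNow();
        }
    }
}
```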
Breakdown of Cluster State
The metadata in the cluster state should be broken down into smaller logical metadata entities. The data nodes can then fetch the metadata units as needed. Some of the units identified are below:
Metadata Scaling
Today, all the nodes download the entire cluster state. This causes bottlenecks with respect to JVM usage, transport layer bandwidth and disk I/O. As we break down the cluster state into smaller metadata entities, the nodes can download only the relevant metadata on demand; for example, a data node would load the metadata of only the indices whose shards are present on that particular node. Also, the cluster-manager would only share the version of the cluster state with the other nodes over the transport layer, and the corresponding node can decide what entity metadata to download. This will improve the scalability of the cluster and the metadata will become more cloud-native friendly.
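For illustration, a minimal sketch of on-demand metadata download on a data node, assuming a hypothetical fetch function for remote index metadata (none of these names are from the actual code):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/**
 * Illustrative sketch: a data node only fetches (and caches) the metadata of
 * indices whose shards it actually hosts. The remote fetch function is a
 * hypothetical stand-in.
 */
public final class LazyIndexMetadataCache {

    private final Map<String, byte[]> cache = new ConcurrentHashMap<>();
    private final Function<String, byte[]> remoteFetch; // index name -> serialized IndexMetadata blob

    public LazyIndexMetadataCache(Function<String, byte[]> remoteFetch) {
        this.remoteFetch = remoteFetch;
    }

    /** Called when a shard of the given index is assigned to this node. */
    public byte[] metadataFor(String indexName) {
        return cache.computeIfAbsent(indexName, remoteFetch);
    }
}
```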
File Format
Currently the cluster metadata is persisted to local disk in the SMILE format. SMILE has a few benefits over JSON in terms of serialization; the format lowers the size of data sent over the network compared to JSON. We would also be able to reuse the existing serialization logic.
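For a quick feel of the format, SMILE can be produced with Jackson's jackson-dataformat-smile module; note that OpenSearch itself serializes metadata through its XContent layer, so this is only an illustration of the encoding, not the project's code path.

```java
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.dataformat.smile.SmileFactory;
import java.util.Map;

/**
 * Illustration of SMILE as a compact binary drop-in for JSON using Jackson.
 * The sample field names are made up for demonstration.
 */
public final class SmileExample {
    public static void main(String[] args) throws Exception {
        ObjectMapper smileMapper = new ObjectMapper(new SmileFactory());
        Map<String, Object> indexMetadata = Map.of(
            "index", "logs-2023-08",
            "number_of_shards", 5,
            "number_of_replicas", 1
        );
        byte[] smileBytes = smileMapper.writeValueAsBytes(indexMetadata);   // compact binary encoding
        Map<?, ?> roundTripped = smileMapper.readValue(smileBytes, Map.class);
        System.out.println("SMILE size: " + smileBytes.length + " bytes, round-trip: " + roundTripped);
    }
}
```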
Phase 1 (OS 2.10)
Remote Cluster State
As part of Phase 1, cluster state remote backup will be supported. The cluster state is persisted both on local disk and in remote (as a backup). The cluster manager node will use the remote state on demand when the restore API is triggered. In this phase, only IndexMetadata in the cluster state will be backed up.
In upcoming phases, this support will be extended to the other entities in the cluster state listed above. Eventually, remote cluster state will not remain just a backup and will serve as the only persistent cluster state source. In the remote-only persistence mode, the nodes will only maintain in-memory state and will always rely on remote state as the source of truth.
High Level Publish/ApplyCommit Process
Disaster Recovery/Restore
Remote Store Restore currently works assuming index metadata is present in the cluster state. Since IndexMetadata will now be available in remote in Phase 1, we can use it to recover indices which have been removed from the cluster state after disaster recovery.
In Phase 1, we add support to manually restore cluster state via the existing Remote Store Restore API. Support for auto restore from remote will be added in upcoming phases.
High Level Restore Flow
Reconciliation
The local state on the disk and remote state will be continuously reconciled to ensure correctness. Pending further details.
[Modify] Remote Store Restore API
A new mandatory param cluster-uuid will be added to the _remotestore/_restore API. When cluster-uuid is passed, the entire cluster state will be restored. In the first phase, we will be backing up only index metadata, so only index metadata will be restored. Internally, the cluster state with the highest term+version for the cluster UUID provided in the request will be fetched for restoration, and the entire index metadata will be restored, not just a few selected indices.
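For illustration, a call to the proposed API via the low-level REST client might look like the sketch below. The parameter name and request shape follow the Phase 1 proposal above and may change (for example, if the cluster-uuid input is dropped as discussed in the comments).

```java
import org.apache.http.HttpHost;
import org.opensearch.client.Request;
import org.opensearch.client.Response;
import org.opensearch.client.RestClient;

/**
 * Hedged illustration of invoking the proposed restore API with the low-level
 * REST client. The cluster-uuid parameter is part of the proposal, not a final API.
 */
public final class RemoteStateRestoreExample {
    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build()) {
            Request request = new Request("POST", "/_remotestore/_restore");
            request.addParameter("cluster-uuid", "REPLACE_WITH_CLUSTER_UUID"); // proposed mandatory param
            Response response = client.performRequest(request);
            System.out.println(response.getStatusLine());
        }
    }
}
```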
Next Steps
On the basis of the feedback from the community, we will create a detailed design for this feature.
FAQs
1. What is cluster state and what does it contain ? Cluster state is an object that keeps track of a lot of current state about the cluster. It is managed by the active cluster-manager node. Any creation of an index or data stream first updates the information in the cluster state. The cluster state appliers listen to changes in the cluster state and call appropriate actions to apply them on the cluster. The cluster state is loaded in memory on every node and also persisted on disk as a Lucene index. The cluster state includes the following information (not exhaustive):
2. How is cluster state persisted currently ? PersistedClusterStateService has the logic for persisting the cluster state to disk and reading it back. It uses the NIOFSDirectory implementation of the Directory interface to set up a local directory. A Lucene IndexWriter uses this Directory to create a Lucene index and add documents to it. Although the cluster state contains a lot of information, only the metadata object in the cluster state is persisted on disk. This is probably because the routing table and routing nodes information can be reconstructed on the basis of the nodes and shards present in the cluster. The metadata is stored in a Lucene index: the global metadata is stored in one Lucene document while each index metadata is stored in its own Lucene document (reference). This is used by cluster-manager-eligible nodes to record the last-accepted cluster state during publication. The metadata is written incrementally where possible, leaving alone any documents that have not changed.
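A stripped-down illustration of the mechanism described above (not the actual PersistedClusterStateService code): metadata documents written to a local Lucene index through an IndexWriter over NIOFSDirectory. The field names and document layout are illustrative, not the real schema.

```java
import java.nio.file.Path;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.NIOFSDirectory;

/**
 * Illustrative sketch: persist an index metadata blob as a document in a local
 * Lucene index via NIOFSDirectory, in the spirit of PersistedClusterStateService.
 */
public final class LocalMetadataIndexSketch {

    public static void persistIndexMetadata(Path dataPath, String indexUUID, byte[] serializedMetadata) throws Exception {
        try (NIOFSDirectory directory = new NIOFSDirectory(dataPath);
             IndexWriter writer = new IndexWriter(directory, new IndexWriterConfig())) {

            Document doc = new Document();
            // Key the document by index UUID so it can be replaced when it changes.
            doc.add(new StringField("index_uuid", indexUUID, Field.Store.YES));
            doc.add(new StoredField("metadata", serializedMetadata));

            // updateDocument replaces any previous document with the same index_uuid,
            // leaving unchanged documents alone (incremental writes).
            writer.updateDocument(new Term("index_uuid", indexUUID), doc);
            writer.commit();
        }
    }
}
```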