Open soosinha opened 1 year ago
Tagging @dblock @andrross @itiyamas @sohami @Bukhtawar for feedback
@soosinha @shwetathareja We've got an issue discussing a safety/correctness checker: #3866 Have you looked at creating (or updating the AL2.0-licensed models from pre-fork Elasticsearch) any kind of formal model? This is critical stuff and easy to get wrong in subtle ways and a formal model could give confidence that we're not introducing correctness issues.
This would eventually pave the way to make OpenSearch stateless where cluster manager nodes would only use remote state as the source of truth.
I could be wrong, but I suspect we'll have a hard time building a stateless solution on top of the object store-based repository interface given the object stores aren't really designed for doing things like a strongly consistent check-and-set operation. Maybe we don't need to address this just yet, but I'm thinking we may eventually need another abstraction over a consistent data store for something like this.
Thanks @andrross for the feedback.
Yes, we plan to run the existing TLA+ model for correctness. For Phase 1, it shouldn't require any changes since the cluster is running in remote backup mode. For the subsequent phases, when it runs in remote-only mode, we will update the model accordingly. Thanks for the callout. @soosinha this should be called out explicitly.
I could be wrong, but I suspect we'll have a hard time building a stateless solution on top of the object store-based repository interface given the object stores aren't really designed for doing things like a strongly consistent check-and-set operation. Maybe we don't need to address this just yet, but I'm thinking we may eventually need another abstraction over a consistent data store for something like this.
Yes, you are right, we will need a strongly consistent store with conditional updates once we want to achieve a stateless solution where we don't need 3 or more cluster manager nodes for quorum. The intention is to first provide durability using the same object store (with minimal friction, instead of adding a new repository type) that is used for shard (segment/translog) storage. The cluster manager still depends on the existing coordination code for quorum and correctness, while using remote storage instead of local storage for metadata management for better durability. Eventually, it should move away from any local coordination and work only with the remote store. At that point, it will not need 3 cluster manager nodes and can work with a leader + backup (for high availability). We will also need to handle consistency of in-memory data structures like shard routing information, in-progress snapshots, etc., which are not persisted on disk today.
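To make the "strongly consistent store with conditional updates" idea concrete, here is a hypothetical Java sketch of the kind of abstraction such a store would expose. This interface is purely illustrative and not part of the proposal; the names and shape are assumptions.

```java
import java.util.Optional;

/**
 * Hypothetical sketch (not part of the proposal): an abstraction over a strongly
 * consistent store supporting conditional (compare-and-set) updates, the kind of
 * primitive a fully stateless cluster manager would likely need.
 */
public interface ConsistentMetadataStore {

    /** A stored value together with the version used for optimistic concurrency. */
    final class VersionedValue {
        public final long version;
        public final byte[] value;

        public VersionedValue(long version, byte[] value) {
            this.version = version;
            this.value = value;
        }
    }

    /** Returns the latest committed value for the given key, if any. */
    Optional<VersionedValue> get(String key);

    /**
     * Writes newValue only if the stored version still equals expectedVersion.
     * Returns true if the conditional write succeeded, false if another writer won the race.
     */
    boolean compareAndSet(String key, long expectedVersion, byte[] newValue);
}
```

Plain object stores generally do not offer this compare-and-set semantic, which is why a separate abstraction (or a different backing service) may eventually be needed.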
We are going ahead with the initial implementation for 2.10 release. Meta issue for task list: https://github.com/opensearch-project/OpenSearch/issues/9344
A new mandatory param cluster-uuid will be added to the _remotestore/_restore API. When cluster-uuid is passed, the entire cluster state will be restored. In the first phase, we will be backing up only index metadata, so only index metadata will be restored. Internally, the cluster state with the highest term+version for the cluster UUID provided in the request will be fetched for restoration, and the entire index metadata will be restored, not just a few selected indices.
I don't think that asking the users to maintain a cluster-uuid is reasonable; we are effectively punting the metadata durability semantics to the user rather than maintaining them in the remote store and ensuring it is the sole custodian. This will lead to races in the system which might lead to bugs/data loss if the user creates or modifies metadata between resurrecting the cluster and restoring with a cluster UUID. I would suggest avoiding taking the cluster UUID as an input from the user.
This will lead to races in the system which might lead to bugs/data loss if the user creates or modifies metadata between resurrecting the cluster and restoring with a cluster UUID. I would suggest avoiding taking the cluster UUID as an input from the user.
This is not the long term proposal. This is only the Phase 1 plan, to align with 2.10 timelines, where index metadata is being backed up to remote for durability but local state is still the authority for cluster coordination. Today, quorum loss (QL) recovery is completely manual. During the QL recovery process the cluster UUID changes, even though it is otherwise immutable once the cluster is bootstrapped. The next phase proposal (mostly in 2.11) is to have an auto restore flow where cluster manager nodes will first pull the cluster state from remote and only then proceed with cluster bootstrap or QL recovery. There will only be one cluster UUID to ensure consistency.
Today, the index metadata restore will not proceed if an index with the same name already exists, which ensures there is no data loss. The user would have to delete existing indices with the same name before the indices can be restored from the remote store, which would be an atomic operation. The 2.10 release does not back up or restore other metadata like data streams and aliases, which can anyway impact search/ingestion for users relying on data streams or aliases as an abstraction.
Basically the flow would be:
I thought more about how to avoid taking the ClusterUUID as input in the restore API for the 2.10 launch; it should anyway go away with auto restore, etc.
To give a bit of background on ClusterUUID: the ClusterUUID is the unique identity of the cluster. It is generated when the cluster is bootstrapped successfully for the first time. Also, every time the quorum of cluster manager nodes is changed forcefully (not automatically), a new ClusterUUID is generated to indicate this. This is done to ensure correctness. e.g. n1, n2, n3 are 3 nodes in the cluster. Due to hardware failure, n2 and n3 are replaced and n1 is the only node left behind; this cluster has lost its quorum. Now n4 and n5 come up (in place of n2 and n3), and n1, n4 and n5 form the cluster forcefully with manual intervention, generating a new ClusterUUID. If the ClusterUUID were kept the same, it would break the invariant that in order to change the quorum automatically, you need a quorum of nodes to approve the change; i.e. in the previous example, in order to remove n2 and n3 from the quorum, those nodes would have to be alive, which is not possible when nodes go down in an unplanned way. Hence, the ClusterUUID is changed during quorum loss (QL) recovery to indicate that the quorum was changed forcefully.
Now, when a cluster has gone through multiple QL recoveries, multiple ClusterUUIDs would end up in the remote store. During automatic restore, the system doesn't know which ClusterUUID to use. One option is to use simple timestamps, but clocks can be skewed across nodes and hence cannot alone decide which is the latest ClusterUUID.
Each ClusterUUID can also record the previous ClusterUUID in its marker/manifest files. But this is not enough to find the latest UUID, as there can be race conditions where multiple UUIDs end up in remote with the same previous ClusterUUID. In that case, we need a deterministic algorithm to resolve the race condition. e.g.
The proposed algorithm to find the latest clusterUUID during automatic restore of metadata from remote state is:
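As a rough illustration only (not the exact proposed steps), the sketch below shows one way a previous-ClusterUUID chain could be resolved: the latest UUID is the one that no other manifest points to as its predecessor, and anything else indicates a race that needs a deterministic tie-breaker. The class and method names are hypothetical.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

/**
 * Rough sketch only (not the exact algorithm in the proposal): resolve the latest
 * ClusterUUID from manifests that record their previous ClusterUUID.
 */
public final class ClusterUuidChainResolver {

    public static String resolveLatest(Map<String, String> previousUuidByUuid) {
        // UUIDs that appear as someone's "previous" cannot be the latest.
        Set<String> referencedAsPrevious = new HashSet<>(previousUuidByUuid.values());
        List<String> heads = previousUuidByUuid.keySet().stream()
            .filter(uuid -> !referencedAsPrevious.contains(uuid))
            .collect(Collectors.toList());
        if (heads.size() != 1) {
            // Multiple UUIDs sharing the same previous ClusterUUID (or a broken chain):
            // the deterministic resolution of this race is not covered by this sketch.
            throw new IllegalStateException("Cannot determine latest ClusterUUID, candidates: " + heads);
        }
        return heads.get(0);
    }
}
```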
Any limits to how large a cluster state can get? Is there a future with infinite shards in a cluster, since remote storage can theoretically support that?
@sathishbaskar As the first step, we are using remote storage for providing durability to OpenSearch clusters. There will be no cluster state or data loss with remote storage enabled. It applies to all cluster configurations including single node clusters.
Yes, this will evolve into scaling benefits: for example, index metadata will be lazily downloaded on the data nodes on demand when an index is assigned to them. This will help in scaling metadata across a large number of nodes. Scaling to infinite shards requires more optimizations around how shards are handled via the RoutingTable in cluster state, beyond metadata. e.g. https://github.com/opensearch-project/OpenSearch/issues/8098 is one such improvement, and more are in the pipeline.
Overview
OpenSearch currently stores data locally on the storage attached to the data nodes. With remote storage coming into the picture (#1968), the indexed data will be published to a blob store configured by the user, such as Amazon S3, OCI Object Storage, Azure Blob Storage or GCP Cloud Storage. In case of data node failures, the data can be restored from the remote store, thereby guaranteeing durability of data. But if a majority of cluster-manager nodes fail, the cluster state is lost; it contains the metadata of all indices along with information about the latest copy of each shard. A stale cluster state can in turn also cause data loss. So, in order to guarantee complete durability of the cluster, the cluster state should also be published to the remote store, which provides higher durability. During disaster recovery, the cluster metadata will be recovered first, followed by the indexed data from the remote store.
Terminology
Quorum - The majority of cluster manager nodes, defined as n/2+1, which participate in election and cluster state updates.
Cluster UUID - When the cluster is bootstrapped for the first time, a new UUID is generated. It remains the identity of the cluster until there is a permanent loss of quorum (due to hardware issues etc.), in which case a new cluster UUID is generated during recovery.
Term - A monotonically increasing number starting from 1. Every new election in the cluster increases the term by 1.
Version - Every cluster state change is accounted for by a change in version. It is also a monotonically increasing number starting from 1. Every state change is executed sequentially.
Publish and Commit - OpenSearch uses a 2-phase commit to update the state. The active cluster manager broadcasts the new state to all the nodes in the cluster; this is the "publish" phase. Once it has received acknowledgements from a quorum of cluster manager nodes, it starts the "commit" phase and again broadcasts to all the nodes to apply the new state. It is the latest applied state which is used on the nodes during indexing/search, or any other code flow.
Voting Configuration - Defines which cluster manager nodes can participate in quorum. Any change to the voting configuration also requires a quorum of nodes. It is persisted on disk. There is lastAcceptedConfiguration, the configuration for which the last state was published, and lastCommittedConfiguration, the configuration for which the last state was committed.
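As a small illustration of the term/version semantics above (my sketch, not project code): a state with a higher term is more recent, and for equal terms the higher version wins. This is the ordering later used to pick "the cluster state with the highest term+version".

```java
import java.util.Comparator;

/**
 * Illustrative only (not an OpenSearch class): orders cluster states by (term, version),
 * matching the monotonic semantics described above.
 */
public final class StateId {
    public final long term;
    public final long version;

    public StateId(long term, long version) {
        this.term = term;
        this.version = version;
    }

    /** Higher term wins; on equal terms, higher version wins. */
    public static final Comparator<StateId> RECENCY =
        Comparator.<StateId>comparingLong(s -> s.term).thenComparingLong(s -> s.version);
}
```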
Vision
The remote cluster state would provide higher durability guarantees. This would eventually pave the way to make OpenSearch stateless where cluster manager nodes would only use remote state as the source of truth.
Proposed Solution
Today, cluster state is treated as a single atomic unit. Any changes to the state are processed in sequential order. But we want to break it into individual metadata resource types/entities to make it scalable. e.g. index mappings and settings are stored in IndexMetadata, which would be treated as a separate metadata type. Similarly, there is other information like cluster settings, templates, data-stream metadata etc. which can be treated as separate entities.
We are proposing the storage of metadata as SMILE format files in remote blob storage. The cluster metadata can be broken into smaller entities, for example one object for each index metadata. Since the cluster metadata is now broken into smaller entities, there should be a consolidated view which contains a hash table of the unique names of the entities (like index names) and the blob store key for the entity metadata (like index metadata). We can call this a marker file. This marker file should be uploaded last, once all the granular metadata is uploaded, in order to ensure atomicity. A cluster state would be considered published only when the marker file is present. The marker file will have a unique identifier which should be created using the cluster term, cluster state version and cluster UUID. We would add the timestamp to this identifier to prevent conflicts. When a new cluster state is published, we will only upload the new/updated component files, which means that if an index's metadata has not changed, its file will not be uploaded again. The marker file will be recreated using all entity metadata file names, whether they were changed or not. The upload to blob storage would be done before sending the cluster state publish request to all nodes over the transport layer. With this proposed solution, it will be easier to evolve in future, for example to separate out the index metadata or the template metadata from cluster state. Also, it will allow nodes to selectively download the relevant component metadata instead of downloading the entire cluster state.
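The sketch below illustrates the "upload changed entities first, marker file last" ordering described above. The BlobStore interface, path conventions and serialization here are hypothetical stand-ins, not the actual implementation.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

/**
 * Illustrative sketch of the proposed upload ordering: entity files first,
 * marker/manifest file last. All names and path formats are assumptions.
 */
public final class RemoteStateWriterSketch {

    /** Minimal stand-in for a blob store client. */
    interface BlobStore {
        void putObject(String key, byte[] bytes);
    }

    public void publish(BlobStore blobStore, String clusterUUID, long term, long version,
                        Map<String, byte[]> changedIndexMetadata,   // index name -> serialized metadata (e.g. SMILE)
                        Map<String, String> unchangedIndexPaths) {  // index name -> already-uploaded blob key
        Map<String, String> indexPathsForManifest = new HashMap<>(unchangedIndexPaths);

        // 1. Upload only the index metadata entities that actually changed.
        for (Map.Entry<String, byte[]> entry : changedIndexMetadata.entrySet()) {
            String key = clusterUUID + "/index/" + entry.getKey() + "/" + term + "-" + version;
            blobStore.putObject(key, entry.getValue());
            indexPathsForManifest.put(entry.getKey(), key);
        }

        // 2. Upload the marker/manifest last, listing every entity (changed or not).
        //    The state is considered published remotely only once this object exists.
        String manifestKey = clusterUUID + "/manifest/" + term + "-" + version + "-" + System.currentTimeMillis();
        blobStore.putObject(manifestKey, serializeManifest(indexPathsForManifest));
    }

    private byte[] serializeManifest(Map<String, String> indexPaths) {
        // Placeholder: the real manifest would be a structured (e.g. SMILE) document.
        return indexPaths.toString().getBytes(StandardCharsets.UTF_8);
    }
}
```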
Disaster Recovery/Restore
Today, the state resides on the local disk of cluster manager nodes. Any time we lose a quorum (majority) of cluster manager nodes, it can cause data loss in the cluster and there is no way to restore the exact same state deterministically. The viable options are best-effort recovery from the current state of a surviving node, or restore from snapshot. With remote cluster state, quorum loss shouldn't result in data loss at all. The cluster manager writes the state directly to remote during any updates, and each node downloads the state from remote during bootstrapping and on every election. This ensures a durable and consistent cluster state. During disaster recovery, when cluster manager nodes come up they directly read the latest state from remote and elect an active cluster-manager amongst them. Eventually, we can move to a primary-backup mode via remote leasing where there are only two cluster-manager nodes (primary and backup). If the primary node goes down, the backup node should automatically become the primary cluster-manager. However, this would introduce a dependency on the remote store, and with it the related failure scenarios, so we need to ensure the remote service is highly available.
Considerations
Atomicity and Strict Ordering
We need to ensure that the full cluster metadata is published only when all the entity metadata is persisted. This will be done by using the marker file as per the solution proposed above. The changes should be persisted in a strict order. There should be no out-of-order updates.
Durability
Durability of cluster metadata depends on the durability of the configured remote store. The user can choose a remote store based on their durability requirements.
Consistency
The active cluster-manager node will write to the remote store first and then apply the state in memory when all the nodes have responded to the apply commit request. So the in-memory cluster state in all the nodes will be eventually consistent with the remote state. Any metadata which is written to remote store should not be lost.
Performance
We need to upload the cluster metadata in a synchronous call in order to ensure that the metadata is completely persisted to the remote store before it is sent to other nodes over the wire. If there is any problem in publishing to the remote store, we would fail the cluster state publication. Since the metadata will be uploaded synchronously, it will increase the latency of the cluster state publish process. We will try to keep this increase in latency in check by uploading the files in parallel.
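A minimal sketch of this idea, assuming plain JDK concurrency primitives (the actual implementation details are not specified here): blobs are uploaded in parallel, but publication only proceeds once every upload has succeeded, and any failure or timeout fails the publication.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/**
 * Illustrative only: upload all metadata blobs in parallel, but block until every
 * upload has finished before the cluster state publication proceeds.
 */
public final class ParallelUploadSketch {

    public static void uploadAllOrFail(List<Runnable> uploadTasks, long timeoutMillis) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, Math.min(uploadTasks.size(), 8)));
        try {
            CompletableFuture<?>[] futures = uploadTasks.stream()
                .map(task -> CompletableFuture.runAsync(task, pool))
                .toArray(CompletableFuture[]::new);
            // Block until all uploads finish; propagate the first failure or time out.
            CompletableFuture.allOf(futures).get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            throw new IllegalStateException("Remote upload timed out; failing cluster state publication", e);
        } finally {
            pool.shutdownNow();
        }
    }
}
```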
Breakdown of Cluster State
The metadata in the cluster state should be broken down into smaller logical metadata entities. The data nodes can then fetch the metadata units as needed. Some of the units identified are below:
Metadata Scaling
Today, all the nodes download the entire cluster state. This causes bottlenecks with respect to JVM usage, transport layer bandwidth and disk I/O. As we break down the cluster state into smaller metadata entities, the nodes can download only the relevant metadata on demand; for example, a data node would load the metadata of only the indices whose shards are present on that particular node. Also, the cluster-manager would only share the version of the cluster state with the other nodes over the transport layer, and the corresponding node can decide what entity metadata to download. This will improve the scalability of the cluster and the metadata will become more cloud-native friendly.
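For illustration, a minimal sketch of on-demand metadata download on a data node, assuming a hypothetical fetch function for remote index metadata (none of these names are from the actual code):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/**
 * Illustrative sketch: a data node only fetches (and caches) the metadata of
 * indices whose shards it actually hosts. The remote fetch function is a
 * hypothetical stand-in.
 */
public final class LazyIndexMetadataCache {

    private final Map<String, byte[]> cache = new ConcurrentHashMap<>();
    private final Function<String, byte[]> remoteFetch; // index name -> serialized IndexMetadata blob

    public LazyIndexMetadataCache(Function<String, byte[]> remoteFetch) {
        this.remoteFetch = remoteFetch;
    }

    /** Called when a shard of the given index is assigned to this node. */
    public byte[] metadataFor(String indexName) {
        return cache.computeIfAbsent(indexName, remoteFetch);
    }
}
```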
File Format
Currently the cluster metadata is persisted to local disk in the SMILE format. SMILE has a few benefits over JSON in terms of serialization; the format lowers the size of data sent over the network compared to JSON. We would also be able to reuse the existing serialization logic.
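For a quick feel of the format, SMILE can be produced with Jackson's jackson-dataformat-smile module; note that OpenSearch itself serializes metadata through its XContent layer, so this is only an illustration of the encoding, not the project's code path.

```java
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.dataformat.smile.SmileFactory;
import java.util.Map;

/**
 * Illustration of SMILE as a compact binary drop-in for JSON using Jackson.
 * The sample field names are made up for demonstration.
 */
public final class SmileExample {
    public static void main(String[] args) throws Exception {
        ObjectMapper smileMapper = new ObjectMapper(new SmileFactory());
        Map<String, Object> indexMetadata = Map.of(
            "index", "logs-2023-08",
            "number_of_shards", 5,
            "number_of_replicas", 1
        );
        byte[] smileBytes = smileMapper.writeValueAsBytes(indexMetadata);   // compact binary encoding
        Map<?, ?> roundTripped = smileMapper.readValue(smileBytes, Map.class);
        System.out.println("SMILE size: " + smileBytes.length + " bytes, round-trip: " + roundTripped);
    }
}
```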
Phase 1 (OS 2.10)
Remote Cluster State
As part of Phase 1, cluster state remote backup will be supported. The cluster state is persisted both on local disk and in remote (as a backup). The cluster manager node will use the remote state on demand when the restore API is triggered. In this phase, only IndexMetadata in the cluster state will be backed up.
In upcoming phases, this support will be extended to the other entities in the cluster state listed above. Eventually, remote cluster state will not remain just a backup and will serve as the only persistent cluster state source. In the remote-only persistence mode, the nodes will only maintain in-memory state and will always rely on remote state as the source of truth.
High Level Publish/ApplyCommit Process
Disaster Recovery/Restore
Remote Store Restore currently works assuming index metadata is present in the cluster state. Since IndexMetadata will now be available in remote in Phase 1, we can use it to recover indices which have been removed from the cluster state after disaster recovery.
In Phase 1, we add support to manually restore cluster state via the existing Remote Store Restore API. Support for auto restore from remote will be added in upcoming phases.
High Level Restore Flow
Reconciliation
The local state on the disk and remote state will be continuously reconciled to ensure correctness. Pending further details.
[Modify] Remote Store Restore API
A new mandatory param cluster-uuid will be added to the _remotestore/_restore API. When cluster-uuid is passed, the entire cluster state will be restored. In the first phase, we will be backing up only index metadata, so only index metadata will be restored. Internally, the cluster state with the highest term+version for the cluster UUID provided in the request will be fetched for restoration, and the entire index metadata will be restored, not just a few selected indices.
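For illustration, a call to the proposed API via the low-level REST client might look like the sketch below. The parameter name and request shape follow the Phase 1 proposal above and may change (for example, if the cluster-uuid input is dropped as discussed in the comments).

```java
import org.apache.http.HttpHost;
import org.opensearch.client.Request;
import org.opensearch.client.Response;
import org.opensearch.client.RestClient;

/**
 * Hedged illustration of invoking the proposed restore API with the low-level
 * REST client. The cluster-uuid parameter is part of the proposal, not a final API.
 */
public final class RemoteStateRestoreExample {
    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build()) {
            Request request = new Request("POST", "/_remotestore/_restore");
            request.addParameter("cluster-uuid", "REPLACE_WITH_CLUSTER_UUID"); // proposed mandatory param
            Response response = client.performRequest(request);
            System.out.println(response.getStatusLine());
        }
    }
}
```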
Next Steps
On the basis of the feedback from the community, we will create a detailed design for this feature.
FAQs
1. What is cluster state and what does it contain ? Cluster state is an object that keeps track of a lot of current state about the cluster. It is managed by the active cluster-manager node. Any creation of an index or data stream first updates the information in the cluster state. The cluster state appliers listen to changes in the cluster state and call appropriate actions to apply them on the cluster. The cluster state is loaded in memory on every node and also persisted on disk as a Lucene index. The cluster state includes the following information (not exhaustive):
2. How is cluster state persisted currently ? PersistedClusterStateService has the logic for persisting the cluster state to disk and reading it back. It uses the NIOFSDirectory implementation of the Directory interface to set up a local directory. A Lucene IndexWriter uses this Directory to create a Lucene index and add documents to it. Although the cluster state contains a lot of information, only the metadata object in the cluster state is persisted on disk. This is probably because the routing table and routing nodes information can be reconstructed on the basis of the nodes and shards present in the cluster. The metadata is stored in a Lucene index: the global metadata is stored in one Lucene document while each index metadata is stored in its own Lucene document (reference). This is used by cluster-manager-eligible nodes to record the last-accepted cluster state during publication. The metadata is written incrementally where possible, leaving alone any documents that have not changed.
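A stripped-down illustration of the mechanism described above (not the actual PersistedClusterStateService code): metadata documents written to a local Lucene index through an IndexWriter over NIOFSDirectory. The field names and document layout are illustrative, not the real schema.

```java
import java.nio.file.Path;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.NIOFSDirectory;

/**
 * Illustrative sketch: persist an index metadata blob as a document in a local
 * Lucene index via NIOFSDirectory, in the spirit of PersistedClusterStateService.
 */
public final class LocalMetadataIndexSketch {

    public static void persistIndexMetadata(Path dataPath, String indexUUID, byte[] serializedMetadata) throws Exception {
        try (NIOFSDirectory directory = new NIOFSDirectory(dataPath);
             IndexWriter writer = new IndexWriter(directory, new IndexWriterConfig())) {

            Document doc = new Document();
            // Key the document by index UUID so it can be replaced when it changes.
            doc.add(new StringField("index_uuid", indexUUID, Field.Store.YES));
            doc.add(new StoredField("metadata", serializedMetadata));

            // updateDocument replaces any previous document with the same index_uuid,
            // leaving unchanged documents alone (incremental writes).
            writer.updateDocument(new Term("index_uuid", indexUUID), doc);
            writer.commit();
        }
    }
}
```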