opensearch-project / OpenSearch

🔎 Open source distributed and RESTful search engine.
https://opensearch.org/docs/latest/opensearch/index/
Apache License 2.0
9.76k stars 1.82k forks source link

[RFC] Cluster State Publication using Remote State #13257

Closed soosinha closed 5 months ago

soosinha commented 6 months ago

Background

As part of the initial changes in remote cluster state, we are uploading the cluster metadata to remote store during every cluster state publication. This remote cluster metadata gets utilized by the cluster manager nodes during recovery scenario. But there is no change to way the cluster state is published to other nodes in the cluster. The cluster manager still sends the full or diff cluster state to data nodes and other cluster manager nodes over transport layer. This increases the load on the cluster manager mainly since the cluster manager has to send cluster state objects to all the nodes over transport layer for every cluster state update. This becomes one of the bottlenecks in increasing the size of the cluster in terms number of indices, number of shards and number of nodes as any change in these parameters tends to increase the size of the cluster state.

https://github.com/opensearch-project/OpenSearch/issues/11744

Proposal

Instead of sending the cluster state object over the transport layer, the cluster manager node can just send the term and version of the cluster state over the transport layer which can then be used to download the corresponding cluster state by all the nodes directly from the remote store.

Related component

Cluster Manager

Additional context

Cluster State Persistence vs Publication

Before we get into the mechanism to publish the cluster state using remote store, we should understand the some differences between the features:

Diff Computation of Cluster State in Existing Flow

The diff is computed in a different manner for each of the objects.

ClusterState {
  Metadata metadata {
    long version; // Latest data added to diff
    String clusterUUID; // Latest data added to diff
    boolean clusterUUIDcommitted; // Latest data added to diff
    CoordinationMetadata coordinationMetadata; // Latest data added to diff. But this
        // should not be huge it contains mostly the nodeIds for voting configuration 
        // and voting exclusions.
    Settings transientSettings; // Latest data added to diff
    Settings persistentSettings; // Latest data added to diff
    DiffableStringMap hashesOfConsistentSettings; // Diff is computed for the map by 
        // finding the inserts, updates and deletes to the map.
    Map<String, IndexMetadata> indices; // Diff is computed using DiffableUtils 
        // which adds the inserts, deletes and updates to the diff.
        // Most of the IndexMetadata is made of settings and mappings. 
        // The settings are not diffable and the diff for mappings is computed using CompleteDiff . 
        // CompleteDiff adds the entire object even if there is any small change in the object otherwise it adds empty object to the diff.
        // So if a single field is added to the mappings, entire mappings will be added to the diff since the mappings are compressed.
        // Size of Settings will generally be small but in cases where the in-line analyzers are present, the size of settings can be large.      
    Map<String, IndexTemplateMetadata> templates; // diff computed using DiffableUtils
        // But for IndexTemplateMetadata, diff is computed using CompleteDif.
        // So values in the upserts map of the diff will contain entire index templates which have been added or updated.

    Map<String, Custom> customs; // diff computed using DiffableUtils.
    // So the customs which are changed will be added to the upserts map in the diff and the customs which are deleted will be added to deletes list.
  }

  String stateUUID - from and to UUID are added in diff
  long version - from and to version added in diff

  RoutingTable routingTable {
    long version
    Map<String, IndexRoutingTable> indicesRouting; // diff computed using DiffableUtils
        // So the indices whose routing routing table is deleted, they will be added to a deletes list.
        // The indices which are added are updated, the routing table for those indices will be added to upserts map.

  }

  DiscoveryNodes nodes;
    // Diff computed using CompleteDiff. 
    // CompleteDiff sends empty object if the previous and current are same. 
    // It sends full object if the previous and current are different.

  ClusterBlocks blocks;
    // Diff computed using CompleteDiff

  Map<String, Custom> customs;
    // Diff computed using DiffableUtils. 
    // But all the Custom classes implement the AbstractNamedDiffable class. 
    // So when DiffableUtils computes the diff, the entries will added to the upserts map. 
    // The values in the upserts map will be CompleteDiff which will have empty object or full objects.

}

Summary

The computation of diff is not very optimized. For index metadata, we compute the index metadata for the indices which have changed. This diff of index metadata will always contain full settings. It will contain full mappings even when there a single field updated in the mapping. This computation of diff is very similar to the incremental metadata that we upload currently where we upload the entire index metadata whenever there is a small change in it. For routing table, even if we upload the diff, the code will be updating for the full routing table for each of the indices which have changed. So this would be the same as the existing approach which is designed for routing table - split the routing table by indices and update them individually as and when they change. So we know that we have to anyway upload all the incrementally changed objects for every cluster state update. If we upload the diff as computed currently which is not optimized fully, we will be uploading a lot of redundant data. So we can just go with the approach uploading an additional object which would help in identifying which components in the cluster state have changed. This object would then be used by the follower nodes to download the incremental changes and apply to the cluster state.

Approach

Based on the above analysis, we need to achieve the below tasks for the remote state publication to work.

Upload Ephemeral Objects to Remote Store

Currently we only publish the Metadata object in the cluster state to remote store. This is sufficient for the durability/persistence use case as the rest of the objects can be recreated by the cluster manager node. But for cluster state publication all the other objects as well. So we would need to publish all the objects to remote store. Listing the extra objects to be published:

Upload a Diff Object

In order to to publish the cluster state, we will first publish a diff file to the remote store. Based on the analysis for diff calculation above it is not optimal to publish a single diff blob to the remote store since in the cases where multiple indices have been changed, the diff blob would contain the entire metadata of all these indices which will make the diff file large. Instead the diff file would just contain the the information about which of the objects that have changed. Based on this information, the file locations for these objects can be fetched from manifest and downloaded. The diff file would look like below:

{
  "from_state_uuid": "wx1urb8cdr",
  "to_state_uuid": "vksot42ta0",
  "metadata_diff": {
    "coordination_metadata_diff": {
      "diff": true
    },
    "settings_diff": {
      "diff": true
    },
    "transient_settings_diff": {
      "diff": true
    },
    "hashes_of_consistent_settings_diff": {
      "diff": false
    },
    "indices_diff": {
      "upserts": ["index2", "index3"]
      "deletes": ["index1"]
    },
    "templates_diff": {
      "diff": true
    },
    "customs_composable_templates_diff": {
      "diff": true
    },
    "customs_repositories_diff":{
      "diff": true
    }
  }
  "discovery_nodes_diff": {
    "diff": false
  },
  "blocks_diff": {
    "diff": true,
  },
  "customs_snaphots_in_progress_diff": {
    "diff": false
  },
  "customs_restore_in_progress_diff": {
    "diff": false
  },
  // routing table will handled separately
}

This diff object can be part of the ClusterMetadataManifest itself as it should not be very large.

Cluster Manager Node Publishes the term and version

Current state: In order to publish the cluster state the cluster manager node sends the diff state to all the nodes over the transport layer. If a follower node cannot apply the diff it responds with an exception and then the cluster manager sends the full state to the node. When a node is trying to join the cluster, the cluster manager sends the cluster state in a validate join request to the joining node. Proposed: The cluster manager node will send only the term and version of the cluster state. The follower node will download the manifest file for the term and version. From the manifest file, the node will get the diff and determine if the diff can be applied. If the diff cannot be applied, it will download all the files for the new cluster state. If the diff can be applied, it will find the objects which have changed and then download the files for all the objects and apply on the old cluster state. In the validate join request as well, the term and version will only be sent. The joining node would download the full state corresponding term and version and perform the validations. The OpenSearch version of the joining node will be checked to ensure we have published the cluster state for this version. New transport request will be created for the above changes.

Considerations

Version Specific Serialization of Cluster State

When the cluster manager is on a certain OpenSearch version, there can be some follower nodes which are on a different version. So the cluster state is sent over the tranport layer by serializing the cluster state according to the target node’s version. So the serialization should is done using writeTo method since this method is version aware. The cluster manager node would have a visibility into the different versions of nodes present in the cluster. It should publish the cluster state for all the versions of nodes present in the cluster. Ideally this scenario should only happen during version upgrade scenario, so there should be a maximum of 2 different versions present at a time in the cluster. So we need analyze and decide how should the serialization be done for remote state publication

peternied commented 6 months ago

[Triage - attendees 1 2 3 4 5 6 7] @soosinha Thanks for creating this RFC, look forward to see results from this discussion