opensearch-project / OpenSearch

🔎 Open source distributed and RESTful search engine.
https://opensearch.org/docs/latest/opensearch/index/
Apache License 2.0
9.73k stars 1.8k forks source link

[RW Separation] Change search replica recovery flow #15952

Open mch2 opened 1 month ago

mch2 commented 1 month ago

Search replicas are currently configured to recover with the Peer recovery flow. To ensure these shards have reduced (node-node replication) and zero (remote backed) communication with primary shards during recovery we need to change this.

In both cases, I think we can initialize the shards with store recovery steps and then run a round of replication before marking the shards as active. I think we may also need to filter out these allocation Ids from cluster manager updates to the primary, as they should not care about tracking them as in-sync copies.

prudhvigodithi commented 1 week ago

Hey @mch2 from my understanding I have summarized the following on how we can remove the Peer recovery for the search replicas, please check

Filtering Allocation IDs in Cluster Updates:

In OpenSearch, the cluster manager tracks which shards are in-sync with the primary shard (we can check using /_cluster/state). This helps in ensuring data consistency across the cluster. The proposal suggests filtering out the allocation IDs of search replicas from these updates. Since search replicas are read-only and do not participate in writes, the cluster manager doesn’t need to track their state as in-sync copies like it does for regular replicas.

We can create a seperate child issue to track this?

Search Replica Recovery

Check Local Storage and Recover from Local Disk

The replica checks its own disk to see if it already has the necessary segment files from previous operations or previous shard instantiations. If the required data is on the search replica’s local disk, it uses that to recover the shard, skipping any communication with the primary node.

Recover from Remote Storage (if configured):

If the data is not available on the local disk, the replica can pull the data from remote-backed storage to recover the shard. We can even consider skipping the local disk check if Remote Storage is configured.

Replication Process (optional): Can be triggered after the recovery or can wait for next refresh interval

After initializing the replica via store recovery (local or remote), the process can start to ensure the shard is synchronized with the latest segments.

Thank you @getsaurabh02

prudhvigodithi commented 1 week ago

So in example API curl -X GET "http://localhost:9200/_cluster/state/routing_table?filter_path=routing_table.indices.my_sample_index" -H 'Content-Type: application/json' | jq '.' I can see the "state": "STARTED" for "searchOnly": true, is the proposal to remove this so that cluster manager need not track the searchOnly replicas?

{
  "routing_table": {
    "indices": {
      "my_sample_index": {
        "shards": {
          "0": [
            {
              "state": "STARTED",
              "primary": false,
              "searchOnly": false,
              "node": "BfhYEGSERsOMVxXa1BTFFA",
              "relocating_node": null,
              "shard": 0,
              "index": "my_sample_index",
              "allocation_id": {
                "id": "6TMmzm-6S0ip36g5VsWSog"
              }
            },
            {
              "state": "STARTED",
              "primary": false,
              "searchOnly": true,
              "node": "yOfxLMoGQPeVJ1lcHNPMrw",
              "relocating_node": null,
              "shard": 0,
              "index": "my_sample_index",
              "allocation_id": {
                "id": "icMnWUYjTReUN0J2H6UUlA"
              }
            },
            {
              "state": "STARTED",
              "primary": true,
              "searchOnly": false,
              "node": "KCszv1QKSZ-HHqzuAcoYbg",
              "relocating_node": null,
              "shard": 0,
              "index": "my_sample_index",
              "allocation_id": {
                "id": "Y6CRS10hT26SbpkdaiwMsQ"
              }
            }
          ]
        }
      }
    }
  }
}