Closed mch2 closed 2 years ago
Set of crude steps for this task
Below are use cases related to primary allocation
Move shards (including primary) from node A to B. It appears AllocationService orchestrates the shard allocation. It handles allocation by using RoutingNodes (responsible for maintaining shard routing state) and shard allocators (which perform the actual shard allocation). Checking further with an integration test.
On shard failure, the master first tries to promote an active replica (identified from the cluster state in RoutingNodes) with the highest engine version. If no replica is available, the master waits for cluster state updates to trigger primary assignment via PrimaryShardAllocator.
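A minimal sketch of the promotion preference described above, assuming a simplified model (the `ShardCopy` class and its fields are illustrative, not the actual OpenSearch types): among active replicas, pick the one reporting the highest engine version.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical, simplified view of a shard copy as seen by the master.
class ShardCopy {
    final String nodeId;
    final boolean active;
    final long version; // engine version reported by this copy

    ShardCopy(String nodeId, boolean active, long version) {
        this.nodeId = nodeId;
        this.active = active;
        this.version = version;
    }
}

class PromotionSketch {
    // Prefer the active replica with the highest engine version;
    // empty result means no active replica exists and allocation
    // falls through to PrimaryShardAllocator.
    static Optional<ShardCopy> selectNewPrimary(List<ShardCopy> replicas) {
        return replicas.stream()
            .filter(r -> r.active)
            .max(Comparator.comparingLong(r -> r.version));
    }
}
```

An inactive copy is never considered, even if it reports a higher version, mirroring the "promote an active replica first" rule above.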
With this info, separate handling needs to be done for the RoutingNodes.failShard workflow.
Failover scenarios:
Evaluated the option of ignoring primary promotion in RoutingNodes.failShard (failure scenario 1 above, i.e. a node leaving the cluster). RoutingNodes#failShard is also used for updating cluster state, cancelling recoveries, etc. Skipping the primary promotion logic in RoutingNodes.failShard led to multiple assertion failures at different levels. Removing this logic would require multiple changes in the core allocation mechanism and would be a huge effort.
PR: PrimaryShardAllocator primary promotion logic: https://github.com/opensearch-project/OpenSearch/pull/4041
Taking up RoutingNodes.failShard primary promotion logic in https://github.com/opensearch-project/OpenSearch/issues/4131
Closing this in favour of https://github.com/opensearch-project/OpenSearch/issues/4131 which tackles the second part of handling shard failure in RoutingNodes.
With segment replication we would like to avoid situations where replicas contain a segment that differs from the primary's version. After a read-only replica is promoted as the new primary, we will need to index operations that exist in its xlog but not in its index, and make them searchable. The presence of these ops in the replica's xlog means the previous primary had indexed them but had not finished pushing out the latest segments to any/all replicas before failure.
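The replay step above can be sketched as follows. This is illustrative only (the `TranslogOp` and `PromotedPrimarySketch` names are made up, not OpenSearch internals): after promotion, every xlog op above the highest sequence number already covered by the replica's segments is re-indexed so it becomes searchable.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical translog operation: a sequence number plus its document.
class TranslogOp {
    final long seqNo;
    final String doc;
    TranslogOp(long seqNo, String doc) { this.seqNo = seqNo; this.doc = doc; }
}

class PromotedPrimarySketch {
    final List<String> searchableDocs = new ArrayList<>();

    // Re-index every xlog op whose seqNo is above the highest seqNo
    // already present in the promoted replica's segments; these are
    // the ops the failed primary indexed but never shipped as segments.
    void replayAbove(long maxSeqNoInSegments, List<TranslogOp> xlog) {
        for (TranslogOp op : xlog) {
            if (op.seqNo > maxSeqNoInSegments) {
                searchableDocs.add(op.doc);
            }
        }
    }
}
```

Ops at or below the segment-covered sequence number are skipped, since their effects are already in the segments the replica holds.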
As suggested in #2212, to avoid this situation we would like to implement a best-effort approach to select the furthest ahead replica as the new primary and avoid reindexing.
https://github.com/opensearch-project/OpenSearch/issues/2212#issuecomment-1176493330 suggests that we can accomplish this by extending PrimaryShardAllocator's async fetch, that fetches which shards are in sync, to include checkpoint data from each shard when selecting a new primary.
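A rough sketch of that selection, under stated assumptions: the per-shard metadata returned by the async fetch is extended with checkpoint data, and the furthest-ahead in-sync copy wins. The `NodeShardState` fields and the (primary term, segment infos version) ordering here are hypothetical illustrations, not the actual OpenSearch checkpoint format.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical fetch result for one shard copy, extended with
// checkpoint data (names are illustrative, not OpenSearch types).
class NodeShardState {
    final String nodeId;
    final boolean inSync;           // allocation id is in the in-sync set
    final long primaryTerm;         // checkpoint: primary term
    final long segmentInfosVersion; // checkpoint: segment infos version

    NodeShardState(String nodeId, boolean inSync, long primaryTerm, long segmentInfosVersion) {
        this.nodeId = nodeId;
        this.inSync = inSync;
        this.primaryTerm = primaryTerm;
        this.segmentInfosVersion = segmentInfosVersion;
    }
}

class CheckpointAwareSelection {
    // Best-effort: among in-sync copies, order by (primaryTerm,
    // segmentInfosVersion) so the furthest-ahead copy is preferred
    // and reindexing on the new primary is minimized.
    static Optional<NodeShardState> pickPrimary(List<NodeShardState> states) {
        return states.stream()
            .filter(s -> s.inSync)
            .max(Comparator.<NodeShardState>comparingLong(s -> s.primaryTerm)
                .thenComparingLong(s -> s.segmentInfosVersion));
    }
}
```

Note this remains best effort, as the issue says: a copy that fails to respond to the fetch simply isn't a candidate, and an out-of-sync copy is never promoted regardless of its checkpoint.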