Closed mch2 closed 2 years ago
Set of crude steps for this task
Below are use cases related to primary allocation
Move shards (including primary) from node A to B. It appears AllocationService orchestrates the shard allocation. It handles allocation by using RoutingNodes (responsible for maintaining shard routing state) and shard allocators (which perform the actual shard allocation). Checking further with an integration test.
On shard failure, the master first tries to promote an active replica (identified from the cluster state in RoutingNodes) with the highest engine version. If no replica is available, the master waits for cluster state updates to trigger primary assignment via PrimaryShardAllocator.
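A minimal sketch of the promotion preference described above, assuming a simplified model (the `ShardCopy` class and its fields are illustrative, not the actual OpenSearch types): among active replicas, pick the one reporting the highest engine version.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical, simplified view of a shard copy as seen by the master.
class ShardCopy {
    final String nodeId;
    final boolean active;
    final long version; // engine version reported by this copy

    ShardCopy(String nodeId, boolean active, long version) {
        this.nodeId = nodeId;
        this.active = active;
        this.version = version;
    }
}

class PromotionSketch {
    // Prefer the active replica with the highest engine version;
    // empty result means no active replica exists and allocation
    // falls through to PrimaryShardAllocator.
    static Optional<ShardCopy> selectNewPrimary(List<ShardCopy> replicas) {
        return replicas.stream()
            .filter(r -> r.active)
            .max(Comparator.comparingLong(r -> r.version));
    }
}
```

An inactive copy is never considered, even if it reports a higher version, mirroring the "promote an active replica first" rule above.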
With this info, separate handling needs to be done for the RoutingNodes.failShard workflow.
Failover scenarios:
Evaluated the option of ignoring primary promotion in RoutingNodes.failShard (failure scenario 1 above, i.e. a node leaving the cluster). RoutingNodes#failShard is also used for updating cluster state, cancelling recoveries, etc. Skipping the primary promotion logic in RoutingNodes.failShard led to multiple assertion failures at different levels. Removing this logic would require multiple changes in the core allocation mechanism and would be a huge effort.
PR: PrimaryShardAllocator primary promotion logic: https://github.com/opensearch-project/OpenSearch/pull/4041
Taking up RoutingNodes.failShard primary promotion logic in https://github.com/opensearch-project/OpenSearch/issues/4131
Closing this in favour of https://github.com/opensearch-project/OpenSearch/issues/4131 which tackles the second part of handling shard failure in RoutingNodes.
With segment replication we would like to avoid situations where replicas contain a segment that differs from the primary's version. After a read-only replica is promoted as the new primary, we will need to index operations that exist in its xlog but not in its index, and make them searchable. The presence of these ops in the replica's xlog means the previous primary had indexed them but had not finished pushing out the latest segments to any/all replicas before failure.
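The replay step above can be sketched as follows. This is illustrative only (the `TranslogOp` and `PromotedPrimarySketch` names are made up, not OpenSearch internals): after promotion, every xlog op above the highest sequence number already covered by the replica's segments is re-indexed so it becomes searchable.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical translog operation: a sequence number plus its document.
class TranslogOp {
    final long seqNo;
    final String doc;
    TranslogOp(long seqNo, String doc) { this.seqNo = seqNo; this.doc = doc; }
}

class PromotedPrimarySketch {
    final List<String> searchableDocs = new ArrayList<>();

    // Re-index every xlog op whose seqNo is above the highest seqNo
    // already present in the promoted replica's segments; these are
    // the ops the failed primary indexed but never shipped as segments.
    void replayAbove(long maxSeqNoInSegments, List<TranslogOp> xlog) {
        for (TranslogOp op : xlog) {
            if (op.seqNo > maxSeqNoInSegments) {
                searchableDocs.add(op.doc);
            }
        }
    }
}
```

Ops at or below the segment-covered sequence number are skipped, since their effects are already in the segments the replica holds.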
As suggested in #2212, to avoid this situation we would like to implement a best-effort approach to select the furthest ahead replica as the new primary and avoid reindexing.
https://github.com/opensearch-project/OpenSearch/issues/2212#issuecomment-1176493330 suggests that we can accomplish this by extending PrimaryShardAllocator's async fetch, that fetches which shards are in sync, to include checkpoint data from each shard when selecting a new primary.
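A rough sketch of that selection, under stated assumptions: the per-shard metadata returned by the async fetch is extended with checkpoint data, and the furthest-ahead in-sync copy wins. The `NodeShardState` fields and the (primary term, segment infos version) ordering here are hypothetical illustrations, not the actual OpenSearch checkpoint format.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical fetch result for one shard copy, extended with
// checkpoint data (names are illustrative, not OpenSearch types).
class NodeShardState {
    final String nodeId;
    final boolean inSync;           // allocation id is in the in-sync set
    final long primaryTerm;         // checkpoint: primary term
    final long segmentInfosVersion; // checkpoint: segment infos version

    NodeShardState(String nodeId, boolean inSync, long primaryTerm, long segmentInfosVersion) {
        this.nodeId = nodeId;
        this.inSync = inSync;
        this.primaryTerm = primaryTerm;
        this.segmentInfosVersion = segmentInfosVersion;
    }
}

class CheckpointAwareSelection {
    // Best-effort: among in-sync copies, order by (primaryTerm,
    // segmentInfosVersion) so the furthest-ahead copy is preferred
    // and reindexing on the new primary is minimized.
    static Optional<NodeShardState> pickPrimary(List<NodeShardState> states) {
        return states.stream()
            .filter(s -> s.inSync)
            .max(Comparator.<NodeShardState>comparingLong(s -> s.primaryTerm)
                .thenComparingLong(s -> s.segmentInfosVersion));
    }
}
```

Note this remains best effort, as the issue says: a copy that fails to respond to the fetch simply isn't a candidate, and an out-of-sync copy is never promoted regardless of its checkpoint.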