opensearch-project / OpenSearch


[Segment Replication] Update shard allocation to evenly distribute primaries. #5240

Closed mch2 closed 1 year ago

mch2 commented 1 year ago

With segment replication, primary shards will use more node resources than replicas. While the total is still less than with document replication, this leads to uneven resource utilization across the cluster.

We need to explore updating allocation, when segment replication is enabled, to evenly balance primary shards across the cluster.

dreamer-89 commented 1 year ago

Looking into it

dreamer-89 commented 1 year ago

Tried a quick exercise to observe the existing shard allocation behavior. All shards were distributed evenly, but primaries alone were not. This initially looked like a consequence of the existing EnableAllocationDecider, but even after enabling rebalancing based on primaries the result did not change. Digging more into the existing shard allocation mechanism.

Created 6 indices with SEGMENT as the replication type:

{
  "settings": {
    "index": {
      "number_of_shards": 5,  
      "number_of_replicas": 1,
      "replication.type": "SEGMENT" 
    }
  }
}

The primary shard allocation across nodes remains uneven, even after using the cluster.routing.rebalance.enable setting.

➜  OpenSearch git:(main) ✗ curl "localhost:9200/_cat/shards?v" | grep "runTask-1" | grep " p " | wc -l
      13
➜  OpenSearch git:(main) ✗ curl "localhost:9200/_cat/shards?v" | grep "runTask-2" | grep " p " | wc -l
      13
➜  OpenSearch git:(main) ✗ curl "localhost:9200/_cat/shards?v" | grep "runTask-0" | grep " p " | wc -l
       9

➜  OpenSearch git:(main) ✗ curl "localhost:9200/_cat/shards?v" | grep "runTask-0"  | wc -l     
      24
➜  OpenSearch git:(main) ✗ curl "localhost:9200/_cat/shards?v" | grep "runTask-1"  | wc -l
      23
➜  OpenSearch git:(main) ✗ curl "localhost:9200/_cat/shards?v" | grep "runTask-2"  | wc -l
      23

I retried the exercise after enabling rebalancing on primaries but that did not change the result.

{
  "persistent" : {
    "cluster.routing.rebalance.enable": "primaries"
  }
}
dreamer-89 commented 1 year ago

cluster.routing.rebalance.enable gates which shard types can be rebalanced but doesn't govern the actual allocation. The existing allocation is based on shard count (irrespective of primary/replica) and is governed by the WeightFunction inside BalancedShardsAllocator.
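To make that concrete, here is a minimal standalone sketch (plain Java with illustrative constants, not the actual OpenSearch classes) of the existing two-factor weight: a node holding only primaries and a node holding only replicas get exactly the same weight, so the allocator has no reason to even out primaries.

// Minimal standalone sketch of the existing two-factor weight (not the real
// BalancedShardsAllocator code): only total and per-index shard counts matter,
// so a node holding 5 primaries weighs the same as one holding 5 replicas.
public class ExistingWeightSketch {
    static float weight(int nodeShards, float avgShards,
                        int nodeIndexShards, float avgIndexShards,
                        float theta0, float theta1) {
        float weightShard = nodeShards - avgShards;
        float weightIndex = nodeIndexShards - avgIndexShards;
        return theta0 * weightShard + theta1 * weightIndex;
    }

    public static void main(String[] args) {
        float theta0 = 0.45f, theta1 = 0.55f; // illustrative balance factors
        // Two nodes, each with 5 shards of the same index (avg = 5 each):
        // node A holds 5 primaries, node B holds 5 replicas.
        System.out.println("node A (5 primaries): " + weight(5, 5f, 5, 5f, theta0, theta1));
        System.out.println("node B (5 replicas):  " + weight(5, 5f, 5, 5f, theta0, theta1));
        // Both print 0.0 -> the weight cannot distinguish primaries from replicas.
    }
}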

dreamer-89 commented 1 year ago

There are a couple of approaches to solve this:

1. Handle Segment replication enabled shards differently

Add new parameters to the WeightFunction, primaryBalance and primaryIndexBalance, which are used in the weight calculation only for segment replication enabled shards (a rough sketch follows the pros/cons below).

primaryBalance -> number of primary shards on a node
primaryIndexBalance -> number of primary shards of a specific index on a node

Pros

Cons
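
A rough, purely hypothetical sketch of approach 1 (names and structure invented here for illustration, not taken from the actual code): the primary terms contribute to a node's weight only when the index uses segment replication.

// Hypothetical sketch of approach 1 (illustrative names, not actual OpenSearch code):
// the primary-balance terms contribute to a node's weight only for indices that use
// segment replication, so docrep indices keep today's behaviour.
public class SegRepAwareWeightSketch {
    static float weight(boolean segRepIndex,
                        int nodeShards, float avgShards,
                        int nodeIndexShards, float avgIndexShards,
                        int nodePrimaries, float avgPrimaries,
                        int nodeIndexPrimaries, float avgIndexPrimaries,
                        float theta0, float theta1,
                        float primaryBalance, float primaryIndexBalance) {
        float w = theta0 * (nodeShards - avgShards)
                + theta1 * (nodeIndexShards - avgIndexShards);
        if (segRepIndex) {
            // primaryBalance      -> weighs primary shards on the node overall
            // primaryIndexBalance -> weighs primaries of this index on the node
            w += primaryBalance * (nodePrimaries - avgPrimaries)
               + primaryIndexBalance * (nodeIndexPrimaries - avgIndexPrimaries);
        }
        return w;
    }

    public static void main(String[] args) {
        // Same node state, evaluated for a docrep index vs a segrep index:
        // only the segrep case is penalised for holding more primaries than average.
        System.out.println(weight(false, 9, 9f, 3, 3f, 6, 4f, 2, 1f, 0.45f, 0.55f, 0.40f, 0.40f)); // 0.0
        System.out.println(weight(true, 9, 9f, 3, 3f, 6, 4f, 2, 1f, 0.45f, 0.55f, 0.40f, 0.40f));  // 1.2
    }
}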

2. Change allocation logic for all types of shards

Change the existing WeightFunction for all types of shards. The algorithm introduces an additional primaryBalance factor to ensure even distribution of primary shards, followed by replicas.

Pros:

Cons:

[Edit]: Based on the above, approach 2 seems promising as it is simpler and doesn't handle SegRep indices separately, thus avoiding having to reconcile two different balancing algorithms. This approach can be made less intrusive by not enabling the new algorithm by default.

Bukhtawar commented 1 year ago

I think we need to associate default weights per shard based on various criteria, e.g. primary/replica or remote/local, such that these weights are representative of the multi-dimensional work (compute/memory/IO/disk) they do relative to one another. This will ensure we are able to tune these vectors dynamically based on heat as well (long term), once indices turn read-only or request distribution shifts across shards based on custom routing logic.

dreamer-89 commented 1 year ago

> I think we need to associate default weights per shard based on various criteria, e.g. primary/replica or remote/local, such that these weights are representative of the multi-dimensional work (compute/memory/IO/disk) they do relative to one another. This will ensure we are able to tune these vectors dynamically based on heat as well (long term), once indices turn read-only or request distribution shifts across shards based on custom routing logic.

Thanks @Bukhtawar for the feedback and the suggestion. This is definitely useful in the larger scheme of things and something we may need to implement. It will need more thought, discussion, and probably an overhaul of the existing allocation. I am thinking of starting by updating the existing BalancedShardsAllocator and tuning the weight function to incorporate segment replication semantics. We can iterate and make the model more robust in a follow-up. Please feel free to open an issue (with more details?) and we can discuss further.

dreamer-89 commented 1 year ago

Requirement

Ensure even distribution of same-index primaries (and replicas). Replicas also need even distribution because primaries of an index may do varying amounts of work (hot shards), and so do the corresponding replicas.

Goals

As part of phase 1, targeting the goals below for this task.

  1. Shard distribution resulting in overall balance
  2. Minimal side effects on rebalancing
  3. Ensure no negative impact, i.e. no:
    • Transient or permanently unassigned shards (red cluster).
    • Comparatively higher shard movement post allocation & rebalancing, node join & drop, etc.

Assumption

Balancing primary shards evenly for docrep indices does not impact cluster performance.

Approach

As the overall goal is uniform resource utilization among nodes containing various types of shards, we need to use the same scale to measure different shard types. Using separate allocation logic for different shard types, working in isolation, will NOT work. I think it makes sense to update the existing weight function, factoring in primary shards, which solves the use case here and is simpler to start with. This will be the initial work to evolve the existing weight function and incorporate different factors, which might eventually need an allocation logic overhaul. As discussed here, proceeding with approach 2, which introduces a new primary shard balance factor.

Future Improvement

The weight function can be updated to cater to future needs, where different shards have a built-in weight against an attribute (or weighing factor). E.g. a primary with more replicas (higher fan-out) should have a higher weight compared to other primaries, so that the weight function prioritizes even distribution of these primaries FIRST.
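
As a purely hypothetical illustration of that idea (the names below are invented, not existing OpenSearch code), each primary could contribute a load proportional to its replica fan-out rather than counting as 1:

// Purely illustrative sketch of the fan-out idea above (invented names, not
// existing OpenSearch code): each primary contributes 1 + numberOfReplicas, so
// primaries that publish segments to more replicas weigh more and would be
// spread across nodes first.
import java.util.List;

public class FanOutWeightSketch {
    record PrimaryShard(String index, int numberOfReplicas) {}

    static float primaryLoad(List<PrimaryShard> primariesOnNode) {
        float load = 0f;
        for (PrimaryShard p : primariesOnNode) {
            load += 1 + p.numberOfReplicas(); // higher fan-out -> higher weight
        }
        return load;
    }

    public static void main(String[] args) {
        // Two low-fan-out primaries score the same as one high-fan-out primary,
        // even though the plain primary counts differ (2 vs 1).
        System.out.println(primaryLoad(List.of(
            new PrimaryShard("logs", 1), new PrimaryShard("metrics", 1)))); // 4.0
        System.out.println(primaryLoad(List.of(
            new PrimaryShard("search", 3)))); // 4.0
    }
}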

POC

Tried a POC which introduces a new setting to balance shards based on primary shard count, with a corresponding update to the WeightFunction.

[Edit]: This can still result in skew for segrep shards, i.e. unbalanced primary segrep shards on a node even though the overall primary shard balance is within the threshold. Thanks @mch2 for pointing this out.

There are two existing balance factors (items 1 and 2 below). This POC adds a third, PRIMARY_BALANCE_FACTOR_SETTING, together with the primaryWeightShard term shown below, to factor primary shards into allocation.

  1. INDEX_BALANCE_FACTOR_SETTING weighs the per-index shard count on a node. This ensures shards belonging to an index are distributed across nodes.
  2. SHARD_BALANCE_FACTOR_SETTING weighs the total shard count on a node. This ensures the total number of shards per node is balanced across the cluster.
  3. PRIMARY_BALANCE_FACTOR_SETTING weighs the primary shard count on a node. This ensures the number of primary shards per node is balanced.
    
    float weight(ShardsBalancer balancer, ModelNode node, String index) {
        final float weightShard = node.numShards() - balancer.avgShardsPerNode();
        final float weightIndex = node.numShards(index) - balancer.avgShardsPerNode(index);
        final float primaryWeightShard = node.numPrimaryShards() - balancer.avgPrimaryShardsPerNode(); // primary balance factor

        return theta0 * weightShard + theta1 * weightIndex + theta2 * primaryWeightShard;
    }



There is no single value of PRIMARY_BALANCE_FACTOR_SETTING that distributes shards in a way that satisfies all three balance factors for every possible cluster configuration. This is true even today with only INDEX_BALANCE_FACTOR_SETTING and SHARD_BALANCE_FACTOR_SETTING. A higher value for the primary shard balance factor may result in primary balance but not necessarily equal shard counts per node. This probably needs some analysis of shard distribution for different constant weight factors. Added a subtask in the description.
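
To make the trade-off concrete, here is a small standalone calculation over the weight formula above (plain Java with made-up node counts, not the real allocator), assuming the balancer prefers the node with the lower weight: as theta2 grows, the preferred target flips from the node with fewer total shards to the node with fewer primaries.

// Standalone arithmetic on the weight formula above (not the real allocator):
// shows how the preferred node for a new shard flips as the primary balance
// factor (theta2) grows. The index-level term is ignored (assumed equal on both nodes).
public class PrimaryBalanceTradeoff {
    // node A: 10 shards, 2 primaries; node B: 8 shards, 6 primaries
    // cluster averages: 9 shards/node, 4 primaries/node
    static float weight(int shards, int primaries, float theta0, float theta2) {
        return theta0 * (shards - 9f) + theta2 * (primaries - 4f);
    }

    public static void main(String[] args) {
        float theta0 = 0.45f; // illustrative shard balance factor
        for (float theta2 : new float[] { 0.0f, 0.2f, 1.0f }) {
            float a = weight(10, 2, theta0, theta2);
            float b = weight(8, 6, theta0, theta2);
            // the balancer generally prefers the node with the lower weight
            System.out.printf("theta2=%.1f  weight(A)=%.2f  weight(B)=%.2f  -> prefers %s%n",
                theta2, a, b, a < b ? "A (fewer primaries)" : "B (fewer shards)");
        }
    }
}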

@nknize @Bukhtawar @kotwanikunal @mch2 : Request for feedback 
dreamer-89 commented 1 year ago

For subtask 2 mentioned here

The previous change in #6017 introduces a primary shard balance factor but doesn't differentiate between shard types (docrep vs segrep); the end result is an overall balanced primary shard distribution, but not necessarily a balanced one per shard type. For example, if node removal results in 4 unassigned primary shards (2 docrep and 2 segrep), there is a chance of one node getting both docrep shards while the other gets both segrep shards. Changing this logic to accommodate segrep indices is not required because:

  1. LocalShardsBalancer prioritises primaries first for allocation, which results in balanced distribution across nodes. Also, it is not possible to create multiple indices at once, so it always deals with a single index at a time.
  2. The LocalShardsBalancer logic applies to newly created indices only. A failed shard is handled by RoutingNodes#failShard, which promotes an in-sync replica copy (if any), while the existing shard allocator (GatewayAllocator) promotes an up-to-date copy of the shard to primary.
  3. The rebalancing logic is still applied after the failover scenario (in step 2) to keep the primary shard distribution balanced after #6017.

For logging purposes, added an unsuccessful test to mimic the scenario where the LocalShardsBalancer allocation logic is applied here.

dreamer-89 commented 1 year ago

Tracking the remaining work in https://github.com/opensearch-project/OpenSearch/issues/6210: benchmarking, guidance on the default value, and a single sane default value (if possible). Closing this issue.