opensearch-project / OpenSearch

🔎 Open source distributed and RESTful search engine.
https://opensearch.org/docs/latest/opensearch/index/
Apache License 2.0

[Segment Replication] Allocation changes to distribute primaries - Benchmark & document improvements using opensearch-benchmark. #6210

Closed mch2 closed 1 year ago

mch2 commented 1 year ago

https://github.com/opensearch-project/OpenSearch/pull/6017 introduced a new weight factor that attempts to evenly distribute primary shards. We need to document this change, including benchmarks, to 1) find a reasonable default for SR-enabled indices and 2) prove when using the setting is beneficial.

Related https://github.com/opensearch-project/documentation-website/issues/2663

dreamer-89 commented 1 year ago

Looking into this.

dreamer-89 commented 1 year ago

@shwetathareja's comment from PR #6325

My initial thoughts:

Have you looked into adding an AllocationConstraint similar to isIndexShardsPerNodeBreached which essentially adds a constraint to ensure each node shouldn't have more than the avg. no. of shards per index even if a node has overall fewer shards compared to other nodes. This helps in preventing unnecessary rebalances later. More details are here - #487

I feel you can simply add a constraint for segrep indices to ensure a node shouldn't have more than the avg. no. of expected primaries for an index. Let's say it is a 5 node cluster and the segrep index has 5 primaries + 2 replicas. So no node should have more than 1 primary. This wouldn't require coming up with a magic value for the weight factor. Later, we can evaluate if we want to extend this to balancing the sum of primaries across all indices across all nodes as well. e.g. 1 segrep index has 3 primaries while another has 2 primaries and it is a 5 node cluster, then each node shouldn't have more than 1 primary across indices as well.

Yes, this is a great idea! I think having a constraint (here, the average per-index primary shard count per node; let's call it C1) supplements the existing weight function. The existing weight function is unaware of the contribution of the different factors (shard count, same-index shard count), which allows shard skewness under certain conditions even with the primary shard weight factor (added in #6017). There are situations where nodes are balanced based on the weight calculation but not when considering primaries alone; it is possible for one factor to cancel out the weight contributions of the others. E.g. a 2-node setup with 25 shards and 1 replica can end up with Node1 holding 10 primaries and 15 replicas and Node2 holding 15 primaries and 10 replicas, i.e. uneven primaries but an overall balanced shard count. Constraints that add a fairly large weight when breached will be more useful in solving the uneven primary shard distribution that results from failover and from the balancing logic.
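
To make that example concrete: with 25 shards and 1 replica there are 50 shard copies, i.e. an average of 25 per node, and both nodes hold exactly 25. The shard-count terms of the weight function (the weightShard and weightIndex terms in the POC below) therefore evaluate to 25 - 25 = 0 on both nodes, so the balancer sees a perfectly balanced cluster even though one node carries 15 primaries and the other only 10.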

Proposal

Approach 1.

Introduce an average primary shard count constraint in the existing allocation & rebalancing logic, which ranks nodes based on weights and selects the one with the minimum weight as the allocation and/or rebalancing target. A node is marked as breaching the constraint when it contains more than the average number of primary shards of an index. Such nodes are penalized by adding a very high constant to their weight calculation, which lowers their chances of being selected as an allocation or rebalancing target. E.g. for a 4-node, 10-shard, 1-replica setup, C1 (the average per-index primary shard count per node) will be 2.5 (10/4).

Special consideration: during rebalancing, when this constraint is breached, it is advantageous to move a primary first vs. a random shard, to converge toward a smaller delta (the weight difference between the heaviest and lightest nodes). Since balancing is performed at the index level, this approach ensures a uniform primary shard distribution per index when rebalancing concludes.

I tried a quick POC by updating the weight function (as coded below), and it works, resulting in a balanced primary shard distribution.

float weight(ShardsBalancer balancer, ModelNode node, String index) {
    // Existing factors: deviation from the average shard count, overall and per index.
    final float weightShard = node.numShards() - balancer.avgShardsPerNode();
    final float weightIndex = node.numShards(index) - balancer.avgShardsPerNode(index);

    // Constraint C1: penalize the node if it holds more than the average number of
    // primaries for this index, proportionally to how many primaries it is over by.
    int extraWt = 0;
    if (node.numPrimaryShards(index) - balancer.avgPrimaryShardsPerNode(index) > 0) {
        int extraPrimaryCount = (int) (node.numPrimaryShards(index) - balancer.avgPrimaryShardsPerNode(index));
        extraWt += extraPrimaryCount * 1000; // large constant so breaching nodes rank last
    }
    return theta0 * weightShard + theta1 * weightIndex + extraWt;
}
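
As a quick check of this POC against the 4-node, 10-shard, 1-replica example above (avgShardsPerNode = 20/4 = 5, avgPrimaryShardsPerNode(index) = 2.5): a node holding 4 primaries and 2 replicas gets weightShard = 6 - 5 = 1 and extraPrimaryCount = (int)(4 - 2.5) = 1, so 1000 is added to its weight and it is strongly deprioritized as an allocation or rebalancing target until its primary count falls back toward the average.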

Pros.

  1. Simple implementation
  2. Easy to comprehend

Cons.

  1. More relocations are needed to attain a balanced distribution compared to the existing world where primary and replica are treated as equal. Where primary balance is desired, there can be primary shard movements (to satisfy the primary constraint C1) followed by another shard movement (probably a replica) to even out shard counts.
    a. For workloads where replicas are lightweight and their balance is not essential, a high threshold can be set so that the delta threshold trips only when primary shards are unbalanced.
    b. During primary shard relocation, if the target node already contains the replica shard, the primary & replica roles could be exchanged. Discussed more in improvement 1 at the end.
  2. Not necessarily a balanced primary distribution globally. The nodes below are balanced at the index level but not when considering the overall primary count. This can be solved by adding another constraint on the total primary shard count per node.
Node 1: Index1, Index2, Index3, Index4
Node 2: Index1, Index2
Node 3: (none)
  3. Edge case where the cluster is not primary balanced even for a single index. E.g. this may happen post-failover, where SameShardAllocationDecider prevents primary movement from Node 1 -> Node 2 (Pi is a primary and Ri is its corresponding replica). This is an edge case and can be solved by improvement 1.
Node 1: P1, P2
Node 2: R1, R2

Approach 2.

Perform balancing once at a global level, across all indices and nodes, using the algorithm below.

Steps
  1. Calculate weight on all nodes using weight with constraint above.
  2. Sort all nodes based on weight.
  3. while(true)
    • wt_diff = weight[heavy node] - weight[light node]
    • If the weight diff > t (default threshold of 1.0f), move a random primary from the heavy -> light node (see the sketch after these steps)
      • update weight of both nodes
      • sort and continue // repeat to attain more balance
    • else break // no more imbalance
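
A minimal sketch of this loop under simplified assumptions (the NodeWeight model, the GlobalPrimaryBalancer class, and the fixed per-primary weight change are hypothetical and ignore allocation deciders; this is not OpenSearch code):

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical node model: current constraint-aware weight plus its primary count.
final class NodeWeight {
    final String nodeId;
    float weight;
    int primaries;

    NodeWeight(String nodeId, float weight, int primaries) {
        this.nodeId = nodeId;
        this.weight = weight;
        this.primaries = primaries;
    }
}

final class GlobalPrimaryBalancer {
    private static final float THRESHOLD = 1.0f;           // default delta threshold from step 3
    private static final float PER_PRIMARY_WEIGHT = 1.0f;  // assumed weight change per moved primary

    // Repeatedly move one primary from the heaviest node to the lightest node until the
    // weight difference drops below the threshold (or the heavy node has nothing to move).
    static void balance(List<NodeWeight> nodes) {
        while (true) {
            nodes.sort(Comparator.comparingDouble((NodeWeight n) -> n.weight)); // step 2: sort by weight
            NodeWeight light = nodes.get(0);
            NodeWeight heavy = nodes.get(nodes.size() - 1);
            if (heavy.weight - light.weight <= THRESHOLD || heavy.primaries == 0) {
                break;                              // no more imbalance
            }
            heavy.primaries--;                      // move one primary heavy -> light
            light.primaries++;
            heavy.weight -= PER_PRIMARY_WEIGHT;     // update both weights, then re-sort
            light.weight += PER_PRIMARY_WEIGHT;
        }
    }

    public static void main(String[] args) {
        List<NodeWeight> nodes = new ArrayList<>(List.of(
            new NodeWeight("n1", 3.0f, 3),
            new NodeWeight("n2", 0.0f, 0)));
        balance(nodes);
        nodes.forEach(n -> System.out.println(n.nodeId + " primaries=" + n.primaries));
    }
}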

Pros.

  1. Minimal shard relocations with overall primary balance

Cons.

  1. Hot spots due to multiple shards of the same index being assigned to one node.

Approach 3.

Combine the previous two approaches so that an overall balance is attained. First, balance the nodes for a single index (approach 1), followed by another round of balancing (approach 2).

Pros.

  1. Results in a balanced primary shard distribution both per index and across all indices.

Cons

  1. Additional relocations from the second round.
  2. Complicated logic.
  3. Problems such as:
    • both rounds negating the effect of the other (e.g. below)
    • an endless loop of balancing actions.
Node 1: Index1, Index1
Node 2: Index2, Index3

First round results in moving Index1 shard to Node 2

Node 1: Index1
Node 2: Index2, Index3, Index1

Second round moves it back

Node 1: Index1, Index1
Node 2: Index2, Index3

Conclusion

~~Though approach 3 seems promising, relooking at approach 1's shortcomings: the primary skew is bounded to 1 (because of the primary constraint, which assigns a higher weight to a node whose primary count exceeds the average, resulting in an even distribution). When the shard count is larger, the overall distribution is near balanced. Approach 3 seems to be over-optimizing, and its cost (complexity & effort) appears greater than the returns; it can be considered in the future if use cases demand it. Thus, approach 1 seems better. Also, as this changes the core allocation logic, it is better to keep it behind a user-exposed dynamic setting (default false). This is useful for users who want to keep the existing allocation behavior or who run into trouble with the new allocation logic.~~

Approach 1 seems promising and can be extended to solve primary balancing across all indices by adding a new constraint. I will add this as a follow-up to https://github.com/opensearch-project/OpenSearch/pull/6422

Improvements

  1. Coming from the cons of approach 1, more relocations may be needed to attain overall balance (problem 1), and in some situations a shard relocation is not possible at all (problem 2) due to SameShardAllocationDecider. Both of these problems can be alleviated if there is a mechanism to swap the primary and replica shards. Created issue https://github.com/opensearch-project/OpenSearch/issues/6481 to discuss this.

Problem 1 (More relocations)

Initial state (Node 1 holds both primaries):

Node 1: P1, P2
Node 2: R3, R4

After moving P2 to satisfy the primary constraint:

Node 1: P1
Node 2: R3, R4, P2

After a second relocation (R4) to even out the shard counts:

Node 1: P1, R4
Node 2: R3, P2

Problem 2 (No relocation possible)

Node 1: P1, P2
Node 2: R2, R1
  2. Existing constraints are predicates which return yes/no based on whether a node has breached the constraint or not. It would be better if a constraint returned the degree by which it was breached; that quantity can then be incorporated into the node's weight calculation (a rough sketch follows). Consider an example of 4 nodes (N1, N2, N3, N4) and 1 index (8 shards, 1 replica), with 2 as the C1 value. Nodes N3 and N4 are excluded from allocation. Node N1 contains 3 primaries and 5 replicas (breaching the primary threshold by 1) while node N2 contains 5 primaries and 3 replicas (breaching it by 3). Here, both nodes have unequal primary counts and breach C1, yet they are balanced (equal weight, because all shards belong to the same index and both nodes hold the same number of shards). For shard allocation (increasing the replica count) and rebalancing, either node can be picked as the target, while it is preferable to select N1, which contains fewer primaries.
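
A rough sketch of the difference (hypothetical class and helper names, not the existing constraint framework): with a predicate-style penalty both nodes look equally bad once they breach C1, while a graded penalty ranks N2 behind N1.

// Illustration only; not the actual OpenSearch constraint classes.
final class ConstraintPenaltyExample {
    static final long CONSTRAINT_WEIGHT = 1_000_000L; // fixed high penalty per breached constraint

    // Predicate style (today): N1 (3 primaries) and N2 (5 primaries) get the same penalty when the average is 2.
    static long predicatePenalty(int primariesOnNode, float avgPrimariesPerNode) {
        return primariesOnNode > avgPrimariesPerNode ? CONSTRAINT_WEIGHT : 0L;
    }

    // Graded style (suggested): the penalty grows with how far the node is over the average,
    // so N2 (over by 3) ranks behind N1 (over by 1) even when their base weights are equal.
    static long gradedPenalty(int primariesOnNode, float avgPrimariesPerNode) {
        int degree = Math.max(0, (int) Math.ceil(primariesOnNode - avgPrimariesPerNode));
        return degree * CONSTRAINT_WEIGHT;
    }

    public static void main(String[] args) {
        float avg = 2.0f; // C1 from the 4-node, 8-shard, 1-replica example
        System.out.println("predicate: N1=" + predicatePenalty(3, avg) + ", N2=" + predicatePenalty(5, avg));
        System.out.println("graded:    N1=" + gradedPenalty(3, avg) + ", N2=" + gradedPenalty(5, avg));
    }
}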

@shwetathareja @nknize @Bukhtawar @vigyasharma : Request for feedback. Please feel free to add more folks who can help here.

shwetathareja commented 1 year ago

Thanks @dreamer-89 for exploring the AllocationConstraint approach.

A couple of comments on the different approaches:

Approach 1

  1. While calculating the extra weight, the current logic can result in a -ve weight as well: int extraPrimaryCount = (int)(node.numPrimaryShards(index) - balancer.avgPrimaryShardsPerNode(index)); Instead, you can choose to not add extra weight unless the constraint is breached.
  2. Why do you think there will be extra re-balances? This will ensure primaries are balanced across nodes even though the overall shard count is lower on that node compared to others.
  3. Yes agreed, we will need to define a different constraint (or the same constraint handling both conditions) for balancing overall primaries across all indices if desired, which you have discussed in Approach 2 & 3.
  4. Yes, you are right; if I remember right, re-balancing was not taking the constraint into account. We can evolve the allocation algorithm for re-balances as well. I will check that part of the code again.

Approach 2

As you mentioned, it is only balancing across all indices and won't be effective per index, and that would be important from a log analytics customer perspective, where only the latest index will be taking writes after rollover.

Approach 3

For segrep, I feel this would certainly add value considering that resource usage across primary and replica is very skewed. I think we need to re-look at the rebalancing algo to tune the constraint calculation logic and prevent unnecessary movements.

Overall, I am aligned to start with per index primaries to be evenly distributed and later iterate over to make it evenly distribute across indices primaries as well.

dreamer-89 commented 1 year ago

Thank you so much @shwetathareja for the feedback! I have implemented Approach 1, which employs the average primary shard count per node constraint during allocation and rebalancing - PR #6422 (request you to review as well).

Approach 1

  1. While calculating the extra weight, the current logic can result in a -ve weight as well: int extraPrimaryCount = (int)(node.numPrimaryShards(index) - balancer.avgPrimaryShardsPerNode(index)); Instead, you can choose to not add extra weight unless the constraint is breached.
  2. Why do you think there will be extra re-balances? This will ensure primaries are balanced across nodes even though the overall shard count is lower on that node compared to others.
  3. Yes agreed, we will need to define a different constraint (or the same constraint handling both conditions) for balancing overall primaries across all indices if desired, which you have discussed in Approach 2 & 3.
  4. Yes, you are right; if I remember right, re-balancing was not taking the constraint into account. We can evolve the allocation algorithm for re-balances as well. I will check that part of the code again.
  1. The extraPrimaryCount calculation is guarded by the if (node.numPrimaryShards(index) - balancer.avgPrimaryShardsPerNode(index) > 0) condition, so it should not result in a negative weight contribution. This logic was just for the POC, to verify that it works.

  2. An extra relocation is needed to attain primary balance. This is done by first offloading the extra primary on the heavier node (due to the breached avg. primary shard count per node constraint), which may leave the target node with an extra shard and violate the shard-count threshold, needing another relocation (probably of a replica shard). During rebalancing, the primaries are relocated first (link - from the PR above) from the heavier to the lighter node. This is needed to converge the delta between the heavy & light nodes and to prevent never-ending shard relocations. Once a primary is moved, it may result in a shard count imbalance (weight diff higher than the default threshold of 1.0f), which results in one more shard relocation. This can be avoided if there is support for dynamically exchanging a primary and a replica shard (also called out in the Cons section). For example, consider 3 nodes with one index (3 shards, 1 replica) below.

State 1

Node 1: P1, R3
Node 2: P2, R1
Node 3: P3, R2

Node 3 is restarted and the coordinator promotes R3 (on Node 1) to primary, resulting in the state below; Node 3 then rejoins the cluster holding R3 and R2.

State 2

Node 1: P1, P3
Node 2: P2, R1
Node 3: R3, R2

Node 1 breaches the average primary shard count per node constraint (2 primary shards > the allowed average of 1), so the rebalancer moves P1 from Node 1 -> Node 3 (and not Node 2, because 1. the primary count difference does not allow it and 2. the SameShardAllocationDecider filters Node 2 out, since it already holds R1). This results in the state below.

State 3

Node 1: P3
Node 2: P2, R1
Node 3: R3, R2, P1

Node 3 now has more shards, which still breaches the default 1.0f threshold. The rebalancer moves R2 back to Node 1, resulting in the eventual balance below. If such fine-tuned balance is not desired, a higher threshold can be set, which stops at state 3 above.

State 4

Node 1: P3, R2
Node 2: P2, R1
Node 3: P1, R3
  3. Yes, adding another total primary shard count per node constraint is a good idea. Let's continue the discussion here (and on the PR) and come up with use cases where this is needed.
  4. Yes, rebalancing doesn't use any constraint today. With the PR, I added the average primary shard count based constraint to be used during both allocation and rebalancing.

Approach 2 As you mentioned, it is only balancing across all indices and won't be effective per index and that would be important from log analytics customer perspective where only the latest index will be taking writes after rollover.

👍

Approach 3 For segrep, I feel this would certainly add value considering resource usage across primary and replica is very skewed. I think we need to re-look at the rebalancing algo to tune the constraint calculation logic and prevent unnecessary movements.

I think applying both constraints (avg primary shard count per index and avg primary shard count across all indices) together will be useful as it will result in minimal relocations.

Overall, I am aligned to start with per index primaries to be evenly distributed and later iterate over to make it evenly distribute across indices primaries as well.

Thank you! Yes, per index primary balancing is implemented in https://github.com/opensearch-project/OpenSearch/pull/6422; we can add global balancing constraint as follow up based on discussion here.

vigyasharma commented 1 year ago

Great discussion here, @dreamer-89 and @shwetathareja. This is a tricky, but important problem to solve. Happy to see the original Constraint Based Shard Allocation idea continue to find utility over the years!

I think we are on the right track here. I like the option of adding two constraints, one for "primaries per index per node", and another one for "total primary shards per node". ...

Existing constraints are predicates which return yes/no based on whether node has breached constraint or not. It would be better if constraint returns a degree by which a constraint is breached

I would strongly suggest thinking hard about how to keep the shard weight algorithm as simple as possible. It is tempting to add more capabilities, but it quickly becomes exponentially complex to reason about and debug shard allocation in a production cluster. This was why, in the original implementation, I had made each constraint add a fixed high-value weight, and modeled constraints as boolean predicates.

As we add multiple constraints, there are no clear answers as to whether the breach of one particular constraint is worse than the breach of a different constraint (e.g. is it worse to breach the primariesPerIndexPerNode constraint, or the totalPrimariesPerNode constraint?). If only a single primary has recoveries, the former can be bad. If multiple primaries have recoveries, and the target shards for 2 different indexes land on the same target node, the latter can be worse. The point is, both are bad, and engineering around which is worse adds complexity with diminishing returns.

...

Consider this with an example, with avg primary shards count per node as 2 and two contender nodes (N1, N2) where N1 contains 3 primary shards (breach threshold by 1) while N2 contains 5 primary shards (breach threshold by 3), here N2 should not be preferred.

Constraint framework today is simple - it adds numConstraintsBreached * CONSTRAINT_WEIGHT to the weight returned by default weight function. In the example above, N2 will already have a high value from default weight function. So even with the same CONSTRAINT_WEIGHT values, it will not be preferred.

That is the central idea. Let a simple default weight function do its work. Then all eligible nodes with no constraints breached are better than ones with 1 constraint breached, which are better than nodes with 2 constraints breached, and so on. This works in tandem with the "Allocation Decider" framework, which already rules out nodes that should not even be considered for allocation or rebalance. Deciders filter out ineligible nodes; constraints deprioritize less preferred nodes.
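
A simplified sketch of that tiering (hypothetical names and values, not the actual OpenSearch constraint framework classes):

import java.util.Comparator;
import java.util.List;

// Illustration only: constraints add a fixed high penalty on top of the base weight.
final class ConstraintTiering {
    static final float CONSTRAINT_WEIGHT = 1_000_000f;

    record Node(String id, float baseWeight, int constraintsBreached) {
        // Deciders have already removed ineligible nodes; constraints only deprioritize.
        float allocationWeight() {
            return baseWeight + constraintsBreached * CONSTRAINT_WEIGHT;
        }
    }

    public static void main(String[] args) {
        List<Node> eligible = List.of(
            new Node("n1", 2.0f, 0),  // slightly heavier by base weight, but no constraint breached
            new Node("n2", 0.5f, 1),  // lighter, yet one breach pushes it behind n1
            new Node("n3", 0.0f, 2));
        eligible.stream()
            .sorted(Comparator.comparingDouble((Node n) -> n.allocationWeight()))
            .forEach(n -> System.out.println(n.id() + " -> " + n.allocationWeight()));
        // Nodes with fewer breached constraints always sort first; the base weight
        // only breaks ties within the same tier.
    }
}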

Finally, I think it makes sense to extend the framework to handle rebalances. I did play around with it at the time, but threw it away, since allocation was the only problem we were solving then. Looks like now is a good time to add it. If you add constraint handling to rebalances, would it take care of the extra rebalances you pointed out @dreamer-89 ?

vigyasharma commented 1 year ago

Re: imbalances that emerge from replica -> primary promotions.. I think you need the pri-rep swap capability to truly be able to solve it. There is no other way (I know of), to solve that problem for a 2 node cluster.

dreamer-89 commented 1 year ago

Thank you @vigyasharma for the feedback.

Consider this with an example, with avg primary shards count per node as 2 and two contender nodes (N1, N2) where N1 contains 3 primary shards (breach threshold by 1) while N2 contains 5 primary shards (breach threshold by 3), here N2 should not be preferred.

Constraint framework today is simple - it adds numConstraintsBreached * CONSTRAINT_WEIGHT to the weight returned by default weight function. In the example above, N2 will already have a high value from default weight function. So even with the same CONSTRAINT_WEIGHT values, it will not be preferred.

Sorry, I was not clear enough in the example. I meant a state where N1 and N2 have unequal primary shard counts but are still balanced per the weight calculation. I updated the example and copy-pasted it below as well. Overall, I agree; let's keep the constraint as a predicate and not change it until it is really needed.

  2. Existing constraints are predicates which return yes/no based on whether a node has breached the constraint or not. It would be better if a constraint returned the degree by which it was breached; that quantity can then be incorporated into the node's weight calculation. Consider an example of 4 nodes (N1, N2, N3, N4) and 1 index (8 shards, 1 replica), with 2 as the C1 value. Nodes N3 and N4 are excluded from allocation. Node N1 contains 3 primaries and 5 replicas (breaching the primary threshold by 1) while node N2 contains 5 primaries and 3 replicas (breaching it by 3). Here, both nodes have unequal primary counts and breach C1, yet they are balanced (equal weight, because all shards belong to the same index and both nodes hold the same number of shards). For shard allocation (increasing the replica count) and rebalancing, either node can be picked as the target, while it is preferable to select N1, which contains fewer primaries.

That is the central idea. Let a simple default weight function do its work. Then all eligible nodes with no constraints breached are better than ones with 1 constraint breached, which are better than nodes with 2 constraints breached, and so on. This works in tandem with the "Allocation Decider" framework, which already rules out nodes that should not even be considered for allocation or rebalance. Deciders filter out ineligible nodes; constraints deprioritize less preferred nodes.

Yes, I agree. Also, defining the relative priority of multiple constraints would be difficult and needs more brainstorming.

Finally, I think it makes sense to extend the framework to handle rebalances. I did play around with it at the time, but threw it away, since allocation was the only problem we were solving then. Looks like now is a good time to add it. If you add constraint handling to rebalances, would it take care of the extra rebalances you pointed out @dreamer-89 ?

Yes, I added the constraint to the weight calculation during rebalancing. Sorry, I was not clear: by extra rebalances I meant extra relocations compared to the existing world where primaries and replicas are treated as equal. With the proposal here, we need to move a primary to attain primary balance, followed by another relocation so that the weight diff between any two nodes is within the threshold (default 1.0f).

Re: imbalances that emerge from replica -> primary promotions.. I think you need the pri-rep swap capability to truly be able to solve it. There is no other way (I know of), to solve that problem for a 2 node cluster.

Yes, agreed. Created https://github.com/opensearch-project/OpenSearch/issues/6481 to track the discussion.