Introduction

In RFC https://github.com/opensearch-project/OpenSearch/issues/12918, we discussed potential approaches for splitting of a shard in-place and requested feedback on the same as well as use cases which may need to be additionally addressed while designing In-Place shard split. We published the design for splitting a shard in-place in issue https://github.com/opensearch-project/OpenSearch/issues/13923 where we raised the need for devising a mechanism to route operations on child shards and explained the operation routing briefly. In this issue, we will talk about how operations are routed in OpenSearch, delve into different operation routing approaches for in-place shard splitting, discuss their pros and cons and finally add benchmark results comparing the performance of each approach.

Operation Routing In OpenSearch

Currently, document routing algorithm in OpenSearch is governed by the following formulae

routing\_factor(RF) = \frac{num\_routing\_shards(RS)}{num\_primary\_shards(P)}$$

shard\_num = \frac{hash(\_routing) \% num\_routing\_shards(RS)}{routing\_factor(RF)}$$

Here,shard_num is effective routing value which is derived from document’s _id which is termed here as _routing. Custom routing patterns can be implemented by specifying a custom routing value per document.

Let’s try to understand the algorithm visually

Our hash function i.e. murmur hash has a practical key range from $0$ to $2^{31}−1$ but ideally we restrict the range to 1024 by default i.e. the allowable max number of shards on a single node so that if we divide our hash space to 1024 parts, each part can be assigned to a single shard.

When we create P primary shards, document hash space needs to be divided into equal parts. The default upper limit of max routing number of shards i.e. 1024 may not be divisible by P which could result in unequal hash space distribution. The value 1024 is nothing but the default number of partitions of the document hash space which is frozen during index creation. A shard of an index is assigned a subset of partitions. For e.g. if an index is created with 8 shards then each shard consists of 128 partitions. Any scaling of shards from x to y in resize actions like shrink index and split index cases will happen based on the number of partitions. For uniform distribution of hash space, all shards should have equal number of partitions otherwise it would result in a collision of keys according to pigeonhole principle. Therefore, number of partitions would be the highest multiple of P less than or equal to 1024. For e.g., for an index consisting of 5 shards, number of partitions assigned to each shard would be 204. This number is called as the number of number of routing shards of an index. In resize action of OpenSearch like splitting an index where a new index is created and recovered from an existing index, the number of partitions are recalculated. As we saw earlier that number of partitions should always be equally distributed among shards, number of shards of the new index can only be a number exactly divisible by number of shards of the existing index. In addition, new number of shards should be less than the routing number of shards of existing index. Mathematical relation between at most n splits of a shard of an index having P primary shards and upper limit of routing shards equal to 1024 can be expressed as $2^n*P < 1024$.

Similarly, number of partitions assigned on each shard can be expressed as: $num\_routing\_shards(RS)/num\_primary\_shards(P) = (2^n*P)/P = 2^n$

This is called $routing\_factor(RF) = 2^n$

Once, we have RS and RF then its easy to figure out the shard number by figuring out the shard in which the hash of routing id should lie in, given by $shard\_id = \frac{hash(\_routing) \% RS}{RF}$

Describe the solution you'd like

Operation routing design requirements for in-place shard-split:

There shouldn’t be any data loss before, during or after a shard is split
The key space or the hash range which parent shard serves should be split/scaled in a way that it is uniformly distributed among new shard(s) and division should be mutually exclusive

Approaches

We will go over two approaches and discuss how each of them can effectively route an operation on child shards.

Approach 1 - Routing using Shard Hash space [Recommended]

In this approach each seed shard(shards present at the time of index creation) is assigned a hash space. We are using murmurhash3_x86_32 algorithm to calculate hash of _routing before deducing the respective routing shard. It means that absolute hash values could only lie within 0x00000000 - 0xffffffff range (signed int). Therefore, each seed shard is assigned a hash space from 0 to 2^32 -1 . Whenever a shard split happens its hash space gets divided equally between the child shards. Let’s consider a mental model where shards are arranged according to the start and end values of acceptable hash values.

If split 1:4 is performed on a primary shard , upcoming key space partition will look like:

In this case, all docs getting hash between 0x00000000 - 0x3FFFFFFF will get route to child shard 0, hash between 0x40000000 - 0x7FFFFFFF will get to child shard 1 and so on.

We can further split any of the child shard for e.g. say we split the child shard 2 above into two shards then hash space partition will look like:

In this further split, child shard 2 will cease to exist to make way to shard 4 and 5 which will serve ranges 0x80000000 - 0x9FFFFFFF and 0xAFFFFFFF - 0xBFFFFFFF respectively. Information of these “high” and “low” values needs to be saved in the cluster state.
Note that high value of key space of each primary shard is always increasing, so, one optimization we'll perform is binary searching over these values to efficiently get the routing_shard. But, the necessary information needs to be pre-computed and stored in the cluster state as well otherwise benefits of binary search would gravely out-weight construction.

Mathematically routing algorithm can be defined as follows:

$routing\_factor(RF) = \frac{num\_routing\_shards(RS)}{num\_primary\_shards(P)}$ $seed\_shard\_id = \frac{hash(\_routing) \% num\_routing\_shards(RS)}{routing\_factor(RF)}$

Note: Each seed shard maintains its own hash space and the distribution of the hash space in child shards

A hash space is defined as $Range_i = [l_i, ri)$ Each shard has its own hash range and they are comparable i.e. $Range_i < Range_j$ <=> $l_i < l_j$ $seed\_shard\_ranges = Sorted{Range{child\_shard1},Range{child\_shard_2}, ... }$

Objective: To find the hash range which contains our hash value. Since, shard ranges are sorted we can find a range with starting point just less than or equal to hash value by using binary search. We define a function: $ceil({x_1, x_2, x_3 ....}, x)$ which outputs index i such that value $x_i \in {x_1, x_2, x_3, ...}$ just less than equal to $x$ and is based on binary search with a complexity of $O(log_2(len({x_1, x_2, x_3,....})))$

shard\_id = ceil(\{ Range_{child\_shard_1}, Range_{child\_shard_2}, ... \}, hash(\_routing))

Advantages:

Does not limit the number of times a shard can be split in an existing index
Easier to understand and maintain
Faster if depth of split is higher than logarithmic base 2 of number of shards. For e.g., if a shard gets split 3 times and at each level 2 child shards are produced then this approach will perform better as compared to the previous approach since logarithmic base 2 of 4 shards i.e. 2 is lesser than the depth i.e. 3.

Disadvantages:

Performs slow if depth of splits is lower than logarithmic base 2 of number of shards.

Implementation details:

In Opensearch we'll maintain the keyspace for each shard as SplitMetadata which keeps a map of shard id and range assigned to the specific shard. SplitMetadata also contains a flat sorted set of keyspace ranges assigned to each split shard per seed shard which makes the lookup of hash to shard id it belongs a much efficient operation by using binary search. We keep track of only those shards which have been split or are created as a result of split in SplitMetadata and therefore in case of no split there’s no overhead increase in index metadata. During shard split i.e. when recovery is happening we don’t keep child shard keyspace ranges in the sorted set but maintain them in an ephemeral list of keyspace ranges in the parent shard metadata. Once, the recovery is complete and shard split is marked completed we remove the ephemeral child shard IDs and add it to our maintained sorted set for routing of documents.

Approach 2 - “Recursive Routing ”

Based on our current routing algorithm in opensearch, routing terms are expressed as follows:

$P = num\_initial\_primary\_shards$ $n={floor(log_2(1024/P))}$ $RF = 2^n$ $RS=RF*P$ $hash\_value = hash(\_routing)$

Consider, shard $p_i$ gets split into $P_i$ shards, we’ll calculate new routing parameters i.e. RS & RF for the child shards

$RF_i = 2^{n - log_2(P_i)}$ $RS_i = {RF_i}*P_i$

Now, to route the document, we’ll first find out which shard out of initial shards the doc will get routed to using the following equation. Assume, the doc got routed to shard $p_i$ then

$p_i = \frac{{hash\_value}\%{RS}}{RF}$

Since, shard $p_i$ has been split into $P_i$ shards, we’ll use the new routing parameters to calculate the actual shard($p_ij$) to which the doc will gets routed

$p_{ij} = \frac{hash\_value\%RS_i}{RF_i}$

This pattern will continue for further splits. Say, the shard $p_ij$ gets split into $P_j$ shards then we’ll get the new routing parameters as:

$RF_j = 2^{floor(log_2(\frac{RF_i}{P_j}))}$ $RS_j = {RF_j}*P_j$

At runtime the doc will get routed to shard $p_{ijk}$

$p_{ijk} = \frac{hash\_value\%RS_j}{RF_j}$

Base conditions: if $RF_k==1$ for any shard $p_k$ then we won’t allow further splits for that shard Also, if some shard $p_k$ is the terminal shard i.e. it hasn’t been split then the above described routing algorithm will terminate at that shard

High level visual representation In the following figure, partitions are getting distributed evenly and we’re maintaining the routing parameters at each level.

Advantages:

Faster in cases where depth of splits is lesser than logarithmic base 2 of number of shards.

Disadvantages

Limit the number of times a shard can be split in an existing index which was created before the OS version which will support in-place shard split at 1024 i.e. default value of routing shards
Performs slow if depth of splits is higher than logarithmic base 2 of number of shards.
Quite complex to understand and maintain.

Benchmarks

Setup:

HostType: m4.2xlarge OS: Amazon Linux 2 x86_64 region: us-west-2 vCPUs: 8 Memory: 32Gb Architecture: x86_64

Parameters:

Number of seed shards: 1 Number of splits per shard: 2 Depth i.e. total number of split operations performed per seed shard - 1 : [0,10) Number of documented routed(numDocs): [0,100000] Benchmark Mode: avgt i.e. average time taken per routing operation in nanoseconds Number of threads used: 1

Approach 1 → DocRoutingBenchmark.routeDocsRecurring Approach 2 → DocRoutingBenchmark.routeDocsRange

On a high level we split a single shard repeatedly into two and compare latency degradation at different depths which increases with number of splits

Benchmark numbers for Approach 1 Vs Approach 2

|Benchmark |depth|numDocs|Mode|Score |Units| |--------------------------------------|-----|-------|----|----------|-----| |DocRoutingBenchmark.routeDocsRange |0 |1 |avgt|484.304 |ns/op| |DocRoutingBenchmark.routeDocsRange |0 |100 |avgt|6922.807 |ns/op| |DocRoutingBenchmark.routeDocsRange |0 |1000 |avgt|74009.181 |ns/op| |DocRoutingBenchmark.routeDocsRange |0 |10000 |avgt|763362.939|ns/op| |DocRoutingBenchmark.routeDocsRange |0 |100000 |avgt|8378460.97|ns/op| |DocRoutingBenchmark.routeDocsRange |1 |1 |avgt|496.923 |ns/op| |DocRoutingBenchmark.routeDocsRange |1 |100 |avgt|7432.15 |ns/op| |DocRoutingBenchmark.routeDocsRange |1 |1000 |avgt|82045.71 |ns/op| |DocRoutingBenchmark.routeDocsRange |1 |10000 |avgt|888808.969|ns/op| |DocRoutingBenchmark.routeDocsRange |1 |100000 |avgt|9252085.75|ns/op| |DocRoutingBenchmark.routeDocsRange |2 |1 |avgt|488.046 |ns/op| |DocRoutingBenchmark.routeDocsRange |2 |100 |avgt|7348.704 |ns/op| |DocRoutingBenchmark.routeDocsRange |2 |1000 |avgt|86551.268 |ns/op| |DocRoutingBenchmark.routeDocsRange |2 |10000 |avgt|919804.381|ns/op| |DocRoutingBenchmark.routeDocsRange |2 |100000 |avgt|9889578.54|ns/op| |DocRoutingBenchmark.routeDocsRange |3 |1 |avgt|488.343 |ns/op| |DocRoutingBenchmark.routeDocsRange |3 |100 |avgt|7621.063 |ns/op| |DocRoutingBenchmark.routeDocsRange |3 |1000 |avgt|95474.99 |ns/op| |DocRoutingBenchmark.routeDocsRange |3 |10000 |avgt|1025523.67|ns/op| |DocRoutingBenchmark.routeDocsRange |3 |100000 |avgt|10797462.1|ns/op| |DocRoutingBenchmark.routeDocsRange |4 |1 |avgt|494.31 |ns/op| |DocRoutingBenchmark.routeDocsRange |4 |100 |avgt|8065.714 |ns/op| |DocRoutingBenchmark.routeDocsRange |4 |1000 |avgt|103151.209|ns/op| |DocRoutingBenchmark.routeDocsRange |4 |10000 |avgt|1099497.06|ns/op| |DocRoutingBenchmark.routeDocsRange |4 |100000 |avgt|11711143.9|ns/op| |DocRoutingBenchmark.routeDocsRange |5 |1 |avgt|488.817 |ns/op| |DocRoutingBenchmark.routeDocsRange |5 |100 |avgt|8279.029 |ns/op| |DocRoutingBenchmark.routeDocsRange |5 |1000 |avgt|109389.782|ns/op| |DocRoutingBenchmark.routeDocsRange |5 |10000 |avgt|1183312.2 |ns/op| |DocRoutingBenchmark.routeDocsRange |5 |100000 |avgt|12082835.5|ns/op| |DocRoutingBenchmark.routeDocsRange |6 |1 |avgt|502.501 |ns/op| |DocRoutingBenchmark.routeDocsRange |6 |100 |avgt|8583.495 |ns/op| |DocRoutingBenchmark.routeDocsRange |6 |1000 |avgt|118236.136|ns/op| |DocRoutingBenchmark.routeDocsRange |6 |10000 |avgt|1252016.38|ns/op| |DocRoutingBenchmark.routeDocsRange |6 |100000 |avgt|13199143.9|ns/op| |DocRoutingBenchmark.routeDocsRange |7 |1 |avgt|497.333 |ns/op| |DocRoutingBenchmark.routeDocsRange |7 |100 |avgt|9107.973 |ns/op| |DocRoutingBenchmark.routeDocsRange |7 |1000 |avgt|125925.531|ns/op| |DocRoutingBenchmark.routeDocsRange |7 |10000 |avgt|1330136.39|ns/op| |DocRoutingBenchmark.routeDocsRange |7 |100000 |avgt|13872079.1|ns/op| |DocRoutingBenchmark.routeDocsRange |8 |1 |avgt|508.262 |ns/op| |DocRoutingBenchmark.routeDocsRange |8 |100 |avgt|9511.286 |ns/op| |DocRoutingBenchmark.routeDocsRange |8 |1000 |avgt|139429.952|ns/op| |DocRoutingBenchmark.routeDocsRange |8 |10000 |avgt|1453068.55|ns/op| |DocRoutingBenchmark.routeDocsRange |8 |100000 |avgt|15055794.7|ns/op| |DocRoutingBenchmark.routeDocsRange |9 |1 |avgt|510.916 |ns/op| |DocRoutingBenchmark.routeDocsRange |9 |100 |avgt|10152.875 |ns/op| |DocRoutingBenchmark.routeDocsRange |9 |1000 |avgt|150015.975|ns/op| |DocRoutingBenchmark.routeDocsRange |9 |10000 |avgt|1602086.52|ns/op| |DocRoutingBenchmark.routeDocsRange |9 |100000 |avgt|16859985.8|ns/op| |DocRoutingBenchmark.routeDocsRecurring|0 |1 |avgt|481.096 |ns/op| |DocRoutingBenchmark.routeDocsRecurring|0 |100 |avgt|6915.725 |ns/op| |DocRoutingBenchmark.routeDocsRecurring|0 |1000 |avgt|71412.345 |ns/op| |DocRoutingBenchmark.routeDocsRecurring|0 |10000 |avgt|726282.631|ns/op| |DocRoutingBenchmark.routeDocsRecurring|0 |100000 |avgt|8129245.07|ns/op| |DocRoutingBenchmark.routeDocsRecurring|1 |1 |avgt|504.045 |ns/op| |DocRoutingBenchmark.routeDocsRecurring|1 |100 |avgt|9180.867 |ns/op| |DocRoutingBenchmark.routeDocsRecurring|1 |1000 |avgt|93815.905 |ns/op| |DocRoutingBenchmark.routeDocsRecurring|1 |10000 |avgt|973699.552|ns/op| |DocRoutingBenchmark.routeDocsRecurring|1 |100000 |avgt|10224791.9|ns/op| |DocRoutingBenchmark.routeDocsRecurring|2 |1 |avgt|530.676 |ns/op| |DocRoutingBenchmark.routeDocsRecurring|2 |100 |avgt|11601.343 |ns/op| |DocRoutingBenchmark.routeDocsRecurring|2 |1000 |avgt|117090.87 |ns/op| |DocRoutingBenchmark.routeDocsRecurring|2 |10000 |avgt|1199006.97|ns/op| |DocRoutingBenchmark.routeDocsRecurring|2 |100000 |avgt|12630158.5|ns/op| |DocRoutingBenchmark.routeDocsRecurring|3 |1 |avgt|554.57 |ns/op| |DocRoutingBenchmark.routeDocsRecurring|3 |100 |avgt|14004.462 |ns/op| |DocRoutingBenchmark.routeDocsRecurring|3 |1000 |avgt|141144.605|ns/op| |DocRoutingBenchmark.routeDocsRecurring|3 |10000 |avgt|1433050.55|ns/op| |DocRoutingBenchmark.routeDocsRecurring|3 |100000 |avgt|15022877.8|ns/op| |DocRoutingBenchmark.routeDocsRecurring|4 |1 |avgt|585.075 |ns/op| |DocRoutingBenchmark.routeDocsRecurring|4 |100 |avgt|16373.323 |ns/op| |DocRoutingBenchmark.routeDocsRecurring|4 |1000 |avgt|165582.599|ns/op| |DocRoutingBenchmark.routeDocsRecurring|4 |10000 |avgt|1682310.52|ns/op| |DocRoutingBenchmark.routeDocsRecurring|4 |100000 |avgt|17376692 |ns/op| |DocRoutingBenchmark.routeDocsRecurring|5 |1 |avgt|602.198 |ns/op| |DocRoutingBenchmark.routeDocsRecurring|5 |100 |avgt|18851.1 |ns/op| |DocRoutingBenchmark.routeDocsRecurring|5 |1000 |avgt|191034.424|ns/op| |DocRoutingBenchmark.routeDocsRecurring|5 |10000 |avgt|1932252.88|ns/op| |DocRoutingBenchmark.routeDocsRecurring|5 |100000 |avgt|19961205.4|ns/op| |DocRoutingBenchmark.routeDocsRecurring|6 |1 |avgt|629.351 |ns/op| |DocRoutingBenchmark.routeDocsRecurring|6 |100 |avgt|21364.155 |ns/op| |DocRoutingBenchmark.routeDocsRecurring|6 |1000 |avgt|215418.63 |ns/op| |DocRoutingBenchmark.routeDocsRecurring|6 |10000 |avgt|2187030.87|ns/op| |DocRoutingBenchmark.routeDocsRecurring|6 |100000 |avgt|22456628.9|ns/op| |DocRoutingBenchmark.routeDocsRecurring|7 |1 |avgt|650.714 |ns/op| |DocRoutingBenchmark.routeDocsRecurring|7 |100 |avgt|23930.799 |ns/op| |DocRoutingBenchmark.routeDocsRecurring|7 |1000 |avgt|241918.217|ns/op| |DocRoutingBenchmark.routeDocsRecurring|7 |10000 |avgt|2421512.91|ns/op| |DocRoutingBenchmark.routeDocsRecurring|7 |100000 |avgt|24830535.8|ns/op| |DocRoutingBenchmark.routeDocsRecurring|8 |1 |avgt|685.33 |ns/op| |DocRoutingBenchmark.routeDocsRecurring|8 |100 |avgt|26572.623 |ns/op| |DocRoutingBenchmark.routeDocsRecurring|8 |1000 |avgt|268137.903|ns/op| |DocRoutingBenchmark.routeDocsRecurring|8 |10000 |avgt|2706184.6 |ns/op| |DocRoutingBenchmark.routeDocsRecurring|8 |100000 |avgt|27532240.3|ns/op| |DocRoutingBenchmark.routeDocsRecurring|9 |1 |avgt|705.721 |ns/op| |DocRoutingBenchmark.routeDocsRecurring|9 |100 |avgt|29392.19 |ns/op| |DocRoutingBenchmark.routeDocsRecurring|9 |1000 |avgt|293422.558|ns/op| |DocRoutingBenchmark.routeDocsRecurring|9 |10000 |avgt|2952632.51|ns/op| |DocRoutingBenchmark.routeDocsRecurring|9 |100000 |avgt|30346605.2|ns/op|

Is hash space distribution uniform for child shards?

To assess the quality of distribution of keys among shards we define the hash distribution quality as

\sum_{j=0}^{m-1} \frac{p_j\left(p_j+1\right) / 2}{(n / 2 m)(n+2 m-1)}

where $p_j$ is the number of doc ids in j-th primary shard, m is the number of shards, and n is the total number of doc ids. The sum of $p_j(p_j + 1) / 2$ estimates the number of shards that needs to be visited to find a specific doc id. The denominator (n / 2m)(n + 2m − 1) is the number of visited slots for an ideal function that puts each doc id into a random shard. So, if the function is ideal, the formula should give 1. In reality, a good distribution is somewhere between 0.95 and 1.05.

Simple benchmark

Baseline: 5 primary shards with no split, 100k keys using existing opensearch routing Candidate 1: 5 primary shard split by a factor of 2, 100k keys using approach 1 Candidate 2: 5 primary shard split by a factor of 2, 100k keys using approach 2 hash quality of baseline: 1.02 hash quality of candidate 1: 1.03 hash quality of candidate 2: 1.03

All possible combinations of splitting of n=1 to n=2^6 shards showed that the average hash quality across all combination with 10k keys was 1.13 with approach 1 and 1.12 with approach 2 Hence, keyspace is getting distributed uniformly across the split shards with both the approaches!

Related component

Indexing:Replication

Thanks @vikasvb90 for all the help in refining the approaches as well as testing and benchmarking it

opensearch-project / OpenSearch

[RFC] Design Proposal routing algorithm for splitting shards In-place #13925