vikasvb90 opened 6 months ago
Great proposal @vikasvb90!
Impact on search latencies: Long-lived impact of increased search latencies during concurrent execution of id-based operations, because all id-based requests are routed to all shards instead of a single shard. As the number of shards grows, latencies increase drastically: with 2 shards, search latencies are 20% higher while concurrent id-based ops are executing; with 4 shards, 75% higher; with 8 shards, 200%; and with 16 shards, 150%.
Curious, with a higher number of shards we see better search performance because of concurrency, so how come it is different here?
Handling of Lucene's 2 billion doc limit breach: Since existing shards are not mutated, no further upserts can happen once a shard reaches the 2 billion doc count limit (including deleted docs).
This is per segment level, right? Not per shard.
On a high level, both of the approaches will re-index the underlying Lucene data structure. How is it different from doing the below?
Now in step 3, we need some modification in the re-index API to keep newer segments non-searchable until re-indexing finishes. Probably we can just copy the segments, modifying their segment info with the new shard details; doing it this way, we can avoid re-indexing the data as well.
Curious, with a higher number of shards we see better search performance because of concurrency, so how come it is different here?
In the shard expansion approach, since document ids are not visible to customers, search requests are routed to all shards instead of just one shard, because the coordinator won't be able to determine the right shard based on the doc id. Hence there is significantly more IO and compute, leading to higher search latencies.
This is per segment level, right? Not per shard.
No, this limit is imposed at the Lucene index level, i.e. an index shard in OpenSearch terms.
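For reference, a minimal sketch of the limit being discussed: Lucene caps the document count per Lucene index at `IndexWriter.MAX_DOCS`, which is `Integer.MAX_VALUE - 128` in current Lucene versions. Shown here without the Lucene dependency:

```java
public class LuceneDocLimit {
    // Mirrors Lucene's IndexWriter.MAX_DOCS constant. The limit applies
    // per Lucene index (one OpenSearch shard) and counts docs that are
    // deleted but not yet merged away.
    static final int MAX_DOCS = Integer.MAX_VALUE - 128;

    public static void main(String[] args) {
        System.out.println(MAX_DOCS); // 2147483519, roughly 2.147 billion
    }
}
```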
On a high level, both of the approaches will re-index the underlying Lucene data structure.
Not true. Re-indexing isn't part of any of the proposed options above. In the in-place physical split, no re-indexing is done; as I mentioned in the section "High level View of Split process", delete-by-query is executed on segments of the latest commit of the shard. And in shard expansion, an empty shard is brought up in the same shard key space and the old shard's data isn't mutated.
Will it be helpful to see if segments can be grouped on the routing partition which a shard caters to? E.g., if a shard caters to [0, 128), we can have 4 groups, {[0,32), [32,64), [64,96), [96,128)}, which would make the first log(num_groups) splits (i.e. 2 in this example) very efficient with respect to handling deletions of documents after a split, since the split can purge entire segments. New segments created within a child shard can continue to apply this invariant with respect to the shard's routing range, ensuring that splits remain optimized for at least half the data going forward.
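A minimal sketch of the grouping idea above (the helper name and fixed group count are illustrative, not an existing OpenSearch API): map a routing hash within the shard's range to one of a fixed number of contiguous sub-ranges, so that every doc in a segment group falls on the same side of a future split boundary.

```java
public class SegmentGrouping {
    // Hypothetical helper: given a shard covering [rangeStart, rangeEnd)
    // divided into numGroups contiguous sub-ranges, return the group
    // index for a routing hash. If segments are kept per-group, a split
    // at a group boundary can move whole segments instead of deleting
    // individual docs.
    static int groupFor(int hash, int rangeStart, int rangeEnd, int numGroups) {
        int width = (rangeEnd - rangeStart) / numGroups;
        return (hash - rangeStart) / width;
    }

    public static void main(String[] args) {
        // Shard caters to [0, 128) with 4 groups of width 32.
        System.out.println(groupFor(0, 0, 128, 4));   // group 0
        System.out.println(groupFor(33, 0, 128, 4));  // group 1
        System.out.println(groupFor(127, 0, 128, 4)); // group 3
    }
}
```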
@RS146BIJAY has an RFC around grouping the data in segments (https://github.com/opensearch-project/OpenSearch/issues/13183) which we are discussing, and I was thinking about how applicable it may be for such scenarios (though it seems aimed at optimizing data storage, it could support splits).
@mgodwan I agree. The ability to group segments based on partitions would be quite helpful, as it would significantly minimize the impact a split suffers from deleted documents. Due to that impact, a split operation may otherwise need to be planned; with this feature, a split could even be carried out based on signals, and dynamic scaling solutions could be built on top.
Forking in-place on a live index sounds extremely challenging if you want to do it with a single IndexWriter. Essentially, you would need to maintain two distinct index commit histories, which Lucene makes very difficult. (I initially tried a forked merge history for merge-on-commit.) Would the idea be to start a new smaller shard (or pair of smaller shards) and write updates to both the "big" shard and the smaller shard, then copy over segments from the big shard? (Using the addIndexes API on the small shard with a FilterCodec that hides the "other" documents could copy things over pretty quickly and avoid producing a bunch of deletes that need to get merged away.)
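The filtering such a codec would perform can be sketched without Lucene. This is a plain-Java stand-in for the live-docs view a filtering reader would expose during addIndexes, not Lucene's actual FilterCodecReader API: only documents whose routing hash falls inside the child shard's range are marked live, so the copy skips the rest instead of producing deletes.

```java
import java.util.BitSet;

public class ChildShardLiveDocs {
    // Hypothetical stand-in for the live-docs bitset a filtering codec
    // reader would expose: bit i is set iff doc i's routing hash falls
    // in the child shard's slice [childStart, childEnd) of the parent's
    // hash range. Docs with an unset bit appear "deleted" to addIndexes.
    static BitSet liveDocs(int[] docHashes, int childStart, int childEnd) {
        BitSet live = new BitSet(docHashes.length);
        for (int i = 0; i < docHashes.length; i++) {
            if (docHashes[i] >= childStart && docHashes[i] < childEnd) {
                live.set(i);
            }
        }
        return live;
    }

    public static void main(String[] args) {
        // Parent covers [0, 128); this child takes the lower half [0, 64).
        int[] hashes = {5, 70, 63, 100};
        System.out.println(liveDocs(hashes, 0, 64)); // {0, 2}
    }
}
```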
The approach that I've implemented in the past (12 years ago, we built a Lucene-based search service internal to Amazon that supported live resharding) essentially involved an upstream transaction log (conceptually similar to Kafka or Kinesis). We would spin up two smaller shards (each covering half of the hash range of the earlier shard) that would restore a snapshot of the big shard, each delete half the docs, then they'd get caught up on the "live" updates by reading from the translog.
@msfroh Thanks for the comment! The idea is not to use the same IndexWriter; it is actually along the same lines as the approach you used for live resharding. Instead of a restore, segments of the source shard will be hard-linked into the directories of the child shards, respective IndexWriters will be opened, documents not belonging to each child shard will be deleted (to be merged away later), and then we will perform a recovery based on the peer-recovery flow to bring the child shards in sync with the parent. A shard split mechanism is already present in OpenSearch: ShardSplittingQuery. Most of the work will be around online recovery of the split shards, i.e. while handling live traffic.
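The hard-link step described above can be sketched with plain java.nio (directory layout and file names here are illustrative; the real flow wires this into Lucene directories and the recovery machinery):

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class HardlinkSegments {
    // Hypothetical helper: hard-link every file of the parent shard
    // directory into a child shard directory. Hard links make the
    // child's starting copy near-instant and cost no extra storage;
    // the child's own IndexWriter then deletes docs outside its
    // hash range, to be merged away later.
    static void linkSegments(Path parentDir, Path childDir) throws IOException {
        Files.createDirectories(childDir);
        try (DirectoryStream<Path> files = Files.newDirectoryStream(parentDir)) {
            for (Path f : files) {
                Files.createLink(childDir.resolve(f.getFileName()), f);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path parent = Files.createTempDirectory("parent-shard");
        Files.writeString(parent.resolve("_0.cfs"), "segment-bytes");
        Path child = parent.resolveSibling(parent.getFileName() + "-child0");
        linkSegments(parent, child);
        System.out.println(Files.exists(child.resolve("_0.cfs"))); // true
    }
}
```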
I am almost done with the design and am going to publish it in a day or two. I look forward to hearing your thoughts on it.
Problem Statement
In search use cases where the entire data is supposed to be hot, i.e. available for search throughout its lifetime, it is hard to predict the number of primary shards in advance. As a result, the OpenSearch cluster can get into a hot-shard or large-shard problem, where the storage capacity of the node hosting a shard becomes insufficient, or Lucene's hard limit on the number of docs in a Lucene index kicks in. This generally happens in use cases where custom routing keys are used. Today, there are two options available for users to get out of this problem: reindex and the _split API. With reindexing, the entire data of the index is reindexed into a new index having a greater number of primaries, which is known to be a very slow process and requires a huge amount of additional compute and IO on top of regular search and indexing operations. With the _split API, the index is first marked as read-only and then a split of all shards of the index is performed, which causes write downtime for users. Also, the _split API doesn't provide the granularity of splitting at the shard level, so a single hot shard cannot be scaled out independently. In addition, free disk space equal to the entire size of the index is required until the old index is deleted. Hence, both approaches have their own limitations. This proposal looks at scaling of shards holistically and proposes options to address the problems above.
Requirements
Functional
Non Functional
Proposals
Note: Options below are just proposals and separate issue would be shared with the detailed high level design.
Option 1 - Shard Key Space Expansion
Each shard in OpenSearch is responsible for serving documents whose routing ids hash to the shard's key space. In this approach, once a threshold set against a shard is breached, a new writable empty shard is created and the old shard of the same key space is made read/update/delete-only. New documents get ingested into the new shard, which can be hosted on the same or a different node as per the allocation policy. Searches for the shard's key space are routed to all shards present in that key space. This also includes get-by-id, update-by-id and delete-by-id, for which search queries need to be executed first to find the docs belonging to one or more shards in the key space, followed by the respective id-based operation. Since the shard skew problem is generally caused by custom doc ids, bulk calls containing custom doc ids would translate to a search followed by a bulk, because doc ids first need to be fetched to distinguish each operation as an update or an insert. To accommodate updates or deletes on data present in the old shard, additional buffer storage is reserved for deletion and merges of any or all docs of the shard.
Pros
Cons
[Preferred] Option 2 - In-Place Physical Shard Split
In this approach, shard data is split into child shards depending on the child hash ranges. There are two parts to this design:
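As a rough sketch of the hash-range arithmetic involved (the helper and the even two-way split are illustrative assumptions, not the final design), each child shard takes a contiguous slice of the parent's hash range:

```java
public class HashRangeSplit {
    // Hypothetical helper: split a parent range [start, end) into
    // `children` contiguous child ranges. A doc stays in the child
    // whose range contains its routing hash; in the other children's
    // hard-linked copies it is deleted and later merged away.
    static int[][] splitRange(int start, int end, int children) {
        int width = (end - start) / children;
        int[][] ranges = new int[children][2];
        for (int i = 0; i < children; i++) {
            ranges[i][0] = start + i * width;
            ranges[i][1] = (i == children - 1) ? end : start + (i + 1) * width;
        }
        return ranges;
    }

    public static void main(String[] args) {
        // Split a parent covering [0, 128) into two children.
        for (int[] r : splitRange(0, 128, 2)) {
            System.out.println("[" + r[0] + ", " + r[1] + ")");
        }
        // prints [0, 64) then [64, 128)
    }
}
```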
Pros
Cons
Offline _split API vs In-Place physical shard split
ShardSplittingQuery and build an in-place recovery approach on top to recover child shards against the source shard.
In-place Split API
We have 2 options for this:
_split API with more options
The preference would be to go with Option 2 and create a new API. They are inherently different: the current _split creates a new index, whereas the new API will perform an in-place split of a shard of an index. So, keeping them under a single API would be confusing. Please share your opinion on the preferred option for the in-place split API.
POC conducted for in-place physical shard split
Following are some of the observations based on the benchmark carried out:
How Can You Help?
Any general comments about the overall direction are welcome. At this point we believe that option 2 fits more into solving the listed problems. Some specific questions:
Next Steps
We will incorporate the feedback from this RFC into a more detailed proposal/high level design. We will then create meta issues to go in more depth on the components involved to continue the detailed design.
Appendix
Latencies and throughput of In-Place split
Latencies and throughput of new child shards created by an in-place split will eventually be on par with those of a new index. Because the split performs a delete-by-query in the background to remove documents from child shards which do not belong to them, the number of deleted documents on those shards rises, and an impact on search latency, and hence on throughput, can be seen. Once the deleted documents are merged away by the tiered merge policy, search latencies return to normal.
Routing Logic
In OpenSearch, routing of a document having a custom id happens via a murmur3 hash function, which generates a 32-bit hash that is then used to compute the shard id using the below logic:
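The original snippet is not reproduced above. As a hedged stand-in (the hash stage is elided, and the real OpenSearch OperationRouting additionally involves a routing factor for the number-of-routing-shards scheme, which may differ from this simplification), the shard-id step looks roughly like:

```java
public class RoutingSketch {
    // Simplified stand-in: OpenSearch hashes the routing value / _id
    // with murmur3 (not reimplemented here) and maps the resulting
    // 32-bit hash onto a shard. Math.floorMod keeps the result
    // non-negative even for negative hashes, unlike the % operator.
    static int shardId(int murmur3Hash, int numShards) {
        return Math.floorMod(murmur3Hash, numShards);
    }

    public static void main(String[] args) {
        System.out.println(shardId(42, 24));          // 18
        System.out.println(shardId(-2147483000, 24)); // non-negative
    }
}
```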
Benchmark Setup
Used 3 r5.2xlarge instances.
Index with 24 shards and 1 replica. Number of clients used = 8.
Downloaded a 1.4 GB document JSON file of the http_logs workload consisting of 11,961,342 docs.
Created a custom workload with data of the following type, which had a custom_id field that was later used to perform update_by_query benchmarks. The value of the custom_id field was kept the same as the _id field. During workload creation, the value of this field was set to 1 as follows:
Executed a custom update_by_id workload, having entries as follows, to update the custom_id field.
The custom workload to update the custom_id field was only executed for 1000001 docs. This was done to prevent OSB eventually running into an OOM error, which happens when it reaches a certain threshold of requests. This is probably due to a build-up of a large number of stats, proportional to the number of requests, in memory.
At the end of the run, the count of documents where custom_id wasn't equal to 1 was 1000001, as we intended.
Deleted remaining docs by performing a delete_by_query on docs where custom_id was not 1.
As a result, an index was produced consisting of docs of the following type, where the _id field could be used to benchmark the update_by_id use case and the custom_id field could be used to benchmark the update_by_query use case.
Benchmark Results
Impact on search latencies due to the overhead of deleted docs after split
Cluster configuration: 3 m6g.large (2 vCPUs) nodes in a non-dedicated setup, having 500 GB disk space each. This benchmark uses OpenSearch's existing offline index split to measure the impact on search.
Benchmark before split
Benchmark after split
Thanks a lot @shwetathareja for the valuable insights!