opensearch-project / OpenSearch

🔎 Open source distributed and RESTful search engine.
https://opensearch.org/docs/latest/opensearch/index/
Apache License 2.0

[RFC] An Adaptive merge policy with better Write and Space amplification #10340

Open RS146BIJAY opened 11 months ago

RS146BIJAY commented 11 months ago

Problem

Currently OpenSearch uses an implementation of the Tiered merge policy for compaction. The Tiered merge policy is suited for write-intensive workloads, as it has lower write amplification. A disadvantage of the Tiered merge policy is that its read and space amplification are high. Merging a set of segments requires up to 3 times the total size of the segments being merged (since compound files are enabled), increasing space amplification. Also, obsolete entries (created by updates and deletions) are removed from segments more slowly than with the Leveled merge policy, increasing read amplification.

What are other alternatives

Leveled Merge Policy

The Leveled merge policy is used for read-intensive workloads. In leveled compaction, data on disk is likewise organized into multiple levels. The size of each level is maintained at T times that of the previous level (generally T = 10). A special level 0 (L0 for short) contains files just flushed from the in-memory write buffer (memtable). All non-zero levels have target sizes, and these size targets usually increase exponentially. Inside each level (except level 0), data is range partitioned into multiple SST files.


A segment merge is triggered when the data size of a level reaches its upper limit. In that case, the segments at the current level are merged with the segments of the next level, and the resulting segments are placed in the next level. Since merging at the current level happens every time the previous level fills up and pushes a new set of segments down, read amplification (there are fewer obsolete entries) and space amplification (only a small set of segments is pushed and merged from the previous level) are lower than with the Tiered merge policy.

The issue with this merge policy is that its write amplification is very high, which is why it is suited for read-intensive workloads. A minimal sketch of the merge trigger is shown below.
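
To make the trigger concrete, here is an illustrative sketch (plain Java, not Lucene/OpenSearch API) of the leveled-compaction rule described above; the base level size and T are assumed example values.

```java
// Illustrative sketch of the leveled-compaction trigger described above.
// Not Lucene/OpenSearch code; BASE_LEVEL_BYTES and T are assumed example values.
final class LeveledCompactionSketch {
    static final long BASE_LEVEL_BYTES = 256L * 1024 * 1024; // assumed target size for L1
    static final int T = 10;                                 // size ratio between adjacent levels

    // Target size of level i (i >= 1): T times the target of the previous level.
    static long targetBytes(int level) {
        return BASE_LEVEL_BYTES * (long) Math.pow(T, level - 1);
    }

    // A level is merged into level + 1 once its total data size reaches its target.
    static boolean shouldMergeIntoNextLevel(int level, long levelSizeBytes) {
        return levelSizeBytes >= targetBytes(level);
    }
}
```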

Leveled Merge Policy with lazy leveling

The Leveled merge policy with lazy leveling improves the write amplification of leveled merging by eliminating merging at all but the largest level. Lazy leveling at its core is a hybrid of leveling and tiering: it applies leveling at the largest level and tiering at all other levels.

[Figure: Lazy Leveling]

In order to keep the cost complexity of point lookups fixed despite having more runs to probe at the smaller levels, the Bloom filter false positive rate (FPR) at the smaller levels is kept lower (better) than the FPR at the largest level. Therefore, relative to leveling, lazy leveling (1) improves the cost complexity of updates, (2) maintains the same complexity for point lookups, long range lookups, and space amplification, and (3) provides competitive performance for short range lookups.
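
As a rough illustration of that FPR idea (this is not the exact allocation from the paper), one could assign each smaller level a false positive rate a factor of T lower than the next larger level, so the expected number of false positives summed over all runs stays within a constant factor of the largest level's FPR. L, T and largestLevelFpr below are assumed example inputs.

```java
// Illustrative-only sketch of a Monkey-style Bloom filter allocation: smaller
// levels get exponentially lower FPRs than the largest level L. Not the exact
// formula from the paper.
final class BloomFprSketch {
    static double[] allocateFprs(int L, int T, double largestLevelFpr) {
        double[] fpr = new double[L + 1];                  // fpr[i] holds the FPR of level i (1-indexed)
        for (int i = L; i >= 1; i--) {
            fpr[i] = largestLevelFpr / Math.pow(T, L - i); // each smaller level is T times stricter
        }
        return fpr;
    }
}
```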

Worst case cost comparisons between merge policies for different operations

| Operation | Tiered compaction worst-case cost | Leveled compaction worst-case cost | Lazy leveling compaction worst-case cost |
| --- | --- | --- | --- |
| Update | Worst-case workload: all updates target entries at the largest level. Number of merge operations per entry is O(L / B). | Worst-case workload: all updates target entries at the largest level. Number of merge operations per entry is O((T · L) / B). | Worst-case workload: all updates target entries at the largest level. Number of merge operations per entry is O((T + L) / B). |
| Point lookups | Worst case happens on zero-result point lookups. Expected point lookup cost is O(e^(−M/N) · L · T). | Worst case happens on zero-result point lookups. Expected point lookup cost is O(e^(−M/N) · L). | Expected zero-result point lookup cost is e^(−(M/N) · ln(2)²) · T^(T/(T−1)) / (T−1)^((T−1)/T). |
| Space amplification | Worst-case space amplification occurs when entries at Levels 1 to L − 1 are all updates to different entries at Level L. Space amplification is O(T). | Worst-case space amplification occurs when entries at Levels 1 to L − 1 are all updates to different entries at Level L. Space amplification is O(1/T). | Worst-case space amplification occurs when entries at Levels 1 to L − 1 are all updates to different entries at Level L. Space amplification is O(1/T). |
| Range lookups | Short range lookup cost is O(T · L). Long range cost is O((T · s) / B). | Short range lookup cost is O(L). Long range cost is O(s / B). | Short range lookup cost is O((L − 1) · T). Long range lookup cost is O(s / B). |

where:

- L = number of levels
- T = size ratio between adjacent levels
- M = main memory allocated to the Bloom filters
- N = total number of entries
- s = selectivity of a range lookup
- B = number of entries that fit into a storage block
- P = size of the buffer in storage blocks
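
A quick back-of-the-envelope way to read the table above (constants dropped, purely illustrative; T, L, B and the Bloom-filter term are assumed example values, and the lazy-leveling lookup expression is the closed form from the table):

```java
// Rough numeric comparison of the worst-case costs in the table above for one
// illustrative configuration. Constants are dropped and the Bloom-filter term
// e^(-M/N) is treated as a single assumed factor, so only the relative ordering matters.
public class MergeCostComparison {
    public static void main(String[] args) {
        double T = 10, L = 5, B = 100, bloom = 0.01; // assumed: size ratio, levels, entries/block, e^(-M/N)

        // Update: amortized merge operations per entry
        System.out.printf("update:  tiered=%.3f  leveled=%.3f  lazy=%.3f%n",
                L / B, (T * L) / B, (T + L) / B);

        // Zero-result point lookup: expected runs probed after Bloom filters
        double lazyLookup = bloom * Math.pow(T, T / (T - 1)) / Math.pow(T - 1, (T - 1) / T);
        System.out.printf("lookup:  tiered=%.3f  leveled=%.3f  lazy=%.3f%n",
                bloom * L * T, bloom * L, lazyLookup);

        // Short range lookup: number of runs that must be accessed
        System.out.printf("range:   tiered=%.1f  leveled=%.1f  lazy=%.1f%n",
                T * L, L, (L - 1) * T);
    }
}
```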

Analysis summary

As can be inferred from the above analysis, no single merge policy dominates the others universally. Lazy leveling is best for combined workloads consisting of updates, point lookups, and long range lookups, whereas tiering and leveling are best for workloads comprising mostly updates or mostly lookups, respectively.

Proposal - Fluid LSM merge policy

Fluid LSM is an adaptive merge policy that generalizes all of the above merge policies (Tiered, Leveled, and Lazy leveling). It enables switching between these merge behaviors as per the workload. It does this by controlling the frequency of merge operations separately for the largest level and for all other levels.

[Figure: Fluid LSM structure]

In a Fluid LSM tree, there are at most Z runs at the largest level and at most K runs at each of the smaller levels. To maintain these bounds, every Level i has an active run into which incoming runs from Level i − 1 are merged. Each active run has a size threshold with respect to the capacity of its level (the totals come out to T): T/K percent for Levels 1 to L − 1 and T/Z percent for Level L. When an active run reaches this threshold, a new active run is started at that level. Ultimately, when a level is at capacity, all runs in it get merged and flushed down to the next level.

The bounds K and Z are used as tuning parameters that enable Fluid LSM-tree to assume the behaviors of different merge policies.

The Fluid LSM-tree can transition from lazy leveling to tiering by merging less frequently at the largest level (increasing Z), or it can transition to leveling by merging more frequently at all other levels (decreasing K).
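
For reference, a minimal sketch (plain Java; class and method names are illustrative, not an OpenSearch API) of how the K and Z bounds map onto the policies discussed above: K = Z = 1 gives leveling, K = Z = T − 1 gives tiering, and K = T − 1 with Z = 1 gives lazy leveling.

```java
// Minimal sketch of the Fluid LSM tuning knobs described above. Purely
// illustrative; names are hypothetical, not OpenSearch/Lucene API.
public final class FluidLsmConfig {
    final int T; // size ratio between adjacent levels
    final int K; // max runs at each of Levels 1 .. L-1
    final int Z; // max runs at the largest level L

    FluidLsmConfig(int T, int K, int Z) { this.T = T; this.K = K; this.Z = Z; }

    static FluidLsmConfig leveling(int T)     { return new FluidLsmConfig(T, 1, 1); }
    static FluidLsmConfig tiering(int T)      { return new FluidLsmConfig(T, T - 1, T - 1); }
    static FluidLsmConfig lazyLeveling(int T) { return new FluidLsmConfig(T, T - 1, 1); }
}
```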

Further steps

  1. Create a POC to analyze how much performance improvement we can get using Fluid LSM. Planning to port the Leveled merge policy to OpenSearch to do this analysis.
  2. Explore whether we can further extend this merge policy to adapt to system state (decrease merge frequency when CPU/JVM usage is high, and vice versa).

Reference Paper

https://scholar.harvard.edu/files/stratos/files/dostoevskykv.pdf

RS146BIJAY commented 11 months ago

Would appreciate feedback from @shwetathareja, @nknize, @Bukhtawar, @sachinpkale, @muralikpbhat, @reta and @backslasht as well, so tagging more folks for visibility. Thanks!

nknize commented 11 months ago

...data is range partitioned into multiple SST files.

This is copy/paste from RocksDB? I'm assuming you mean multiple lucene "segments" instead of SST files?

...data on disk are organized in multiple levels.

Am I tracking correctly that Lucene segments at Li are only merged with segments at Li+1 once the cumulative data size of Li reaches its upper limit?

nknize commented 11 months ago

It enables switching between different merge policies as per the workload.

I really like the idea of an adaptive merge policy! It falls in the category of "auto tuning" which is especially nice for serverless.

The concern I have here is introducing the idea of levels. TieredMergePolicy is (by design) a greedy one that tries to fill up segments to their max size (5 GB). To this end it can merge non-adjacent segments in an overall attempt to minimize the total number of segments. This was chosen by design since more segments typically result in higher search latencies. Perhaps concurrent search may help reduce these latencies (and thus lessen these concerns), but it seems to me that introducing levels moves back to a design more similar to LogByteSizeMergePolicy where only those segments within two adjacent levels are candidates for merge. It seems that design means smaller segments (e.g., recently flushed segments in L0) will not be considered for merge with larger ones, thus those optimizations are lost?

@mikemccand wrote the book on these policies so I'm curious what he thinks.

Bukhtawar commented 11 months ago

Thanks @RS146BIJAY +1 on adaptive merge policy. I think the overall proposal is a step in the right direction. It would be good to see some quick PoC and numbers around the algorithm changes and corresponding workloads that get to benefit from it.

gashutos commented 11 months ago

Nice, this looks promising for workloads where updates/deletes are very frequent. Thanks @RS146BIJAY. By default, a segment can have at most 50% deleted documents! The index.merge.policy.max_merged_segment setting has a 50% value. With 50% deleted documents, this is a huge overhead on search operations, since decoding from postings has to happen even for deleted documents and they occupy additional memory. As @Bukhtawar mentioned, let's measure the maximum opportunity we have here: compare a segment with 50% updated documents against a similar segment with 0 deleted documents (a similar number of live docs). That will tell us the maximum benefit the entire effort can give us. I would start with search latency numbers for comparison. We can look later at the trade-off of additional segment merge operations due to the suggested fluid merge policy.

Also, it looks like the primary motive of the proposal is to reclaim deleted documents as aggressively as we can by tuning merges. TieredMergePolicy already provides the index.merge.policy.reclaim_deletes_weight setting, which gives more weight to segments with more deleted documents. We might want to tune TieredMergePolicy itself w.r.t. the current deleted-document statistics and dynamically change this setting to reclaim deleted documents.
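
For example, a hedged sketch of what such tuning could look like at the Lucene level (the concrete values are illustrative only, not recommendations; OpenSearch exposes these knobs through the index.merge.policy.* settings):

```java
import org.apache.lucene.index.TieredMergePolicy;

// Sketch of tuning Lucene's TieredMergePolicy to reclaim deletes more aggressively.
// Values are illustrative, not recommendations.
public class TunedTieredMergePolicyExample {
    public static TieredMergePolicy build() {
        TieredMergePolicy tmp = new TieredMergePolicy();
        tmp.setDeletesPctAllowed(10.0);      // default is 20; recent Lucene allows values down to 5,
                                             // older versions enforce a higher floor
        tmp.setMaxMergedSegmentMB(5 * 1024); // keep the usual 5 GB max merged segment size
        tmp.setSegmentsPerTier(10.0);        // default tier width
        return tmp;
    }
}
```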

backslasht commented 11 months ago

Thanks @RS146BIJAY for the proposal.

It would be great if you can share the gains/improvements that you are observing when using the Leveled Merge Policy.

RS146BIJAY commented 11 months ago

The results below show how obsolete entries degrade search performance, by comparing search latency with no obsolete entries vs. search latency with 20% obsolete entries (the maximum percentage of obsolete entries OpenSearch allows by default). We will use these numbers as a baseline for evaluating our Leveled merge policy implementation. The idea is that the new merge policy can merge aggressively to keep obsolete entries low when search traffic is high, so the maximum gain we can get from the new merge policy is bounded by the performance degradation the current merge policy shows due to obsolete entries.

We used nyc_taxis track for the below run:

| Aggregation | Metric | Latency with 0% obsolete entries | Latency with 20% obsolete entries | % degradation in latency |
| --- | --- | --- | --- | --- |
| default | 50th percentile latency | 19.94 | 20.132 | 0.96289 |
| default | 90th percentile latency | 20.828 | 22.258 | 6.86576 |
| default | 99th percentile latency | 22.26 | 27.882 | 25.25606 |
| default | 100th percentile latency | 22.636 | 29.386 | 29.81976 |
| range | 50th percentile latency | 181.674 | 218.144 | 20.07442 |
| range | 90th percentile latency | 184.286 | 222.59 | 20.78508 |
| range | 99th percentile latency | 189.898 | 231.756 | 22.04236 |
| range | 100th percentile latency | 194.306 | 236.372 | 21.64936 |
| autohisto_agg | 50th percentile latency | 527.08 | 586.344 | 11.24383 |
| autohisto_agg | 90th percentile latency | 528.95 | 589.13 | 11.37726 |
| autohisto_agg | 99th percentile latency | 534.092 | 596.522 | 11.689 |
| autohisto_agg | 100th percentile latency | 539.318 | 601.18 | 11.47041 |
| date_histogram_agg | 50th percentile latency | 532.018 | 607.402 | 14.16945 |
| date_histogram_agg | 90th percentile latency | 533.4 | 611.264 | 14.59768 |
| date_histogram_agg | 99th percentile latency | 534.862 | 615.84 | 15.13998 |
| date_histogram_agg | 100th percentile latency | 537.658 | 624.888 | 16.22407 |

nknize commented 11 months ago

The idea is new merge policy can merge aggressively to keep obsolete entry low when search traffic is high.

Are you proposing to dynamically close and re-open the IndexWriter based on search load? Or change OpenSearch's default MergePolicy with an Adaptive one that delegates to TieredMergePolicy or LogByteSizeMergePolicy and auto-tunes those policy settings based on search load? How do you plan to reproducibly benchmark those real world dynamics?

Maybe there's an opportunity to add some auto-tuning logic to OpenSearchConcurrentMergeScheduler based on load?

Also curious on answers to some of the previous questions re: TieredMergePolicy's greedy design at odds with the proposed level groups thus preventing large segments from merging with smaller ones (please correct me if I'm tracking the proposal incorrectly).

Curious to get some other folks looking at this as well so they can track and weigh in and ask additional questions. /cc @rishabh6788 @msfroh @andrross @mch2 @dreamer-89

rishabh6788 commented 11 months ago

Tagging the correct Rishabh @rishabhmaurya :)

nknize commented 11 months ago

Tagging the correct Rishabh @rishabhmaurya :)

Thank you! I think I made that mistake before 🤦🏻‍♂️ SryAboutThat!

rishabhmaurya commented 11 months ago

@RS146BIJAY Thanks for looking into ideas to improve merge policies. I think there is nothing OpenSearch-specific here, so moving the discussion to Lucene would give us more diverse opinions.

In your proposal there is a lot of terminology copied from the RocksDB paper that doesn't translate directly to Lucene. RocksDB is a key-value store, and all the query types you have mentioned (Update, PointLookup and RangeLookup) are performed very differently in RocksDB compared to Lucene, so copying the comparison here is of no value, as read amplification would be different for both of them. Also, the SST files are sorted by keys and ordered by key ranges, and RocksDB supports updates of keys in L0. These differences in the underlying system are important when designing a merge policy, in my opinion. Also, the concept of tiering in this paper seems slightly different from what we have in Lucene's TieredMergePolicy, where we have a lot of knobs to control such as max segments per tier, max segment size and merge factor.

The main difference I see between the two approaches is that RocksDB's leveled merge policy tries to merge segment(s) from a lower level into the segments of the next level; another way to put it is merging relatively small segment(s) with larger ones at the next level. I agree that this would result in cleaning up a lot of obsolete docs compared to the tiered merge policy (which always considers segments at the same level and of similar sizes for merging, for lower write amplification), but it trades off write amplification and cascading merges of segments to lower levels (the tiered merge policy is pretty good at this as it works on a budget model). Also, TieredMergePolicy prefers segments where the reclaim from deleted documents is high, so it already tries to address the problem. As mentioned earlier by @gashutos, I would work backward from the problem we are addressing: if it's update/read heavy, I would start by customizing the existing TieredMergePolicy to reclaim deleted docs more aggressively (if it's not doing so already or can't be achieved by tweaking one of the existing settings) and try to address the concern that way, instead of introducing a whole new merge policy.

I agree with @nknize that the idea of having an adaptive merge policy (or adaptive merge policy settings) depending on the workload is quite fascinating. I would love to hear the answer to the question Nick raised on how auto-tuning will translate internally.

PS: I think both discussions could be separated out: 1. Use of the Leveled merge policy / optimizing the merge policy for read/update-heavy use cases. 2. Use of an adaptive merge policy.

RS146BIJAY commented 11 months ago

Are you proposing to dynamically close and re-open the IndexWriter based on search load?

I did not entirely get this. Did you mean dynamically closing and re-opening the IndexReader, which happens during a reader refresh? Or is this about changing the underlying merge policy and closing and reopening the IndexWriter?

Or change OpenSearch's default MergePolicy with an Adaptive one that delegates to TieredMergePolicy or LogByteSizeMergePolicy and auto-tunes those policy settings based on search load?

The proposal is to have a single new Adaptive merge policy. By Adaptive I mean a single policy that can behave as both the Tiered and the Leveled merge policy (as per the workload) by controlling the frequency of merging at the first L − 1 levels and at the last level L. We control the number of segments at the first L − 1 levels with a parameter K and at the last level with a parameter Z. This Fluid merge policy will initially have the configuration of lazy leveling, with more segments at the initial levels and only one segment at the last level. Now, if read traffic is high, it will merge more aggressively at the first L − 1 levels, making it behave like the Leveled merge policy. In case write traffic goes high, it can reduce the number of merges at the last level, making it behave like the Tiered merge policy.

PS: I am describing K and Z in terms of segments here; strictly it should be runs (which are basically collections of segments). Also, the Tiered merge policy mentioned here is slightly different from the one Lucene uses, which is why I used the phrase "Lucene uses an implementation of the Tiered merge policy".

Maybe there's an opportunity to add some auto-tuning logic to OpenSearchConcurrentMergeScheduler based on load?

I think the confusion here is that by Adaptive we mean switching merge policies as per load. This is not the case. Adaptive merge policy means the same new merge policy will behave like Tiered, Leveled or Lazy leveling as per requirement.
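
To make that concrete, here is a minimal sketch of the adaptation direction described above, reusing the hypothetical FluidLsmConfig names from the sketch earlier in the issue; the read/write signal and the unit step size are assumptions.

```java
// Hypothetical adaptation step: the same policy instance stays in place and only
// its K/Z bounds move with the workload. All names and thresholds are illustrative.
final class FluidLsmAdaptationSketch {
    static FluidLsmConfig adapt(FluidLsmConfig cfg, double readRate, double writeRate) {
        if (readRate > writeRate) {
            // Read-heavy: merge more aggressively at Levels 1 .. L-1 (move toward leveling).
            return new FluidLsmConfig(cfg.T, Math.max(1, cfg.K - 1), cfg.Z);
        } else {
            // Write-heavy: merge less often at the largest level (move toward tiering).
            return new FluidLsmConfig(cfg.T, cfg.K, Math.min(cfg.T - 1, cfg.Z + 1));
        }
    }
}
```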

TieredMergePolicy's greedy design at odds with the proposed level groups thus preventing large segments from merging with smaller ones

For the above benchmark results, where we updated about 20% of the documents, we saw this issue. The 5 GB limit on the max segment size was causing larger segments (around 5 GB in size) to accumulate more obsolete entries (around 60% obsolete entries in a single segment), which started impacting search performance. Essentially, larger segments are merged less frequently or not at all, causing search performance degradation.

PS: For updating 20% of entries, we ran an update-by-query on the nyc_taxis index where vendor_id = 1, with a max doc count parameter.

nknize commented 11 months ago

Or is this about changing the underlying merge policy and closing and reopening the IndexWriter?

Merge policies and schedulers are set on the IndexWriter, which is configured through the OpenSearch Engine. I thought you were suggesting an adaptive merge scheduler approach that would use a new LevelMergePolicy, similar to OpenSearchTieredMergePolicy, that groups segments and delegates to and tunes the settings of the optimal MergePolicy based on search AND indexing workload. In this case, I agree w/ @rishabhmaurya that these conversations can be teased into separate discussion topics. It seems this proposal is mostly focused on reclaiming deletes, which has seen quite a bit of improvement over the last few years. Have a look at the simple LUCENE-10574 PR, for example. It would be good to open this discussion upstream and propose this for Lucene in general to get feedback from other committers. You can also prototype and unit test your proposal quicker using BaseMergePolicyTestCase, which does a good job of simulating some extreme and esoteric cases.
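
If it helps the discussion, here is a rough sketch (not an existing OpenSearch class) of the delegating shape described above: a single MergePolicy set on the IndexWriter that forwards to either a tiered or a log-byte-size policy, with the switch driven by some external workload signal.

```java
import java.io.IOException;
import java.util.Map;
import org.apache.lucene.index.LogByteSizeMergePolicy;
import org.apache.lucene.index.MergePolicy;
import org.apache.lucene.index.MergeTrigger;
import org.apache.lucene.index.SegmentCommitInfo;
import org.apache.lucene.index.SegmentInfos;
import org.apache.lucene.index.TieredMergePolicy;

/**
 * Hypothetical sketch of a delegating merge policy: the IndexWriter keeps a
 * single MergePolicy instance while the effective behavior can change at runtime.
 */
public class AdaptiveDelegatingMergePolicy extends MergePolicy {
    private final TieredMergePolicy tiered = new TieredMergePolicy();
    private final LogByteSizeMergePolicy leveled = new LogByteSizeMergePolicy();
    private volatile boolean readHeavy = false; // would be fed by a workload signal

    private MergePolicy delegate() {
        return readHeavy ? leveled : tiered;
    }

    @Override
    public MergeSpecification findMerges(MergeTrigger trigger, SegmentInfos infos,
                                         MergeContext ctx) throws IOException {
        return delegate().findMerges(trigger, infos, ctx);
    }

    @Override
    public MergeSpecification findForcedMerges(SegmentInfos infos, int maxSegmentCount,
            Map<SegmentCommitInfo, Boolean> segmentsToMerge, MergeContext ctx) throws IOException {
        return delegate().findForcedMerges(infos, maxSegmentCount, segmentsToMerge, ctx);
    }

    @Override
    public MergeSpecification findForcedDeletesMerges(SegmentInfos infos, MergeContext ctx)
            throws IOException {
        return delegate().findForcedDeletesMerges(infos, ctx);
    }
}
```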

By default, a segment can have 50% of deleted documents at max ! index.merge.policy.max_merged_segment setting has 50% value.

Just to clarify here, the DEFAULT_DELETE_PCT_ALLOWED is 20% which corresponds with the Lucene default. index.merge.policy.max_merged_segment can't exceed 50%.

RS146BIJAY commented 11 months ago

Ack. Will try to separate out discussions for Workload based adaptive merge policy with Fluid LSM merge policy that we suggested in this proposal.

gashutos commented 11 months ago

For https://github.com/opensearch-project/OpenSearch/issues/10340#issuecomment-1765689620, where we updated about 20% of the documents, we saw this issue. The 5 GB limit on the max segment size was causing larger segments (around 5 GB in size) to accumulate more obsolete entries (around 60% obsolete entries in a single segment), which started impacting search performance. Essentially, larger segments are merged less frequently or not at all, causing search performance degradation.

PS: For updating 20% of entries, we ran an update-by-query on the nyc_taxis index where vendor_id = 1, with a max doc count parameter.

@RS146BIJAY These are skewed results, because deleted documents are concentrated very heavily in a few segments due to the way we updated documents (a couple of segments with 60% deleted documents while the rest of the segments don't have any deleted entries). I think we need to see how search performance behaves with different percentages of deleted documents. I would suggest benchmarking and publishing numbers for sort/range/aggregation/match queries on the nyc_taxis workload for segments containing 15%, 30%, 45%, ... deleted entries. That will help us decide on a concrete approach as well as the direction to look in next.

Just to clarify here, the DEFAULT_DELETE_PCT_ALLOWED is 20% which corresponds with the Lucene default. index.merge.policy.max_merged_segment can't exceed 50%.

@nknize If a segment is already 5 GB, I think index.merge.policy.deletes_pct_allowed won't be honored in that case. As mentioned above, @RS146BIJAY is seeing around 60% deleted entries per segment.

reta commented 11 months ago

@nknize If a segment is already 5 GB, I think index.merge.policy.deletes_pct_allowed won't be honored in that case. As mentioned above, @RS146BIJAY is seeing around 60% deleted entries per segment.

@gashutos I think it was addressed recently by https://github.com/opensearch-project/OpenSearch/pull/10036, no?

msfroh commented 11 months ago

@gashutos I think it was addressed recently by https://github.com/opensearch-project/OpenSearch/pull/10036, no?

That just avoids going over 5GB (or the configured max segment size) when using only_expunge_deletes.

In other words, you can use only_expunge_deletes as a way of escaping from a "too many deletes" scenario without accidentally creating segments that are too big to participate in normal merges. If you're over the max segment size (e.g. because you did a force merge without only_expunge_deletes or because you lowered the max segment size), the behavior won't change I believe.

RS146BIJAY commented 10 months ago

@gashutos I believe index.merge.policy.deletes_pct_allowed applies at the shard level, not the segment level (in the docs, "index" means a Lucene index). I validated this by creating an index (3 primaries, 1 replica) and updating 20% of entries: the 20% limit was maintained at the individual shard level even though obsolete entries in individual segments went above 40%. The check inside the merge policy is also done at the individual shard level (totalDelPct <= deletesPctAllowed), so in the case above, large segments (> 5 GB) with over 60% obsolete entries are not allowed to merge, because the total delete percentage of the corresponding shard is still less than 20% and the segment is excluded from merging.

gashutos commented 10 months ago

@gashutos I believe index.merge.policy.deletes_pct_allowed applies at the shard level, not the segment level (in the docs, "index" means a Lucene index). I validated this by creating an index (3 primaries, 1 replica) and updating 20% of entries: the 20% limit was maintained at the individual shard level even though obsolete entries in individual segments went above 40%. The check inside the merge policy is also done at the individual shard level (totalDelPct <= deletesPctAllowed), so in the case above, large segments (> 5 GB) with over 60% obsolete entries are not allowed to merge, because the total delete percentage of the corresponding shard is still less than 20% and the segment is excluded from merging.

We should try to tune this limit at the segment level (in the new implementation).
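
A hypothetical per-segment variant of that check might look roughly like this (names are illustrative; the existing policy gates on the shard-wide percentage as described above):

```java
// Hypothetical per-segment delete check: in addition to the shard-wide
// totalDelPct gate, flag any individual segment whose own delete ratio is
// above the limit so it stays eligible for merging. Illustrative only.
final class SegmentDeleteCheckSketch {
    static boolean segmentNeedsDeleteReclaim(int segDelCount, int segMaxDoc, double deletesPctAllowed) {
        double segDelPct = 100.0 * segDelCount / segMaxDoc; // deletes within this one segment
        return segDelPct > deletesPctAllowed;               // e.g. 20.0, mirroring the index-level default
    }
}
```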