Open RS146BIJAY opened 11 months ago
Would appreciate feedback from @shwetathareja, @nknize, @Bukhtawar, @sachinpkale, @muralikpbhat, @reta and @backslasht as well, so tagging more folks for visibility. Thanks!
...data is range partitioned into multiple SST files.
This is copy/paste from RocksDB? I'm assuming you mean multiple Lucene "segments" instead of SST files?
...data on disk are organized in multiple levels.
Am I tracking correctly that Lucene segments at Li are only merged with segments at Li+1 once the cumulative data size of Li reaches its upper limit?
It enables switching between different merge policies as per the workload.
I really like the idea of an adaptive merge policy! It falls in the category of "auto tuning" which is especially nice for serverless.
The concern I have here is introducing the idea of levels. TieredMergePolicy is (by design) a greedy policy that tries to fill segments up to their max size (5gb). To this end it can merge non-adjacent segments in an overall attempt to minimize the total number of segments. This was chosen by design since more segments typically result in higher search latencies. Perhaps concurrent search may help reduce these latencies (and thus lessen these concerns), but it seems to me that introducing levels moves back to a design more similar to LogByteSizeMergePolicy, where only segments within two adjacent levels are candidates for merging. It seems that design means smaller segments (e.g., recently flushed segments in L0) will not be considered for merging with larger ones, so those optimizations are lost?
@mikemccand wrote the book on these policies so I'm curious what he thinks.
Thanks @RS146BIJAY +1 on adaptive merge policy. I think the overall proposal is a step in the right direction. It would be good to see some quick PoC and numbers around the algorithm changes and corresponding workloads that get to benefit from it.
Nice, looks promising for workloads where update/delete is very frequent. Thanks @RS146BIJAY.
By default, a segment can have at most 50% deleted documents! The index.merge.policy.max_merged_segment setting has a 50% value.
With 50% deleted documents, this is a huge overhead on search operations, as decoding from postings has to happen even for deleted documents, and they occupy additional memory.
As @Bukhtawar mentioned, let's first measure the maximum opportunity we have here: keep 50% updated documents in a segment and compare it with a similar segment with 0 deleted documents (a similar number of live docs). This will tell us the maximum benefit the entire effort can yield. I would start with search latency numbers for comparison. We can later look at the trade-off of additional segment merge operations due to the suggested fluid merge policy.
Also, it looks like the primary motive of the proposal is to reclaim deleted documents as much as we can by tuning merges. TieredMergePolicy also provides the index.merge.policy.reclaim_deletes_weight setting, which gives more weight to segments with more deleted documents. We might want to tune TieredMergePolicy itself w.r.t. current deleted-document statistics and dynamically change this setting to reclaim deleted documents.
Thanks @RS146BIJAY for the proposal.
It would be great if you can share the gains/improvements that you are observing when using the Leveled Merge Policy.
The results below show how obsolete entries degrade search performance, comparing search latency with no obsolete entries vs. search latency with 20% obsolete entries (the maximum obsolete entries OpenSearch can have). We will use these numbers as the baseline for evaluating our level merge policy implementation. The idea is that the new merge policy can merge aggressively to keep obsolete entries low when search traffic is high, so the maximum gain we can get from the new merge policy is bounded by the performance degradation the current merge policy suffers due to obsolete entries.
We used the nyc_taxis track for the run below:
Aggregation | Metric | With 0% obsolete entries | With 20% obsolete entries | % degradation in latency |
---|---|---|---|---|
default | 50th percentile latency | 19.94 | 20.132 | 0.96289 |
default | 90th percentile latency | 20.828 | 22.258 | 6.86576 |
default | 99th percentile latency | 22.26 | 27.882 | 25.25606 |
default | 100th percentile latency | 22.636 | 29.386 | 29.81976 |
range | 50th percentile latency | 181.674 | 218.144 | 20.07442 |
range | 90th percentile latency | 184.286 | 222.59 | 20.78508 |
range | 99th percentile latency | 189.898 | 231.756 | 22.04236 |
range | 100th percentile latency | 194.306 | 236.372 | 21.64936 |
autohisto_agg | 50th percentile latency | 527.08 | 586.344 | 11.24383 |
autohisto_agg | 90th percentile latency | 528.95 | 589.13 | 11.37726 |
autohisto_agg | 99th percentile latency | 534.092 | 596.522 | 11.689 |
autohisto_agg | 100th percentile latency | 539.318 | 601.18 | 11.47041 |
date_histogram_agg | 50th percentile latency | 532.018 | 607.402 | 14.16945 |
date_histogram_agg | 90th percentile latency | 533.4 | 611.264 | 14.59768 |
date_histogram_agg | 99th percentile latency | 534.862 | 615.84 | 15.13998 |
date_histogram_agg | 100th percentile latency | 537.658 | 624.888 | 16.22407 |
The idea is that the new merge policy can merge aggressively to keep obsolete entries low when search traffic is high.
Are you proposing to dynamically close and re-open the IndexWriter based on search load? Or change OpenSearch's default MergePolicy with an Adaptive one that delegates to TieredMergePolicy or LogByteSizeMergePolicy and auto-tunes those policy settings based on search load? How do you plan to reproducibly benchmark those real world dynamics?
Maybe there's an opportunity to add some auto-tuning logic to OpenSearchConcurrentMergeScheduler based on load?
Also curious on answers to some of the previous questions re: TieredMergePolicy's greedy design at odds with the proposed level groups thus preventing large segments from merging with smaller ones (please correct me if I'm tracking the proposal incorrectly).
Curious to get some other folks looking at this as well so they can track and weigh in and ask additional questions. /cc @rishabh6788 @msfroh @andrross @mch2 @dreamer-89
Tagging the correct Rishabh @rishabhmaurya :)
Thank you! I think I made that mistake before 🤦🏻♂️ Sorry about that!
@RS146BIJAY Thanks for looking into ideas to improve merge policies. I think there is nothing OpenSearch-exclusive here, so moving it to Lucene would get us more diverse opinions.
In your proposal there are a lot of terms copied from the RocksDB paper that don't translate directly to Lucene. RocksDB is a key-value store, and all the query types you have mentioned (Update, PointLookUp and RangeLookup) are performed much differently in RocksDB compared to Lucene, so copying the comparison here is of no value, as read amplification would be different for the two systems. Also, the SST files are sorted by keys and ordered by key ranges, and RocksDB supports updating keys in L0. These differences in the underlying system are important when designing a merge policy, in my opinion. Also, the concept of tiering in this paper seems slightly different from what we have in Lucene's TieredMergePolicy, as we have a lot of knobs to control, like max segments per tier, max segment size and merge factor.
The main difference I see between the two approaches is that RocksDB's level merge policy tries to merge a segment from a lower level into the segments at the next level, in other words merging relatively small segment(s) with larger ones at the next level. I agree this would clean up a lot of obsolete docs compared to the tiered merge policy (which, for lower write amplification, always considers segments at the same level and of similar sizes for merging), but it trades off write amplification and cascading merges of segments down the levels (the tiered merge policy is pretty good at this, as it works on a budget model). Also, TieredMergePolicy already favors segments where the reclaim from deleted documents is high, so it already tries to address the problem. As mentioned earlier by @gashutos, I would work backward from the problem we are addressing: if it's update/read heavy, I would start by customizing the existing TieredMergePolicy to reclaim deleted docs more aggressively (if it's not doing so already, or if that isn't possible by tweaking one of the existing settings) instead of introducing a whole new merge policy.
I agree with @nknize that the idea of having an adaptive merge policy (or adaptive merge policy settings) depending on the workload is quite fascinating. I would love to hear the answer to the question Nick raised about how the auto-tuning will work internally.
PS: I think both discussions could be separated out: 1. use of a Level merge policy / optimizing the merge policy for read/update-heavy use cases; 2. use of an adaptive merge policy.
Are you proposing to dynamically close and re-open the IndexWriter based on search load?
I did not entirely get this. Did you mean dynamically closing and re-opening the IndexReader, which happens during reader refresh? Or is this about changing the underlying merge policy and closing and reopening the IndexWriter?
Or change OpenSearch's default MergePolicy with an Adaptive one that delegates to TieredMergePolicy or LogByteSizeMergePolicy and auto-tunes those policy settings based on search load?
The proposal is to have a single new Adaptive merge policy. By Adaptive I mean a single policy that can behave as both a Tiered and a Leveled merge policy (as per the workload) by controlling the frequency of merging at the initial L - 1 levels and the last level L separately. We control the number of segments in the initial L - 1 levels by a parameter K and at the last level by a parameter Z. This Fluid merge policy will initially have the configuration of Lazy Leveling, with more segments at the initial levels and only one segment at the last level. Now suppose read traffic is high: the policy merges more aggressively at the initial L - 1 levels, making it behave like a Leveled merge policy. In case write traffic goes high, it can reduce the number of merges at the last level, making it behave like a Tiered merge policy.
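A purely illustrative sketch of this idea (not OpenSearch code; the class name, thresholds, and the read-share signal are all made up for illustration): the same policy object shifts behavior by retuning K and Z from observed traffic, rather than by swapping policy classes.

```java
public class AdaptiveTuner {
    final int T;  // size ratio between adjacent levels
    int K, Z;     // run bounds for levels 1..L-1 and for level L

    AdaptiveTuner(int t) { T = t; K = t - 1; Z = 1; } // start as lazy leveling

    // readShare: fraction of recent traffic that is search, in [0, 1].
    // The 0.8 / 0.2 thresholds are invented for this sketch.
    void retune(double readShare) {
        if (readShare > 0.8) {        // read-heavy: merge eagerly everywhere
            K = 1; Z = 1;             // behaves like leveling
        } else if (readShare < 0.2) { // write-heavy: merge lazily, even at level L
            K = T - 1; Z = T - 1;     // behaves like tiering
        } else {
            K = T - 1; Z = 1;         // stay at lazy leveling
        }
    }

    public static void main(String[] args) {
        AdaptiveTuner tuner = new AdaptiveTuner(10);
        tuner.retune(0.95);
        System.out.println("read-heavy  -> K=" + tuner.K + ", Z=" + tuner.Z);
        tuner.retune(0.05);
        System.out.println("write-heavy -> K=" + tuner.K + ", Z=" + tuner.Z);
    }
}
```

In a real implementation the signal would come from node-level search/indexing metrics, and retuning would have to be rate-limited to avoid thrashing merges.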
PS: I am mentioning K and Z in terms of segments here; ideally they should be in terms of runs (a run is basically a collection of segments). Also, the Tiered merge policy mentioned here is slightly different from the one Lucene is using, hence the phrasing "Lucene uses an implementation of Tiered Merge policy".
Maybe there's an opportunity to add some auto-tuning logic to OpenSearchConcurrentMergeScheduler based on load?
I think the confusion here is that by Adaptive we mean switching between merge policies as per load. This is not the case. Adaptive merge policy means the same new merge policy will behave like Tiered, Leveled or Lazy Leveling as required.
TieredMergePolicy's greedy design at odds with the proposed level groups thus preventing large segments from merging with smaller ones
For the above benchmark results, where we updated about 20% of the documents, we saw this issue. The 5 GB limit on the max segment size was causing larger segments (around 5 GB in size) to accumulate more obsolete entries (around 60% obsolete entries in a single segment), which started impacting search performance. Essentially, this causes larger segments to be merged less frequently, or not merged at all, degrading search performance.
PS: For updating 20% of the entries, we ran an update-by-query command on the nyc_taxis index where vendor_id = 1, with a max doc count parameter.
Or this is for changing underlying merge policy and closing and reopening IndexWriter?
Merge policies and schedulers are set on the IndexWriter, which is configured through the OpenSearch Engine. I thought you were suggesting an adaptive merge scheduler approach that would use a new LevelMergePolicy, similar to OpenSearchTieredMergePolicy, that groups segments and delegates to and tunes the settings of the optimal MergePolicy based on search AND indexing workload. In this case, I agree w/ @rishabhmaurya that these conversations can be teased into separate discussion topics. It seems this proposal is mostly focused on reclaiming deletes, which has seen quite a bit of improvement over the last few years. Have a look at the simple LUCENE-10574 PR, for example. It would be good to open this discussion upstream and propose this for Lucene in general to get feedback from other committers. You can also prototype and unit test your proposal quicker using BaseMergePolicyTestCase, which does a good job of simulating some extreme and esoteric cases.
By default, a segment can have 50% of deleted documents at max ! index.merge.policy.max_merged_segment setting has 50% value.
Just to clarify here, the DEFAULT_DELETE_PCT_ALLOWED is 20%, which corresponds with the Lucene default. index.merge.policy.deletes_pct_allowed can't exceed 50%.
Ack. Will try to separate out discussions for Workload based adaptive merge policy with Fluid LSM merge policy that we suggested in this proposal.
For https://github.com/opensearch-project/OpenSearch/issues/10340#issuecomment-1765689620, where we updated about 20% of the documents, we saw this issue. The 5 GB limit on the max segment size was causing larger segments (around 5 GB in size) to accumulate more obsolete entries (around 60% obsolete entries in a single segment), which started impacting search performance. Essentially, this causes larger segments to be merged less frequently, or not merged at all, degrading search performance.
PS: For updating 20% of the entries, we ran an update-by-query command on the nyc_taxis index where vendor_id = 1, with a max doc count parameter.
@RS146BIJAY These are skewed results, because deleted documents are concentrated in a few segments given the way we updated documents (like a couple of segments with 60% deleted documents while the rest of the segments don't have any deleted entries). I think we need to see how search performance behaves with different percentages of deleted documents. I would suggest benchmarking and posting numbers for sort/range/aggregation/match queries on the nyc_taxis workload for segments containing 15%, 30%, 45%, ... deleted entries. That will help us decide on the concrete approach as well as the direction of what to look at next.
Just to clarify here, the DEFAULT_DELETE_PCT_ALLOWED is 20%, which corresponds with the Lucene default. index.merge.policy.deletes_pct_allowed can't exceed 50%.
@nknize If a segment is already 5 GB, I think index.merge.policy.deletes_pct_allowed won't be honored in that case. As mentioned above, @RS146BIJAY is seeing around 60% deleted entries per segment.
@nknize If a segment is already 5 GB, I think index.merge.policy.deletes_pct_allowed won't be honored in that case. As mentioned above, @RS146BIJAY is seeing around 60% deleted entries per segment.
@gashutos I think it was addressed recently by https://github.com/opensearch-project/OpenSearch/pull/10036, no?
@gashutos I think it was addressed recently by https://github.com/opensearch-project/OpenSearch/pull/10036, no?
That just avoids going over 5GB (or the configured max segment size) when using only_expunge_deletes.
In other words, you can use only_expunge_deletes as a way of escaping from a "too many deletes" scenario without accidentally creating segments that are too big to participate in normal merges. If you're over the max segment size (e.g. because you did a force merge without only_expunge_deletes or because you lowered the max segment size), the behavior won't change, I believe.
@gashutos I believe index.merge.policy.deletes_pct_allowed applies at the shard level, not the segment level (in that doc, "index" means a Lucene index). I validated this by creating an index (3 primaries, 1 replica) and updating 20% of the entries: the 20% limit was maintained at the individual shard level even though obsolete entries in individual segments reached above 40%. The check inside the merge policy is also done at the individual shard level (totalDelPct <= deletesPctAllowed), so in the above case large segments (> 5 GB) with over 60% obsolete entries are not allowed to merge, because the total delete percentage of the corresponding shard is still less than 20% and the segment is excluded from merging.
@gashutos I believe index.merge.policy.deletes_pct_allowed applies at the shard level, not the segment level (in that doc, "index" means a Lucene index). I validated this by creating an index (3 primaries, 1 replica) and updating 20% of the entries: the 20% limit was maintained at the individual shard level even though obsolete entries in individual segments reached above 40%. The check inside the merge policy is also done at the individual shard level (totalDelPct <= deletesPctAllowed), so in the above case large segments (> 5 GB) with over 60% obsolete entries are not allowed to merge, because the total delete percentage of the corresponding shard is still less than 20% and the segment is excluded from merging.
In a new implementation, we should try to enforce this limit at the segment level.
Problem
Currently, OpenSearch uses an implementation of the Tiered merge policy for compaction. The Tiered merge policy is suited for write-intensive workloads, as it has lower write amplification. A disadvantage of the Tiered merge policy is that its read and space amplification are high. Merging a set of segments requires 3 times the total size of the segments being merged (as compound files are enabled), increasing space amplification. Also, obsolete entries (created by updates or deletions) are removed from segments slowly in the Tiered merge policy (compared to the Level merge policy), increasing read amplification.
What are the other alternatives?
Leveled Merge Policy
The Leveled merge policy is used for read-intensive workloads. In leveled compaction, data on disk is likewise organized in multiple levels. The size of each level is maintained at T times that of the previous level (generally T = 10). A special level-0 (or L0 for short) contains files just flushed from the in-memory write buffer (memtable). All non-zero levels have target sizes, which are usually exponentially increasing. Inside each level (except level 0), data is range-partitioned into multiple SST files.
A segment merge gets triggered when the data size of a level reaches its upper limit. In that case, segments at the current level are pushed to and merged with the segments of the next level, and the resultant segments are placed in the next level. Since merging at the current level happens every time the previous level fills up and a new set of segments is pushed to this level, read amplification (since there will be fewer obsolete entries) and space amplification (since only a small set of segments is pushed and merged from the previous level) are lower compared to the Tiered merge policy.
The issue with this merge policy is that write amplification is very high, which is why it is suited only for read-intensive workloads.
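The cascading trigger described above can be sketched with a tiny simulation (illustrative only, not Lucene or RocksDB code; `BASE`, `T`, and the flush sizes are arbitrary units): each level's capacity grows by a factor of T, and overflowing a level pushes its contents into the next one.

```java
import java.util.Arrays;

public class LeveledSim {
    static final int T = 10;      // size ratio between adjacent levels
    static final long BASE = 100; // capacity of L0 in arbitrary units

    final long[] levelSize;       // current data size per level

    LeveledSim(int levels) { levelSize = new long[levels]; }

    // Flush newly written data into L0, then cascade: any level that
    // exceeds its capacity merges its data into the next level down.
    void flush(long amount) {
        levelSize[0] += amount;
        for (int i = 0; i < levelSize.length - 1; i++) {
            long capacity = BASE * pow(T, i); // capacities: 100, 1000, 10000, ...
            if (levelSize[i] > capacity) {
                levelSize[i + 1] += levelSize[i]; // merge into next level
                levelSize[i] = 0;
            }
        }
    }

    static long pow(long base, int exp) {
        long r = 1;
        for (int j = 0; j < exp; j++) r *= base;
        return r;
    }

    public static void main(String[] args) {
        LeveledSim sim = new LeveledSim(4);
        for (int i = 0; i < 25; i++) sim.flush(50); // 25 flushes of 50 units
        System.out.println(Arrays.toString(sim.levelSize)); // [50, 150, 1050, 0]
    }
}
```

Note how most of the data quickly ends up at deeper levels: every flush can trigger a chain of merges, which is exactly the write amplification the section above warns about.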
Leveled Merge Policy with lazy leveling
The Leveled merge policy with lazy leveling improves write amplification for leveled merging by eliminating merging at all but the largest level. Lazy leveling at its core is a hybrid of leveling and tiering: it applies leveling at the largest level and tiering at all other levels.
In order to keep the cost complexity of point lookups fixed despite having more segments to probe at smaller levels, we keep the FPR (false positive rate of the Bloom filters) at lower levels better than the FPR at larger levels. Therefore, relative to leveling, lazy leveling (1) improves the cost complexity of updates, (2) maintains the same complexity for point lookups, long range lookups, and space amplification, and (3) provides competitive performance for short range lookups.
Worst case cost comparisons between merge policies for different operations
where,
- L = number of levels
- T = size ratio between adjacent levels
- M = main memory allocated to the Bloom filters
- N = total number of entries
- s = selectivity of a range lookup
- B = number of entries that fit into a storage block
- P = size of the buffer in storage blocks
Analysis summary
As can be inferred from the above analysis, no single merge policy dominates the others universally. Lazy leveling is best for combined workloads consisting of updates, point lookups and long range lookups, whereas tiering and leveling are best for workloads comprising mostly updates or mostly lookups, respectively.
Proposal - Fluid LSM merge policy
Fluid LSM is an adaptive merge policy that generalizes all the above merge policies (Tiered, Leveled and Lazy Leveling). It enables switching between different merge policies as per the workload. It does this by controlling the frequency of merge operations separately for the largest level and for all other levels.
In a Fluid LSM-tree, there are at most Z runs at the largest level and at most K runs at each of the smaller levels. To maintain these bounds, every level i has an active run into which we merge incoming runs from level i - 1. Each active run has a size threshold with respect to the capacity of its level (the total capacity comes out to T): T/K percent for levels 1 to L - 1 and T/Z percent for level L. When an active run reaches this threshold, we start a new active run at that level. Ultimately, when a level is at capacity, all runs in it get merged and flushed down to the next level.
The bounds K and Z are used as tuning parameters that enable Fluid LSM-tree to assume the behaviors of different merge policies.
Fluid LSM-tree can transition from Lazy Leveling to tiering by merging less frequently at the largest level by increasing Z, or it can transition to leveling by merging more frequently at all other levels by decreasing K.
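The mapping from (K, Z) settings to the classic policies (as in the Dostoevsky paper referenced below) can be made concrete with a small sketch; the class and method names are invented for illustration:

```java
public class FluidPolicy {
    final int T; // size ratio between adjacent levels
    final int K; // max runs at each of levels 1..L-1
    final int Z; // max runs at the largest level L

    FluidPolicy(int t, int k, int z) { T = t; K = k; Z = z; }

    // Which classic merge policy does this (K, Z) configuration behave like?
    String behavesLike() {
        if (K == 1 && Z == 1) return "leveling";          // merge eagerly at every level
        if (K == T - 1 && Z == T - 1) return "tiering";   // merge lazily at every level
        if (K == T - 1 && Z == 1) return "lazy leveling"; // lazy except the largest level
        return "hybrid";
    }

    public static void main(String[] args) {
        int T = 10;
        System.out.println(new FluidPolicy(T, 1, 1).behavesLike());
        System.out.println(new FluidPolicy(T, T - 1, T - 1).behavesLike());
        System.out.println(new FluidPolicy(T, T - 1, 1).behavesLike());
    }
}
```

Everything in between these three corner configurations is a hybrid, which is what lets a single policy move continuously along the read/write trade-off curve instead of jumping between implementations.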
Further steps
Reference Paper
https://scholar.harvard.edu/files/stratos/files/dostoevskykv.pdf