[Feature Request] Add an automatic force merge function for indices to reduce the segment count

kkewwei commented 2 weeks ago

Is your feature request related to a problem? Please describe

In our product, small indices (about 20 GB or less) are not written to or updated very often, so we have to run force merge frequently to reduce the number of segments and improve query performance. This is because OpenSearch uses TieredMergePolicy, which only merges segments of approximately equal size.

ISM provides the ability to force merge periodically, but that causes a lot of query glitches, and newly created segments still can't be merged quickly.

Describe the solution you'd like

Customizing/extending MergePolicy is a supported API in Lucene and is designed for users. Could we support another MergePolicy in OpenSearch that automatically merges as much as possible to reduce the number of segments?

If this is reasonable, I will follow up with a proposal for the rules that would trigger an automatic force merge.

Related component

Search:Performance

Describe alternatives you've considered

No response

Additional context

No response

andrross commented 1 week ago

@msfroh, can you follow up with any tuning parameters available for the existing merge policies which might be able to solve the problem here?

msfroh commented 1 week ago

Hey @kkewwei -- I think there might be a few knobs you can try tuning on TieredMergePolicy to merge small segments without making the merges too expensive:

index.merge.policy.floor_segment: This sets the size of the lowest "tier", where everything <= the value is considered part of that same tier and eligible to participate in a merge (where the output is one tier higher). The default is 2MB. If you increase it, then more small segments can be eligible to get merged. (I think something like 50 or 100MB is probably more reasonable for all but the tiniest indices.) Small segments may get merged multiple times as a result (e.g. 2MB segments get merged into a 20MB segment, then that gets merged with other 2MB segments to produce a 38MB segment, etc). Usually that's fine, though, as the merges under 100MB tend to be really fast.
index.merge.policy.segments_per_tier: This is the number of segments in the same tier that will get selected for a merge. The default value of 10 means that you need a lot of segments in the same tier before a merge kicks in. Lowering it to something like 5 will encourage small segments to get merged and generally reduce the overall segment count. It does mean that a little more overall compute effort will be spent on merging. For index-heavy workloads, maybe it's not worth it, but for search-heavy workloads, the lower value is usually better.

There might be some other parameters worth tuning, but I think those two would be a good start.
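As a rough sketch of how the two settings above could be applied, the snippet below builds a PUT request for the index settings API. The host (`http://localhost:9200`), the index name (`my-small-index`), and the helper names are hypothetical placeholders, and the values `50mb` and `5` are just the starting points suggested above, not tested recommendations. The request is only constructed here; sending it is left as a commented-out step.

```python
import json
import urllib.request


def build_merge_settings(floor_segment="50mb", segments_per_tier=5):
    """Return a settings body with the two TieredMergePolicy knobs.

    The default values are illustrative starting points, not tuned numbers.
    """
    return {
        "index.merge.policy.floor_segment": floor_segment,
        "index.merge.policy.segments_per_tier": segments_per_tier,
    }


def build_settings_request(base_url, index_name, settings):
    """Build (but do not send) a PUT /<index>/_settings request."""
    return urllib.request.Request(
        url=f"{base_url}/{index_name}/_settings",
        data=json.dumps(settings).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


if __name__ == "__main__":
    # Hypothetical cluster address and index name; adjust for your setup.
    request = build_settings_request(
        "http://localhost:9200", "my-small-index", build_merge_settings()
    )
    print(request.get_method(), request.full_url)
    print(request.data.decode("utf-8"))
    # To apply against a live cluster, uncomment:
    # urllib.request.urlopen(request)
```

Since the right values depend on the workload, it probably makes sense to change one setting at a time and watch the segment count and merge activity before adjusting further.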

kkewwei commented 1 week ago

@msfroh, thank you for your reply. I will test how far the segment count can be reduced with these two parameters.
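One possible way to run that comparison is to count segments per shard copy before and after changing the settings, using the rows returned by `GET _cat/segments/<index>?format=json`. The sketch below assumes each row carries `index`, `shard`, and `prirep` fields. The sample rows are made-up illustration data, not real cluster output, and `count_segments` is a hypothetical helper name.

```python
import json
from collections import Counter


def count_segments(cat_segments_rows):
    """Count segments per (index, shard, primary/replica) copy.

    Each row is expected to describe one segment, as in the JSON form
    of the _cat/segments output.
    """
    counts = Counter()
    for row in cat_segments_rows:
        counts[(row["index"], row["shard"], row["prirep"])] += 1
    return dict(counts)


# Made-up sample rows for illustration only (hypothetical index name).
SAMPLE_ROWS = json.loads("""
[
  {"index": "my-small-index", "shard": "0", "prirep": "p", "segment": "_0", "size": "1.8mb"},
  {"index": "my-small-index", "shard": "0", "prirep": "p", "segment": "_1", "size": "2.1mb"},
  {"index": "my-small-index", "shard": "0", "prirep": "p", "segment": "_2", "size": "19.7mb"},
  {"index": "my-small-index", "shard": "0", "prirep": "r", "segment": "_0", "size": "1.8mb"},
  {"index": "my-small-index", "shard": "0", "prirep": "r", "segment": "_3", "size": "21.5mb"}
]
""")

if __name__ == "__main__":
    for key, n in sorted(count_segments(SAMPLE_ROWS).items()):
        print(key, n)
```

Comparing these counts over time, with and without the tuned settings, would show whether periodic force merges are still needed.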
