travisbenedict opened this issue 2 years ago
I think I'm hitting this too... 8-10x search latency increase on 7.9 vs 7.1. Do certain types of queries / operations cause this more than others?
Also, from the Lucene bug I see this:
"I have added a second constructor that takes both the IndexInput and a boolean for on/off-heap. The first constructor will select off-heap iff the IndexInput extends ByteBufferIndexInput."
I'm wondering if there's any way to make IndexInput not an instance of ByteBufferIndexInput. I don't know the code, so I don't know what else it possibly could be, or how to get there, or what (if any) ramifications that might have.
It was made the default in LUCENE-9398, starting with Lucene 8.6. We need support in Lucene that can be exposed as a setting in OpenSearch for moving it on/off heap, depending on the use case.
+1 regarding a feature to enable/disable the BKDReader on-heap/off-heap behavior for select use cases. A few teams upgrading from ES 7.1 to ES 7.9 are reporting increased search latencies (as tracked via this issue).
My limited understanding is that the BKDReader data structure is used for indexing and searching multi-dimensional point data. Moving the BKDReader packed index off heap was introduced as an optimization to reduce memory usage and improve the overall scalability of Elasticsearch.
May I request additional insights and guidance on the topics mentioned below as part of this GitHub issue (or let me know if they should instead be discussed and addressed separately)?
More context is needed here. Direct memory (mmapped files) is subject to swapping. Was this tested with swapping enabled (which can seriously impact a node's performance) or disabled? For example:

- Was swapping enabled? Try rerunning with and without swap enabled and report findings.
- Try memory locking the process: `bootstrap.memory_lock: true`
- Were ulimits set to unlimited?
  `opensearch soft memlock unlimited`
  `opensearch hard memlock unlimited`
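One quick way to double-check that the memory lock actually took effect, assuming a node reachable on localhost:9200 (the host/port is just a placeholder), is the `mlockall` flag reported by the nodes info API:

```sh
# Ask each node whether memory locking took effect.
# filter_path trims the response down to the mlockall flag.
curl -s 'localhost:9200/_nodes?filter_path=**.mlockall&pretty'
# Each node should report "mlockall": true if bootstrap.memory_lock was applied.
```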
> Was swapping enabled?

No, swapping is disabled due to performance degradation concerns.
> Were ulimits set to unlimited?

Yes, they are:
```
% cat '/etc/security/limits.d/10-memlock.conf'
* hard memlock unlimited
* soft memlock unlimited
```
> In some cases this can cause the BKDReader to consume more CPU cycles and increase request latency

Can we get more information on the "some cases"? The flamegraph is pretty, but we need more context. Can we get a query profile before/after? Field mappings? Segment geometry?
More specifically, this change was made in Lucene which impacts not just OpenSearch but Solr and Elasticsearch. I have yet to see any concern from the latter two. It's possible they haven't hit it yet, but any changes to upstream Lucene will need more context with supporting benchmarks to clearly show this regression results from the BKD on/off heap changes.
With supporting query information it will also be easier to reproduce this using the Lucene benchmark tool, to support a change upstream. It's unfortunate no benchmarks were provided with the original change.
As long as the index is fully hot (is it here?), moving the BKD index off-heap should not cause anything near the performance regression that the o.p. flame charts seem to indicate. The BKD index is still in RAM in both cases, just accessed via slightly different APIs. Also, traversing the BKD index is not the hot spot of the overall `BKDReader.intersect` -- visiting the leaf blocks (which are off-heap in both cases) is where the heavy work happens.
Could you attach a delta flame-chart as well? That would show us precisely what's slower and what's faster. And, any way to attach or link to a fully interactive flame chart? We need the equivalent of Pastebin but slightly interactive so I can drill down on the flame chart.
Also note that in Lucene's nightly benchmarks, the IntNRQ query task tests BKD performance, and it does not show any regression in the June 20, 2020 time-frame: https://home.apache.org/~mikemccand/lucenebench/IntNRQ.html
Aside: it's nice to see the GC reduction from old to new Lucene versions in the o.p. flame charts.
> As long as the index is fully hot (is it here?)

Good question. I assumed it was during the run, but the description doesn't give any specifics on the configuration when the flame graph was obtained. Was this obtained from a single node in a cluster? Was this a single-node setup? What else was going on during the query (e.g., concurrent indexing), or is this isolated to the query only? Is there a hot threads dump?
The more details you can provide here, the better.
For the offending queries you might also try warming the page cache on those indexes using the `"index.store.preload": ["*"]` index setting to make sure the index is hot. Although if this regression is seen in a production cluster, I think that cache could be evicted based on the 30-second shard idle default. That might be another area to look at: increase `index.search.idle.after` to try and prevent shard idle (e.g., file cache evictions). I think if you set this to -1 it will disable it? (double check that)
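As a rough sketch, applying both of those settings at index-creation time could look like the following (the index name and the 24h value are placeholders, and since `index.store.preload` is a static setting it has to be set when the index is created or while it is closed):

```sh
# Placeholder index name; preload the index files into the page cache and
# stretch the search-idle window so the shard is less likely to go idle.
curl -XPUT 'localhost:9200/my-index' -H 'Content-Type: application/json' -d '
{
  "settings": {
    "index.store.preload": ["*"],
    "index.search.idle.after": "24h"
  }
}'
```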
Which `Directory` implementation is in use here @travisbenedict? If it's a buffered implementation (`SimpleFSDirectory` or `NIOFSDirectory`), can you try switching to `MMapDirectory` instead? The buffered reads are sometimes costly, e.g. see https://github.com/apache/lucene/issues/12356.
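In OpenSearch/Elasticsearch terms, the `Directory` implementation is chosen by the `index.store.type` setting, so one way to check and experiment might be something like the following (index names are placeholders, and the default store type depends on the version):

```sh
# Show which store type an existing index is using
# (include_defaults reveals the value even if it was never set explicitly).
curl -s 'localhost:9200/my-index/_settings?include_defaults=true&filter_path=**.store.type&pretty'

# index.store.type is static, so to try MMapDirectory (mmapfs) create a test
# index with it, or set it as a node-wide default in the yml config.
curl -XPUT 'localhost:9200/my-index-mmap' -H 'Content-Type: application/json' -d '
{ "settings": { "index.store.type": "mmapfs" } }'
```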
Also, if possible, please render flame charts to SVG so they remain interactive after attaching to GitHub issues.
**Is your feature request related to a problem? Please describe.**
In Lucene 8.4 the BKDReader packedIndex was moved off heap (LUCENE-8932). In some cases this can cause the BKDReader to consume more CPU cycles and increase request latency.
**Describe the solution you'd like**
Add a setting to allow the BKDReader packedIndex to be moved on heap.
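Purely for illustration, an opt-in per-index flag might look something like the sketch below (the setting name `index.bkd.packed_index_on_heap` is invented for this example; nothing like it exists today):

```sh
# Hypothetical only: a made-up per-index flag asking for the BKD packed index to stay on heap.
curl -XPUT 'localhost:9200/my-index' -H 'Content-Type: application/json' -d '
{ "settings": { "index.bkd.packed_index_on_heap": true } }'
```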
**Describe alternatives you've considered**
Move the BKDReader packedIndex on heap by default, but this would likely increase memory consumption for users.
**Additional context**
Similar issue: https://github.com/opensearch-project/OpenSearch/issues/825
Flame graphs are included for Elasticsearch versions 7.9 and 7.1:
- ES 7.9 (Lucene 8.6.2)
- ES 7.1 (Lucene 8.0.0)