travisbenedict opened this issue 2 years ago
I think I'm hitting this too... 8-10x search latency increase on 7.9 vs 7.1. Do certain types of queries / operations cause this more than others?
Also, from the Lucene bug I see this:
"I have added a second constructor that takes both the IndexInput and a boolean for on/off-heap. The first constructor will select off-heap iff the IndexInput extends ByteBufferIndexInput."
I'm wondering if there's any way to make IndexInput not an instance of ByteBufferIndexInput. I don't know the code, so I don't know what else it possibly could be, or how to get there, or what (if any) ramifications that might have.
It was made the default in LUCENE-9398, starting with Lucene 8.6. We need support in Lucene that can be exposed as a setting in OpenSearch for moving it on/off heap, depending on the use case.
+1 regarding a feature to enable/disable the BKDReader on-heap/off-heap behavior for select use cases. A few teams upgrading from ES 7.1 to ES 7.9 are reporting increased search latencies (as tracked via this issue).
My limited understanding is that the BKDReader data structure is used for indexing and searching multi-dimensional point data. Moving the BKDReader packed index off heap was introduced as an optimization to reduce memory usage and improve the overall scalability of Elasticsearch.
May I request additional insights and guidance on the topics mentioned below as part of this GitHub issue (or let me know if they should instead be discussed and addressed separately)?
More context is needed here. Direct memory (mmapped files) is subject to swapping. Was this tested with swapping enabled (which can seriously impact a node's performance) or disabled? For example:

- Was swapping enabled? Try rerunning with and without swap enabled and report findings.
- Try memory locking the process: `bootstrap.memory_lock: true`
- Were ulimits set to unlimited?
  `opensearch soft memlock unlimited`
  `opensearch hard memlock unlimited`
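One quick way to double-check that the memory lock actually took effect, assuming a node reachable on localhost:9200 (the host/port is just a placeholder), is the `mlockall` flag reported by the nodes info API:

```sh
# Ask each node whether memory locking took effect.
# filter_path trims the response down to the mlockall flag.
curl -s 'localhost:9200/_nodes?filter_path=**.mlockall&pretty'
# Each node should report "mlockall": true if bootstrap.memory_lock was applied.
```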
> Was swapping enabled?

No, swapping is disabled due to performance degradation concerns.
> Were ulimits set to unlimited?

Yes, they are:
```
% cat '/etc/security/limits.d/10-memlock.conf'
* hard memlock unlimited
* soft memlock unlimited
```
> In some cases this can cause the BKDReader to consume more CPU cycles and increase request latency

Can we get more information on the "some cases"? The flamegraph is pretty, but we need more context. Can we get a query profile before/after? Field mappings? Segment geometry?
More specifically, this change was made in Lucene which impacts not just OpenSearch but Solr and Elasticsearch. I have yet to see any concern from the latter two. It's possible they haven't hit it yet, but any changes to upstream Lucene will need more context with supporting benchmarks to clearly show this regression results from the BKD on/off heap changes.
With supporting query information it will also be easier to reproduce this using the Lucene benchmark tool, to support a change upstream. It's unfortunate no benchmarks were provided with the original change.
As long as the index is fully hot (is it here?), moving the BKD index off-heap should not cause anything near the performance regression that the o.p. flame charts seem to indicate. The BKD index is still in RAM in both cases, just accessed via slightly different APIs. Also, traversing the BKD index is not the hot spot of the overall `BKDReader.intersect` -- visiting the leaf blocks (which are off-heap in both cases) is where the heavy work happens.
Could you attach a delta flame-chart as well? That would show us precisely what's slower and what's faster. And, any way to attach or link to a fully interactive flame chart? We need the equivalent of Pastebin but slightly interactive so I can drill down on the flame chart.
Also note that in Lucene's nightly benchmarks, the IntNRQ query task tests BKD performance, and it does not show any regression in the June 20, 2020 time-frame: https://home.apache.org/~mikemccand/lucenebench/IntNRQ.html
Aside: it's nice to see the GC reduction from old to new Lucene versions in the o.p. flame charts.
> As long as the index is fully hot (is it here?)

Good question. I assumed it was during the run, but the description doesn't give any specifics on the configuration when the flame graph was obtained. Was this obtained from a single node in a cluster? Was this a single-node setup? What else was going on during the query (e.g., concurrent indexing), or is this isolated to the query only? Is there a hot threads dump?
The more details you can provide here, the better.
For the offending queries you might also try warming the page cache on those indexes using the `"index.store.preload": ["*"]` index setting to make sure the index is hot. Although if this regression is seen in a production cluster, I think that cache could be evicted based on the 30-second shard idle default. That might be another area to look at: increase `index.search.idle.after` to try and prevent shard idle (e.g., file cache evictions). I think if you set this to -1 it will disable it? (double check that)
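As a rough sketch, applying both of those settings at index-creation time could look like the following (the index name and the 24h value are placeholders, and since `index.store.preload` is a static setting it has to be set when the index is created or while it is closed):

```sh
# Placeholder index name; preload the index files into the page cache and
# stretch the search-idle window so the shard is less likely to go idle.
curl -XPUT 'localhost:9200/my-index' -H 'Content-Type: application/json' -d '
{
  "settings": {
    "index.store.preload": ["*"],
    "index.search.idle.after": "24h"
  }
}'
```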
Which `Directory` implementation is in use here @travisbenedict? If it's a buffered implementation (`SimpleFSDirectory` or `NIOFSDirectory`), can you try switching to `MMapDirectory` instead? The buffered reads are sometimes costly, e.g. see https://github.com/apache/lucene/issues/12356.
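In OpenSearch/Elasticsearch terms, the `Directory` implementation is chosen by the `index.store.type` setting, so one way to check and experiment might be something like the following (index names are placeholders, and the default store type depends on the version):

```sh
# Show which store type an existing index is using
# (include_defaults reveals the value even if it was never set explicitly).
curl -s 'localhost:9200/my-index/_settings?include_defaults=true&filter_path=**.store.type&pretty'

# index.store.type is static, so to try MMapDirectory (mmapfs) create a test
# index with it, or set it as a node-wide default in the yml config.
curl -XPUT 'localhost:9200/my-index-mmap' -H 'Content-Type: application/json' -d '
{ "settings": { "index.store.type": "mmapfs" } }'
```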
Also, if possible, please render flame charts to SVG so they remain interactive after attaching to GitHub issues.
**Is your feature request related to a problem? Please describe.**
In Lucene 8.4 the BKDReader packedIndex was moved off heap (LUCENE-8932). In some cases this can cause the BKDReader to consume more CPU cycles and increase request latency.
**Describe the solution you'd like**
Add a setting to allow the BKDReader packedIndex to be moved on heap.
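Purely for illustration, an opt-in per-index flag might look something like the sketch below (the setting name `index.bkd.packed_index_on_heap` is invented for this example; nothing like it exists today):

```sh
# Hypothetical only: a made-up per-index flag asking for the BKD packed index to stay on heap.
curl -XPUT 'localhost:9200/my-index' -H 'Content-Type: application/json' -d '
{ "settings": { "index.bkd.packed_index_on_heap": true } }'
```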
**Describe alternatives you've considered**
Move the BKDReader packedIndex on heap by default, but this would likely increase memory consumption for users.
**Additional context**
Similar issue: https://github.com/opensearch-project/OpenSearch/issues/825
Flame graphs are included for Elasticsearch versions 7.9 and 7.1:
- ES 7.9 (Lucene 8.6.2)
- ES 7.1 (Lucene 8.0.0)