Open expani opened 6 months ago
@peternied I will run some OSB benchmarks with my changes to Lucene core and update here with the findings.
Performed a POC on OS 2.13 ( Lucene 9.10 ) by prefix encoding the docIds for 24 and 32 bit cases.
kdd
file size is seeing a reduction of 17% on NYC Taxi index based out of OSB without any segment merges.
Baseline OS 2.13 => 5.603 GB
ll DocIdsBenchmark/opensearch-2.13.0-unpatch/data/nodes/0/indices/byUCjsO_T_yn2dBgs6Owaw/0/index/ | grep kdd
-rw-r--r--. 1 ec2-user ec2-user 1480153195 May 28 15:07 _7w.kdd
-rw-r--r--. 1 ec2-user ec2-user 1154177696 May 28 15:10 _cf.kdd
-rw-r--r--. 1 ec2-user ec2-user 1482800206 May 28 15:19 _k5.kdd
-rw-r--r--. 1 ec2-user ec2-user 1486112481 May 28 15:25 _q2.kdd
Prefix Encoded OS 2.13 => 4.663 GB
ll DocIdsBenchmark/opensearch-2.13.0-patch/data/nodes/0/indices/7UMER0DZRuCOviarxAQmOw/0/index/ | grep kdd
-rw-r--r--. 1 ec2-user ec2-user 1149777642 May 28 12:56 _5k.kdd
-rw-r--r--. 1 ec2-user ec2-user 1290313662 May 28 13:04 _d8.kdd
-rw-r--r--. 1 ec2-user ec2-user 1299432668 May 28 13:13 _lh.kdd
-rw-r--r--. 1 ec2-user ec2-user 924449744 May 28 13:14 _ns.kdd
I am seeing poor performance across different queries and the reason looks like CPU branch mis-prediction.
Async profiler command used
./bin/asprof `jps | grep OpenSearch | awk '{print $1}'` -e branch-misses -f unpatch_branch_miss_range_nyctaxi_3.html -d 150
Flamegraph of CPU Branch miss for baseline OS 2.13
Overall Branch mis-prediction in readInts24
and readints32
is cumulatively around 6-7%
Flamegraph of CPU Branch miss for prefix encoding OS 2.13
Overall Branch mis-prediction in readIntsPrefix
( new method that decodes prefix encoded docIds ) is cumulatively around 25-30%
I will try to explore the possibility of changing the decoding logic for unpacking prefix encoded docIds ( will try changing encoding later if this is successful ) to not use any conditional (if-else) statements.
TL;DR
Performance gain in reading less data from the disk is getting offset by the increase in time to decode docIds ( involves bitwise and arithmetic operators like + and - )
If the decoding can be reduced to 1/2 operators ( preferably bitwise ) per docId then the decoding time can be made negligible similar to readInts24
in the existing implementation.
DETAILS
I tried 2 more approaches of decoding the docIds :
All the approaches tried so far take longer than the gains achieved by reading less data from Disk. The configuration of index.store.type
was changed to niofs
to always enable disk reads.
For the range query benchmark in nyc_taxis
on only using the optimisation in readInts32
, the difference b/w the existing and common-prefix approaches are as follows :
Please note this is only for a sub-section of the flamegraph ( the biggest part ) but can be seen throughout.
EXISTING
It takes around 15% for the readInts32
to load leaf blocks from disk into an in-memory array ( Worse )
STORING COMMON PREFIX
It takes around 11% for the readInts32CommonPrefix
to load leaf blocks from disk into an in-memory array ( Better than 4% )
It takes around 9% exclusively for decoding the packed DocIds after the leaf blocks are read into memory ( Worse )
Considering these, the 4% gain in loading less data from disk in the common prefix approach is getting offset by the time taking to decode the docIds by around 5%.
On closer inspection of the existing implementation, the range query docId traversal is being performed using getInverseIntersectVisitor
which only visit docIds that don't match. This involves clearing bits from FixedBitSet#clearAll
as can be seen in the flamegraph's above.
This exclusive operation of FixedBitSet#clear
is taking around 7.5-8% as compared to 9% of decoding docIds. This only involves 4 bitwise operations ( 2 shifts, 1 & and 1 two's complement ) per docId.
I am purposefully ignoring the for loop operators as they seem to have negligible overhead ( verified by breaking down readInts24
further )
public void clearAll(IntsRef ref) {
for (int i = ref.offset; i < ref.offset + ref.length; i++) {
int index = ref.ints[i];
int wordNum = index >> 6;
long bitmask = 1L << index;
bits[wordNum] &= ~bitmask;
}
}
So, this helps us in reaching the conclusion that unless we change the decoding to use 1-2 bitwise/fast arithmetic (like +/- ) operators per docId ( not sure if possible with common prefix approach ) it will be hard to offset the performance gains achieved by reading less data from disk.
All the new changes done since the last update can be found with this commit
Opened https://github.com/apache/lucene/pull/13521 in Lucene to introduce BPV_21 for BKD Tree DocId encodings
Is your feature request related to a problem? Please describe
Lucene represents the docIds using 24 bits if the maximum docId within a leaf is <=
16,777,215
here else it represents the docId as a regular integer using 32 bits without any encoding.A lucene segment can have around 2 billion docs and a significant amount of docIds might be represented using the 24 and 32 bits representation.
Most of the time spent in range queries are while collecting docIds readDelta16,
readInts24 and readInts32
If we can reduce the amount of bytes used to represent docIds, we can improve the reading time ( search ) especially for range queries over larger ranges and reduce the index size.
Some existing tickets that have already discussed this area are :
Describe the solution you'd like
I was trying to find the common prefix amongst the docIds when they are represented using 24 and 32 bit variation based on the NYC Taxi data for the field
fare_amount
with around 20 million docs.I found that a lot of leaf blocks that were represented using 24 bits had common prefixes ( from MSB/Leftmost bit ) ranging from 8bits to 15 bits. Having 8 bits common doesn't really help in encoding as the docIds will be represented using 24 bit. But, having 15 common bits can help us save 7 bits per docId in a leaf of 512 docIds reducing the number of bytes used to represent a leaf block by 445 bytes. ((512* 7 ) / 8 ) = 448- 3 bytes ( used to represent the common prefix as VInt)
Similarly when performing a force merge on the segment, Lucene also used 32 bits representation for around 1000 leaf blocks and I saw common prefixes ranging from 7-15 bits amongst them. The best case of 15 bit common prefix in this case would save 956 bytes per leaf block.
I am planning to introduce a new encoding for 24 and 32 bit representation which will also store the common prefix and benchmark the changes in indexing and search performance.
Related component
Search:Performance
Describe alternatives you've considered
No response
Additional context
No response