mikemccand / stargazers-migration-test

Testing Lucene's Jira -> GitHub issues migration

Optimise SegmentTermsEnum.seekExact performance [LUCENE-8980] #977

Closed mikemccand closed 4 years ago

mikemccand commented 4 years ago

Description

In Elasticsearch, which is built on Lucene, each document has an indexed _id field that uniquely identifies it. When Elasticsearch uses the _id field to find a document, Lucene has to check every segment of the index. When the _id values are largely sequential, this lookup can be optimised.

Solution

Since Lucene stores the min/max term for each segment and field, we can use these values to speed up Lucene's lookup API. When SegmentTermsEnum.seekExact() is called to look up a term in an index, we can first check whether the term falls within the range between minTerm and maxTerm, so that segments that cannot contain the term are skipped as early as possible. This improvement benefits the Elasticsearch read/write APIs and the Lucene lookup API.
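The range check described above can be sketched without any Lucene dependency. This is a minimal illustration, not the code from the PR: `SegmentTermsSketch`, `inRange` and the sorted `byte[][]` dictionary are hypothetical stand-ins for one segment's term dictionary, and the binary search stands in for the real block tree traversal. In Lucene itself the per-segment bounds are available through `Terms.getMin()` and `Terms.getMax()`.

```java
import java.util.Arrays;

/** Hypothetical stand-in for the terms of one field in one segment. */
final class SegmentTermsSketch {
    // Terms sorted in unsigned byte order (assumption: same order Lucene uses for terms).
    private final byte[][] sortedTerms;

    SegmentTermsSketch(byte[][] sortedTerms) {
        this.sortedTerms = sortedTerms;
    }

    /** Cheap pre-check: false means the target cannot exist in this segment. */
    boolean inRange(byte[] target) {
        if (sortedTerms.length == 0) {
            return false;
        }
        byte[] minTerm = sortedTerms[0];
        byte[] maxTerm = sortedTerms[sortedTerms.length - 1];
        return Arrays.compareUnsigned(target, minTerm) >= 0
            && Arrays.compareUnsigned(target, maxTerm) <= 0;
    }

    /** Exact lookup that bails out early when the target is outside [minTerm, maxTerm]. */
    boolean seekExact(byte[] target) {
        if (!inRange(target)) {
            return false; // skip the expensive dictionary walk for this segment
        }
        // Stand-in for the real term dictionary traversal.
        return Arrays.binarySearch(sortedTerms, target, Arrays::compareUnsigned) >= 0;
    }
}
```

A term that falls inside the range but is absent (for example `id-120` in a segment holding `id-100`, `id-150` and `id-199`) still pays for the full lookup, so the check only helps when segment ranges rarely overlap.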


Legacy Jira details

LUCENE-8980 by Guoqiang Jiang on Sep 16 2019, resolved Sep 26 2019

mikemccand commented 4 years ago

Please help to take a look, thanks:)

[Legacy Jira: Guoqiang Jiang on Sep 19 2019]

mikemccand commented 4 years ago

Please edit this issue description to be about what your PR does and not about other stuff that is not in scope of what you are doing in this issue.  Feel free to file another issue for other stuff if you like.  FWIW I find it preferable for issue descriptions to be focused solution oriented and leave comments to add commentary, benchmarks, etc.

I plan to commit this nice improvement soon thereafter.  Unless you say otherwise, in the CHANGES.txt I'll list you as your name displays here in JIRA, and the commit author will be as you used in the PR.

[Legacy Jira: David Smiley (@dsmiley) on Sep 25 2019]

mikemccand commented 4 years ago

Tests

We ran a write benchmark with _id values in UUID V1 format. The write performance of Elasticsearch was as follows:

| Branch | Write speed after 4h | CPU cost | Overall improvement | Write speed after 8h | CPU cost | Overall improvement |
| --- | :---: | :---: | :---: | :---: | :---: | :---: |
| Original Lucene | 29.9w/s | 68.4% | N/A | 26.7w/s | 66.6% | N/A |
| Optimised Lucene | 34.5w/s (+15.4%) | 63.8% (-6.7%) | +22.1% | 31.5w/s (+18.0%) | 61.5% (-7.7%) | +25.7% |

As shown above, after 8 hours of continuous writing, write speed improves by 18.0%, CPU cost decreases by 7.7%, and overall performance improves by 25.7%. The search API of Elasticsearch will also benefit from this improvement.

Note that the benchmark needs to run continuously for several hours, because the performance improvement is not obvious when the data is completely cached or the number of segments is too small.
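Why the number of segments and the ordering of _id values matter can be illustrated with a toy model. The sketch below is an assumption-laden simplification: ids are plain numbers, `SegmentSkipDemo` and its methods are hypothetical names, and the min/max is recomputed on the fly instead of being read from stored segment metadata. It counts how many segments a lookup still has to search after the min/max check.

```java
import java.util.Random;

/** Toy model: each segment is just the array of numeric ids it holds. */
final class SegmentSkipDemo {

    /** Counts the segments whose [min, max] id range contains the target. */
    static int segmentsToSearch(long[][] segments, long target) {
        int count = 0;
        for (long[] ids : segments) {
            long min = Long.MAX_VALUE;
            long max = Long.MIN_VALUE;
            for (long id : ids) {
                min = Math.min(min, id);
                max = Math.max(max, id);
            }
            if (target >= min && target <= max) {
                count++; // the range check cannot rule this segment out
            }
        }
        return count;
    }

    /** Ids written in increasing order: each segment holds one contiguous block. */
    static long[][] sequentialSegments(int numSegments, int docsPerSegment) {
        long[][] segments = new long[numSegments][docsPerSegment];
        long next = 0;
        for (long[] ids : segments) {
            for (int i = 0; i < ids.length; i++) {
                ids[i] = next++;
            }
        }
        return segments;
    }

    /** Ids drawn at random from the whole id space, so each segment spans most of it. */
    static long[][] randomSegments(int numSegments, int docsPerSegment, long seed) {
        Random random = new Random(seed);
        int idSpace = numSegments * docsPerSegment;
        long[][] segments = new long[numSegments][docsPerSegment];
        for (long[] ids : segments) {
            for (int i = 0; i < ids.length; i++) {
                ids[i] = random.nextInt(idSpace);
            }
        }
        return segments;
    }

    public static void main(String[] args) {
        long target = 10_500;
        System.out.println("sequential ids: "
            + segmentsToSearch(sequentialSegments(20, 1000), target) + " of 20 segments");
        System.out.println("random ids:     "
            + segmentsToSearch(randomSegments(20, 1000, 42L), target) + " of 20 segments");
    }
}
```

With sequential ids only one of the 20 segments survives the check, whereas with random ids nearly every segment range covers the target. With a single segment there is nothing to skip, which matches the observation that the gain is hard to see when the index has few segments.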

[Legacy Jira: Guoqiang Jiang on Sep 26 2019]

mikemccand commented 4 years ago

We have done more performance testing using the luceneutil tool. The complete test results are here.

The luceneutil tool runs the wikimedium10k benchmark 20 times. The following table shows the result of the last run. Most of the tasks are basically stable, while PKLookup shows a performance improvement of 58.9%.

| Task | QPS baseline | StdDev | QPS my_modified_version | StdDev | Pct diff |
| --- | :---: | :---: | :---: | :---: | :---: |
| HighIntervalsOrdered | 303.36 | (12.5%) | 283.86 | (16.9%) | -6.4% (-31% - 26%) |
| MedPhrase | 404.26 | (12.3%) | 382.64 | (10.5%) | -5.3% (-25% - 19%) |
| LowTerm | 2302.28 | (8.7%) | 2180.74 | (11.8%) | -5.3% (-23% - 16%) |
| AndHighMed | 618.78 | (10.1%) | 586.61 | (11.8%) | -5.2% (-24% - 18%) |
| BrowseDayOfYearSSDVFacets | 1042.68 | (10.1%) | 992.82 | (10.7%) | -4.8% (-23% - 17%) |
| HighSpanNear | 263.62 | (12.9%) | 256.07 | (14.9%) | -2.9% (-27% - 28%) |
| Wildcard | 221.10 | (16.2%) | 215.32 | (11.9%) | -2.6% (-26% - 30%) |
| LowSpanNear | 656.60 | (7.9%) | 639.77 | (11.3%) | -2.6% (-20% - 18%) |
| Fuzzy1 | 135.61 | (9.1%) | 132.26 | (10.4%) | -2.5% (-20% - 18%) |
| AndHighHigh | 409.88 | (10.9%) | 399.79 | (12.6%) | -2.5% (-23% - 23%) |
| OrHighHigh | 318.45 | (12.9%) | 312.43 | (12.2%) | -1.9% (-23% - 26%) |
| AndHighLow | 937.17 | (10.2%) | 921.71 | (11.4%) | -1.6% (-21% - 22%) |
| LowPhrase | 385.06 | (12.3%) | 379.83 | (10.8%) | -1.4% (-21% - 24%) |
| IntNRQ | 618.69 | (14.1%) | 610.58 | (10.6%) | -1.3% (-22% - 27%) |
| HighTermMonthSort | 1178.14 | (9.5%) | 1164.48 | (12.6%) | -1.2% (-21% - 23%) |
| Fuzzy2 | 46.95 | (16.2%) | 46.57 | (15.6%) | -0.8% (-28% - 36%) |
| OrHighLow | 633.64 | (9.6%) | 629.21 | (9.9%) | -0.7% (-18% - 20%) |
| BrowseMonthSSDVFacets | 1157.34 | (12.1%) | 1155.63 | (13.5%) | -0.1% (-23% - 29%) |
| Prefix3 | 297.40 | (12.1%) | 298.16 | (12.7%) | 0.3% (-21% - 28%) |
| MedSpanNear | 434.56 | (10.0%) | 437.02 | (11.4%) | 0.6% (-19% - 24%) |
| MedTerm | 2158.68 | (8.8%) | 2177.67 | (11.1%) | 0.9% (-17% - 22%) |
| HighSloppyPhrase | 320.36 | (10.0%) | 323.46 | (14.6%) | 1.0% (-21% - 28%) |
| BrowseDateTaxoFacets | 2065.89 | (13.7%) | 2088.22 | (13.2%) | 1.1% (-22% - 32%) |
| Respell | 187.05 | (12.2%) | 189.48 | (10.1%) | 1.3% (-18% - 26%) |
| MedSloppyPhrase | 583.45 | (11.3%) | 592.32 | (9.9%) | 1.5% (-17% - 25%) |
| HighTerm | 1114.87 | (12.0%) | 1131.89 | (12.8%) | 1.5% (-20% - 29%) |
| HighTermDayOfYearSort | 408.17 | (13.1%) | 416.13 | (9.3%) | 1.9% (-18% - 27%) |
| BrowseDayOfYearTaxoFacets | 5460.05 | (8.5%) | 5591.96 | (8.0%) | 2.4% (-13% - 20%) |
| BrowseMonthTaxoFacets | 5490.18 | (8.0%) | 5654.03 | (9.3%) | 3.0% (-13% - 22%) |
| LowSloppyPhrase | 562.96 | (10.1%) | 583.91 | (9.5%) | 3.7% (-14% - 25%) |
| HighPhrase | 221.20 | (11.9%) | 229.85 | (12.2%) | 3.9% (-17% - 31%) |
| OrHighMed | 352.09 | (12.3%) | 369.39 | (9.4%) | 4.9% (-14% - 30%) |
| PKLookup | 85.19 | (18.1%) | 135.38 | (22.7%) | 58.9% (15% - 121%) |
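As a quick sanity check of the PKLookup row, the percentage can be recomputed from the two QPS values. This sketch assumes the Pct diff column is the relative change of the mean QPS against the baseline; `PctDiffSketch` and `pctDiff` are hypothetical names, not part of luceneutil.

```java
import java.util.Locale;

final class PctDiffSketch {
    /** Relative change of the candidate QPS against the baseline QPS, in percent. */
    static double pctDiff(double baselineQps, double candidateQps) {
        return (candidateQps - baselineQps) / baselineQps * 100.0;
    }

    public static void main(String[] args) {
        // QPS values taken from the PKLookup row of the wikimedium10k run.
        System.out.println(String.format(Locale.ROOT, "PKLookup: %.1f%%", pctDiff(85.19, 135.38)));
        // prints "PKLookup: 58.9%"
    }
}
```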

[Legacy Jira: Guoqiang Jiang on Sep 26 2019]

mikemccand commented 4 years ago

We ran another test case, wikimedium10m, to verify the improvement on a larger data set. The complete results are here. The following table shows the result of the last run:

| Task | QPS baseline | StdDev | QPS my_modified_version | StdDev | Pct diff |
| --- | :---: | :---: | :---: | :---: | :---: |
| OrHighNotLow | 293.93 | (5.8%) | 286.46 | (6.6%) | -2.5% (-14% - 10%) |
| OrHighNotHigh | 258.18 | (3.7%) | 252.41 | (5.0%) | -2.2% (-10% - 6%) |
| OrHighLow | 206.52 | (6.2%) | 202.55 | (6.2%) | -1.9% (-13% - 11%) |
| MedPhrase | 16.41 | (4.1%) | 16.12 | (2.6%) | -1.7% (-8% - 5%) |
| LowTerm | 608.71 | (5.7%) | 599.21 | (4.4%) | -1.6% (-10% - 9%) |
| Prefix3 | 37.96 | (2.8%) | 37.51 | (3.8%) | -1.2% (-7% - 5%) |
| OrNotHighHigh | 255.49 | (5.5%) | 252.63 | (6.1%) | -1.1% (-12% - 11%) |
| MedSloppyPhrase | 13.71 | (3.5%) | 13.58 | (3.7%) | -1.0% (-7% - 6%) |
| HighSloppyPhrase | 17.00 | (3.3%) | 16.84 | (3.7%) | -0.9% (-7% - 6%) |
| OrHighHigh | 19.02 | (2.6%) | 18.85 | (2.7%) | -0.9% (-6% - 4%) |
| MedTerm | 564.56 | (4.6%) | 559.38 | (2.9%) | -0.9% (-8% - 6%) |
| OrNotHighLow | 294.29 | (4.9%) | 291.86 | (4.2%) | -0.8% (-9% - 8%) |
| AndHighLow | 303.17 | (3.7%) | 300.72 | (4.5%) | -0.8% (-8% - 7%) |
| AndHighHigh | 28.24 | (2.1%) | 28.01 | (2.7%) | -0.8% (-5% - 4%) |
| Wildcard | 64.64 | (3.9%) | 64.21 | (4.0%) | -0.7% (-8% - 7%) |
| HighSpanNear | 15.14 | (2.8%) | 15.04 | (2.5%) | -0.7% (-5% - 4%) |
| HighTerm | 431.22 | (3.9%) | 428.68 | (2.9%) | -0.6% (-7% - 6%) |
| LowSloppyPhrase | 19.29 | (2.2%) | 19.18 | (2.9%) | -0.6% (-5% - 4%) |
| LowSpanNear | 64.32 | (2.3%) | 63.99 | (2.0%) | -0.5% (-4% - 3%) |
| Fuzzy2 | 34.51 | (12.8%) | 34.34 | (11.9%) | -0.5% (-22% - 27%) |
| MedSpanNear | 51.51 | (2.3%) | 51.28 | (1.6%) | -0.4% (-4% - 3%) |
| HighTermDayOfYearSort | 51.45 | (6.6%) | 51.24 | (7.5%) | -0.4% (-13% - 14%) |
| OrHighNotMed | 306.95 | (5.1%) | 306.03 | (3.2%) | -0.3% (-8% - 8%) |
| BrowseDateTaxoFacets | 1.48 | (0.6%) | 1.47 | (1.2%) | -0.2% (-1% - 1%) |
| BrowseMonthSSDVFacets | 6.15 | (1.1%) | 6.14 | (3.6%) | -0.2% (-4% - 4%) |
| HighPhrase | 186.86 | (6.2%) | 186.64 | (3.7%) | -0.1% (-9% - 10%) |
| Respell | 48.69 | (4.1%) | 48.65 | (4.0%) | -0.1% (-7% - 8%) |
| AndHighMed | 65.66 | (3.0%) | 65.74 | (3.2%) | 0.1% (-5% - 6%) |
| HighIntervalsOrdered | 6.68 | (1.5%) | 6.69 | (1.7%) | 0.1% (-3% - 3%) |
| LowPhrase | 219.11 | (5.7%) | 220.24 | (3.5%) | 0.5% (-8% - 10%) |
| OrHighMed | 68.05 | (4.5%) | 68.44 | (3.1%) | 0.6% (-6% - 8%) |
| OrNotHighMed | 272.89 | (5.7%) | 274.77 | (4.1%) | 0.7% (-8% - 11%) |
| IntNRQ | 37.58 | (23.8%) | 37.96 | (24.2%) | 1.0% (-37% - 64%) |
| BrowseDayOfYearSSDVFacets | 5.34 | (4.2%) | 5.40 | (2.9%) | 1.2% (-5% - 8%) |
| HighTermMonthSort | 34.82 | (11.7%) | 35.81 | (14.9%) | 2.9% (-21% - 33%) |
| BrowseMonthTaxoFacets | 4781.41 | (3.9%) | 4931.19 | (2.7%) | 3.1% (-3% - 10%) |
| Fuzzy1 | 35.98 | (9.7%) | 37.42 | (8.0%) | 4.0% (-12% - 23%) |
| BrowseDayOfYearTaxoFacets | 4688.64 | (3.6%) | 4878.52 | (3.6%) | 4.0% (-3% - 11%) |
| PKLookup | 72.93 | (4.7%) | 95.23 | (3.3%) | 30.6% (21% - 40%) |

[Legacy Jira: Guoqiang Jiang on Sep 26 2019]

mikemccand commented 4 years ago

Hi, @dsmiley, thanks for your suggestion. I have updated the description and comments.

Please help commit this improvement. Thanks again.

[Legacy Jira: Guoqiang Jiang on Sep 26 2019]

mikemccand commented 4 years ago

Commit 99f4cec459177caeb16644e4592d807d125c1613 in lucene-solr's branch refs/heads/master from johngqjiang https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=99f4cec

LUCENE-8980: Blocktree seekExact now checks min-max range of the segment

[Legacy Jira: ASF subversion and git services on Sep 26 2019]

mikemccand commented 4 years ago

Commit 4df2702cdbf2195ddc5e8623231d903f6908e693 in lucene-solr's branch refs/heads/branch_8x from johngqjiang https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=4df2702

LUCENE-8980: Blocktree seekExact now checks min-max range of the segment

(cherry picked from commit 99f4cec459177caeb16644e4592d807d125c1613)

[Legacy Jira: ASF subversion and git services on Sep 26 2019]

mikemccand commented 4 years ago

Thanks for contributing and your benchmarking!

[Legacy Jira: David Smiley (@dsmiley) on Sep 26 2019]