Optimise SegmentTermsEnum.seekExact performance [LUCENE-8980]

mikemccand commented 4 years ago

Description

In Elasticsearch, which is based on Lucene, each document has an indexed _id field that uniquely identifies it. When Elasticsearch uses the _id field to look up a document, Lucene has to check every segment of the index. When the _id values are largely sequential, this lookup can be optimised.

Solution

Since Lucene stores the min and max term for each segment and field, we can use them to speed up Lucene's lookup API. When SegmentTermsEnum.seekExact() is called to look up a term, we can first check whether the term falls within the range of minTerm and maxTerm, so that segments that cannot contain the term are skipped as early as possible. This improvement benefits the Elasticsearch read/write APIs and the Lucene lookup API.
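The range check described above can be sketched in a few lines. This is a minimal illustration only and not the code from the PR: MinMaxTermSkip, SegmentTermRange, mayContain and segmentsToSearch are hypothetical names, terms are modelled as UTF-8 byte arrays compared in unsigned byte order, and Java 9 or later is assumed.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration of min/max term pruning; not Lucene's SegmentTermsEnum. */
final class MinMaxTermSkip {

    /** Hypothetical stand-in for the min and max term kept per segment and field. */
    static final class SegmentTermRange {
        final byte[] minTerm;
        final byte[] maxTerm;

        SegmentTermRange(String minTerm, String maxTerm) {
            this.minTerm = utf8(minTerm);
            this.maxTerm = utf8(maxTerm);
        }

        /** True if target lies within [minTerm, maxTerm] in unsigned byte order. */
        boolean mayContain(byte[] target) {
            return Arrays.compareUnsigned(target, minTerm) >= 0
                && Arrays.compareUnsigned(target, maxTerm) <= 0;
        }
    }

    static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    /** Counts the segments whose term dictionary would still have to be consulted. */
    static int segmentsToSearch(List<SegmentTermRange> segments, String id) {
        byte[] target = utf8(id);
        int count = 0;
        for (SegmentTermRange segment : segments) {
            // Segments whose range excludes the target are skipped without a lookup.
            if (segment.mayContain(target)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Sequential ids give each segment a narrow, mostly disjoint term range.
        List<SegmentTermRange> segments = List.of(
            new SegmentTermRange("id-0000", "id-0999"),
            new SegmentTermRange("id-1000", "id-1999"),
            new SegmentTermRange("id-2000", "id-2999"));
        System.out.println(segmentsToSearch(segments, "id-1500"));
    }
}
```

With sequential ids the ranges rarely overlap, so most segments are rejected by two byte comparisons. With random ids the ranges of most segments would cover almost the whole key space and the check would rarely skip anything.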

Legacy Jira details

LUCENE-8980 by Guoqiang Jiang on Sep 16 2019, resolved Sep 26 2019

mikemccand commented 4 years ago

Please help to take a look, thanks:)

[Legacy Jira: Guoqiang Jiang on Sep 19 2019]

mikemccand commented 4 years ago

Please edit this issue description to be about what your PR does and not about other stuff that is not in scope of what you are doing in this issue. Feel free to file another issue for other stuff if you like. FWIW I find it preferable for issue descriptions to be focused and solution-oriented, and to leave comments for commentary, benchmarks, etc.

I plan to commit this nice improvement soon thereafter. Unless you say otherwise, in the CHANGES.txt I'll list you as your name displays here in JIRA, and the commit author will be as you used in the PR.

[Legacy Jira: David Smiley (@dsmiley) on Sep 25 2019]

mikemccand commented 4 years ago

Tests

We ran a write benchmark with _id values in UUID v1 format. The write performance of Elasticsearch was as follows:

Branch	Write speed after 4h	CPU cost after 4h	Overall improvement after 4h	Write speed after 8h	CPU cost after 8h	Overall improvement after 8h
Original Lucene	29.9w/s	68.4%	N/A	26.7w/s	66.6%	N/A
Optimised Lucene	34.5w/s (+15.4%)	63.8% (-6.7%)	+22.1%	31.5w/s (+18.0%)	61.5% (-7.7%)	+25.7%

As shown above, after 8 hours of continuous writing, write speed improves by 18.0%, CPU cost decreases by 7.7%, and overall performance improves by 25.7%. The Elasticsearch search API will also benefit from this improvement.
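The percentages in the table can be re-derived from the raw figures. The sketch below assumes that "overall improvement" is the write-speed gain plus the CPU-cost reduction, each rounded to one decimal place. That reading is inferred from the numbers (15.4 + 6.7 = 22.1 and 18.0 + 7.7 = 25.7) and is not stated in the thread; the class and method names are hypothetical.

```java
/** Hypothetical helper that re-derives the benchmark percentages from the table. */
final class BenchmarkDeltas {

    /** Relative change of candidate against baseline, in percent. */
    static double pctChange(double baseline, double candidate) {
        return (candidate - baseline) / baseline * 100.0;
    }

    /** Rounds to one decimal place, as the table does. */
    static double round1(double value) {
        return Math.round(value * 10.0) / 10.0;
    }

    /** Assumed definition: rounded speed gain plus rounded CPU saving. */
    static double overallGain(double speedBase, double speedNew,
                              double cpuBase, double cpuNew) {
        double speedGain = round1(pctChange(speedBase, speedNew));
        double cpuSaving = -round1(pctChange(cpuBase, cpuNew));
        return round1(speedGain + cpuSaving);
    }

    public static void main(String[] args) {
        // Figures after 4h and after 8h, copied from the table above.
        System.out.println(overallGain(29.9, 34.5, 68.4, 63.8));
        System.out.println(overallGain(26.7, 31.5, 66.6, 61.5));
    }
}
```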

Note that the benchmark needs to run continuously for several hours, because the improvement is not obvious when the data is completely cached or the number of segments is too small.

[Legacy Jira: Guoqiang Jiang on Sep 26 2019]

mikemccand commented 4 years ago

We have done more performance testing using the luceneutil tool. The complete test results are here.

The luceneutil tool ran the wikimedium10k benchmark 20 times. The following table shows the result of the last run. Most of the tasks are basically stable, while PKLookup shows a performance improvement of 58.9%.

Task	QPS baseline	StdDev	QPS my_modified_version	StdDev	Pct diff
HighIntervalsOrdered	303.36	(12.5%)	283.86	(16.9%)	-6.4%(-31% - 26%)
MedPhrase	404.26	(12.3%)	382.64	(10.5%)	-5.3%(-25% - 19%)
LowTerm	2302.28	(8.7%)	2180.74	(11.8%)	-5.3%(-23% - 16%)
AndHighMed	618.78	(10.1%)	586.61	(11.8%)	-5.2%(-24% - 18%)
BrowseDayOfYearSSDVFacets	1042.68	(10.1%)	992.82	(10.7%)	-4.8%(-23% - 17%)
HighSpanNear	263.62	(12.9%)	256.07	(14.9%)	-2.9%(-27% - 28%)
Wildcard	221.10	(16.2%)	215.32	(11.9%)	-2.6%(-26% - 30%)
LowSpanNear	656.60	(7.9%)	639.77	(11.3%)	-2.6%(-20% - 18%)
Fuzzy1	135.61	(9.1%)	132.26	(10.4%)	-2.5%(-20% - 18%)
AndHighHigh	409.88	(10.9%)	399.79	(12.6%)	-2.5%(-23% - 23%)
OrHighHigh	318.45	(12.9%)	312.43	(12.2%)	-1.9%(-23% - 26%)
AndHighLow	937.17	(10.2%)	921.71	(11.4%)	-1.6%(-21% - 22%)
LowPhrase	385.06	(12.3%)	379.83	(10.8%)	-1.4%(-21% - 24%)
IntNRQ	618.69	(14.1%)	610.58	(10.6%)	-1.3%(-22% - 27%)
HighTermMonthSort	1178.14	(9.5%)	1164.48	(12.6%)	-1.2%(-21% - 23%)
Fuzzy2	46.95	(16.2%)	46.57	(15.6%)	-0.8%(-28% - 36%)
OrHighLow	633.64	(9.6%)	629.21	(9.9%)	-0.7%(-18% - 20%)
BrowseMonthSSDVFacets	1157.34	(12.1%)	1155.63	(13.5%)	-0.1%(-23% - 29%)
Prefix3	297.40	(12.1%)	298.16	(12.7%)	0.3%(-21% - 28%)
MedSpanNear	434.56	(10.0%)	437.02	(11.4%)	0.6%(-19% - 24%)
MedTerm	2158.68	(8.8%)	2177.67	(11.1%)	0.9%(-17% - 22%)
HighSloppyPhrase	320.36	(10.0%)	323.46	(14.6%)	1.0%(-21% - 28%)
BrowseDateTaxoFacets	2065.89	(13.7%)	2088.22	(13.2%)	1.1%(-22% - 32%)
Respell	187.05	(12.2%)	189.48	(10.1%)	1.3%(-18% - 26%)
MedSloppyPhrase	583.45	(11.3%)	592.32	(9.9%)	1.5%(-17% - 25%)
HighTerm	1114.87	(12.0%)	1131.89	(12.8%)	1.5%(-20% - 29%)
HighTermDayOfYearSort	408.17	(13.1%)	416.13	(9.3%)	1.9%(-18% - 27%)
BrowseDayOfYearTaxoFacets	5460.05	(8.5%)	5591.96	(8.0%)	2.4%(-13% - 20%)
BrowseMonthTaxoFacets	5490.18	(8.0%)	5654.03	(9.3%)	3.0%(-13% - 22%)
LowSloppyPhrase	562.96	(10.1%)	583.91	(9.5%)	3.7%(-14% - 25%)
HighPhrase	221.20	(11.9%)	229.85	(12.2%)	3.9%(-17% - 31%)
OrHighMed	352.09	(12.3%)	369.39	(9.4%)	4.9%(-14% - 30%)
PKLookup	85.19	(18.1%)	135.38	(22.7%)	58.9%( 15% - 121%)

[Legacy Jira: Guoqiang Jiang on Sep 26 2019]

mikemccand commented 4 years ago

We ran another test, wikimedium10m, to verify the improvement on a large data set. The complete results are here. The following table shows the result of the last run:

Task	QPS baseline	StdDev	QPS my_modified_version	StdDev	Pct diff
OrHighNotLow	293.93	(5.8%)	286.46	(6.6%)	-2.5%(-14% - 10%)
OrHighNotHigh	258.18	(3.7%)	252.41	(5.0%)	-2.2%(-10% - 6%)
OrHighLow	206.52	(6.2%)	202.55	(6.2%)	-1.9%(-13% - 11%)
MedPhrase	16.41	(4.1%)	16.12	(2.6%)	-1.7%( -8% - 5%)
LowTerm	608.71	(5.7%)	599.21	(4.4%)	-1.6%(-10% - 9%)
Prefix3	37.96	(2.8%)	37.51	(3.8%)	-1.2%( -7% - 5%)
OrNotHighHigh	255.49	(5.5%)	252.63	(6.1%)	-1.1%(-12% - 11%)
MedSloppyPhrase	13.71	(3.5%)	13.58	(3.7%)	-1.0%( -7% - 6%)
HighSloppyPhrase	17.00	(3.3%)	16.84	(3.7%)	-0.9%( -7% - 6%)
OrHighHigh	19.02	(2.6%)	18.85	(2.7%)	-0.9%( -6% - 4%)
MedTerm	564.56	(4.6%)	559.38	(2.9%)	-0.9%( -8% - 6%)
OrNotHighLow	294.29	(4.9%)	291.86	(4.2%)	-0.8%( -9% - 8%)
AndHighLow	303.17	(3.7%)	300.72	(4.5%)	-0.8%( -8% - 7%)
AndHighHigh	28.24	(2.1%)	28.01	(2.7%)	-0.8%( -5% - 4%)
Wildcard	64.64	(3.9%)	64.21	(4.0%)	-0.7%( -8% - 7%)
HighSpanNear	15.14	(2.8%)	15.04	(2.5%)	-0.7%( -5% - 4%)
HighTerm	431.22	(3.9%)	428.68	(2.9%)	-0.6%( -7% - 6%)
LowSloppyPhrase	19.29	(2.2%)	19.18	(2.9%)	-0.6%( -5% - 4%)
LowSpanNear	64.32	(2.3%)	63.99	(2.0%)	-0.5%( -4% - 3%)
Fuzzy2	34.51	(12.8%)	34.34	(11.9%)	-0.5%(-22% - 27%)
MedSpanNear	51.51	(2.3%)	51.28	(1.6%)	-0.4%( -4% - 3%)
HighTermDayOfYearSort	51.45	(6.6%)	51.24	(7.5%)	-0.4%(-13% - 14%)
OrHighNotMed	306.95	(5.1%)	306.03	(3.2%)	-0.3%( -8% - 8%)
BrowseDateTaxoFacets	1.48	(0.6%)	1.47	(1.2%)	-0.2%( -1% - 1%)
BrowseMonthSSDVFacets	6.15	(1.1%)	6.14	(3.6%)	-0.2%( -4% - 4%)
HighPhrase	186.86	(6.2%)	186.64	(3.7%)	-0.1%( -9% - 10%)
Respell	48.69	(4.1%)	48.65	(4.0%)	-0.1%( -7% - 8%)
AndHighMed	65.66	(3.0%)	65.74	(3.2%)	0.1%( -5% - 6%)
HighIntervalsOrdered	6.68	(1.5%)	6.69	(1.7%)	0.1%( -3% - 3%)
LowPhrase	219.11	(5.7%)	220.24	(3.5%)	0.5%( -8% - 10%)
OrHighMed	68.05	(4.5%)	68.44	(3.1%)	0.6%( -6% - 8%)
OrNotHighMed	272.89	(5.7%)	274.77	(4.1%)	0.7%( -8% - 11%)
IntNRQ	37.58	(23.8%)	37.96	(24.2%)	1.0%(-37% - 64%)
BrowseDayOfYearSSDVFacets	5.34	(4.2%)	5.40	(2.9%)	1.2%( -5% - 8%)
HighTermMonthSort	34.82	(11.7%)	35.81	(14.9%)	2.9%(-21% - 33%)
BrowseMonthTaxoFacets	4781.41	(3.9%)	4931.19	(2.7%)	3.1%( -3% - 10%)
Fuzzy1	35.98	(9.7%)	37.42	(8.0%)	4.0%(-12% - 23%)
BrowseDayOfYearTaxoFacets	4688.64	(3.6%)	4878.52	(3.6%)	4.0%( -3% - 11%)
PKLookup	72.93	(4.7%)	95.23	(3.3%)	30.6%( 21% - 40%)

[Legacy Jira: Guoqiang Jiang on Sep 26 2019]

mikemccand commented 4 years ago

Hi, @dsmiley, thanks for your suggestion. I have updated the description and comments.

Please help to commit this improvement. Thanks again.

[Legacy Jira: Guoqiang Jiang on Sep 26 2019]

mikemccand commented 4 years ago

Commit 99f4cec459177caeb16644e4592d807d125c1613 in lucene-solr's branch refs/heads/master from johngqjiang https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=99f4cec

LUCENE-8980: Blocktree seekExact now checks min-max range of the segment

[Legacy Jira: ASF subversion and git services on Sep 26 2019]

mikemccand commented 4 years ago

Commit 4df2702cdbf2195ddc5e8623231d903f6908e693 in lucene-solr's branch refs/heads/branch_8x from johngqjiang https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=4df2702

LUCENE-8980: Blocktree seekExact now checks min-max range of the segment

(cherry picked from commit 99f4cec459177caeb16644e4592d807d125c1613)

[Legacy Jira: ASF subversion and git services on Sep 26 2019]

mikemccand commented 4 years ago

Thanks for contributing and your benchmarking!

[Legacy Jira: David Smiley (@dsmiley) on Sep 26 2019]
