Closed mikemccand closed 6 years ago
Here's a patch. It involves some minor refactoring of ExactPhraseScorer, so I'll benchmark things and make sure that it hasn't slowed anything down.
[Legacy Jira: Alan Woodward (@romseygeek) on Apr 12 2018]
Here's the benchmark. Looks like noise to me...
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff
Wildcard 1714.40 (4.9%) 1622.18 (6.1%) -5.4% ( -15% - 5%)
Fuzzy2 76.05 (23.2%) 72.75 (20.2%) -4.3% ( -38% - 50%)
HighPhrase 889.67 (6.3%) 862.68 (10.8%) -3.0% ( -18% - 15%)
HighTermDayOfYearSort 1068.15 (6.0%) 1037.75 (8.1%) -2.8% ( -15% - 11%)
OrHighLow 2594.37 (7.1%) 2524.49 (4.7%) -2.7% ( -13% - 9%)
MedSloppyPhrase 902.87 (3.4%) 885.25 (4.1%) -2.0% ( -9% - 5%)
HighTerm 4674.62 (6.2%) 4588.81 (7.0%) -1.8% ( -14% - 12%)
AndHighMed 1068.36 (4.8%) 1050.54 (5.4%) -1.7% ( -11% - 8%)
IntNRQ 1422.34 (3.8%) 1404.61 (3.7%) -1.2% ( -8% - 6%)
AndHighLow 2882.77 (8.0%) 2846.86 (8.7%) -1.2% ( -16% - 16%)
OrHighMed 1306.57 (7.5%) 1290.96 (5.4%) -1.2% ( -13% - 12%)
LowPhrase 1326.18 (4.7%) 1311.07 (4.4%) -1.1% ( -9% - 8%)
OrHighHigh 1073.53 (4.8%) 1062.65 (3.7%) -1.0% ( -9% - 7%)
LowSloppyPhrase 1018.38 (2.5%) 1009.19 (4.6%) -0.9% ( -7% - 6%)
Respell 366.92 (6.4%) 363.89 (9.0%) -0.8% ( -15% - 15%)
HighTermMonthSort 3319.32 (6.1%) 3307.84 (6.2%) -0.3% ( -11% - 12%)
AndHighHigh 1023.89 (5.3%) 1020.61 (5.4%) -0.3% ( -10% - 10%)
LowSpanNear 1309.70 (5.7%) 1305.85 (8.1%) -0.3% ( -13% - 14%)
HighSloppyPhrase 664.01 (3.7%) 663.67 (5.2%) -0.1% ( -8% - 9%)
PKLookup 374.03 (5.5%) 373.96 (6.3%) -0.0% ( -11% - 12%)
Fuzzy1 386.18 (4.1%) 388.21 (5.2%) 0.5% ( -8% - 10%)
MedSpanNear 1217.07 (4.4%) 1226.70 (5.7%) 0.8% ( -8% - 11%)
HighSpanNear 641.59 (18.0%) 648.32 (22.7%) 1.0% ( -33% - 50%)
MedPhrase 999.35 (3.1%) 1011.72 (5.7%) 1.2% ( -7% - 10%)
LowTerm 5747.36 (7.2%) 5835.87 (6.4%) 1.5% ( -11% - 16%)
MedTerm 5195.74 (5.2%) 5290.94 (6.4%) 1.8% ( -9% - 14%)
Prefix3 587.71 (8.6%) 601.34 (9.5%) 2.3% ( -14% - 22%)
[Legacy Jira: Alan Woodward (@romseygeek) on Apr 12 2018]
I don't think we should modify ExactPhraseScorer to expose positions. Scorers are already quite complex because of two-phase iteration or the skipping of low-scoring documents, I would prefer that we duplicate the logic between the scorer and the matches iterator rather than having a single class that handles both.
[Legacy Jira: Adrien Grand (@jpountz) on Apr 12 2018]
I'm not sure this makes things that much more complex? It essentially just splits the phraseFreq() method into two (reset() and nextMatch()), and then adds a method which provides a view of the Scorer as a MatchesIterator which uses the new methods directly rather than calling phraseFreq().
And duplicating the logic of ExactPhraseScorer won't be too hard, but SloppyPhraseScorer is 600 lines long...
[Legacy Jira: Alan Woodward (@romseygeek) on Apr 12 2018]
I don't like mixing scorers, which are about documents with matches which are about positions. This change is making the same class responsible for iterating over both documents and positions. Even if this might look ok today, I don't like that future refactorings about the scorer will have to worry about keeping this class working for positions or vice-versa. To be clear I'm not against sharing code between the scorer and the matches iterator, only having the same class doing both.
[Legacy Jira: Adrien Grand (@jpountz) on Apr 12 2018]
Here's an updated patch:
I'm running benchmarks now
[Legacy Jira: Alan Woodward (@romseygeek) on Apr 16 2018]
Looks like a small penalty on sloppy phrases, and a slightly less small boost on exact phrases. Or possibly just noise.
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff
HighSloppyPhrase 589.78 (5.4%) 573.54 (7.9%) -2.8% ( -15% - 11%)
OrHighHigh 1174.38 (8.6%) 1146.22 (7.5%) -2.4% ( -17% - 15%)
MedSloppyPhrase 1328.47 (4.3%) 1302.80 (5.2%) -1.9% ( -10% - 7%)
AndHighLow 3138.65 (8.5%) 3087.05 (7.2%) -1.6% ( -15% - 15%)
LowSpanNear 1962.66 (5.3%) 1931.60 (5.8%) -1.6% ( -12% - 10%)
Prefix3 1027.12 (7.8%) 1011.50 (8.0%) -1.5% ( -16% - 15%)
Wildcard 1842.34 (5.8%) 1821.58 (4.2%) -1.1% ( -10% - 9%)
PKLookup 392.44 (4.6%) 388.12 (4.6%) -1.1% ( -9% - 8%)
HighTermDayOfYearSort 1122.38 (6.2%) 1111.20 (7.3%) -1.0% ( -13% - 13%)
HighTerm 4343.88 (8.5%) 4316.70 (5.9%) -0.6% ( -13% - 14%)
IntNRQ 1319.13 (2.5%) 1313.00 (2.4%) -0.5% ( -5% - 4%)
OrHighLow 2157.05 (4.2%) 2148.60 (4.9%) -0.4% ( -9% - 9%)
HighTermMonthSort 3568.59 (5.9%) 3563.38 (5.7%) -0.1% ( -11% - 12%)
OrHighMed 1276.34 (11.4%) 1274.61 (11.2%) -0.1% ( -20% - 25%)
LowPhrase 1567.69 (4.7%) 1567.03 (5.5%) -0.0% ( -9% - 10%)
MedTerm 5682.98 (8.2%) 5685.03 (9.3%) 0.0% ( -16% - 19%)
AndHighHigh 1020.12 (4.6%) 1023.48 (4.7%) 0.3% ( -8% - 10%)
LowSloppyPhrase 885.26 (4.4%) 889.20 (5.2%) 0.4% ( -8% - 10%)
AndHighMed 1287.27 (6.0%) 1296.46 (5.0%) 0.7% ( -9% - 12%)
Fuzzy1 493.78 (4.4%) 497.65 (2.9%) 0.8% ( -6% - 8%)
Fuzzy2 83.87 (20.0%) 85.02 (18.4%) 1.4% ( -30% - 49%)
Respell 391.63 (4.6%) 397.30 (4.1%) 1.4% ( -6% - 10%)
LowTerm 6098.16 (6.0%) 6202.87 (5.4%) 1.7% ( -9% - 13%)
HighSpanNear 773.18 (10.9%) 786.87 (8.4%) 1.8% ( -15% - 23%)
MedSpanNear 937.52 (6.1%) 960.49 (4.2%) 2.4% ( -7% - 13%)
HighPhrase 1035.86 (3.8%) 1101.79 (4.9%) 6.4% ( -2% - 15%)
MedPhrase 997.89 (7.2%) 1068.68 (5.0%) 7.1% ( -4% - 20%)
[Legacy Jira: Alan Woodward (@romseygeek) on Apr 16 2018]
I like this approach better. I have some questions/notes:
MatchesIterator#term
feels wrong. Should remove that method or make it return a list of terms?PhraseMatcher#minFreq()
should actually be called maxFreq
, since it tries to give an upper bound of the frequency of this phrase?SloppyPhraseMatcher#minFreq
right? A single term position can be used for two phrase positions with slops? For instance I suspect that if you search for a phrase of two synonyms with a large slop, you end up with a phrase frequency that is larger than the frequency of any synonym? Maybe we should also make this a float rather than an int, it might have potential to be greater than an int?if (exposeOffset)
trick to remember the start offset.[Legacy Jira: Adrien Grand (@jpountz) on Apr 18 2018]
Here's an updated patch:
[Legacy Jira: Alan Woodward (@romseygeek) on Apr 20 2018]
Thanks for the update, I'll have another look.
I like the idea of changing it to return a BytesRef[] though, let's do that in a followup.
Can we change the API first? I wouldn't want one of our main queries to get a hacky implementation of this API, even temporarily.
[Legacy Jira: Adrien Grand (@jpountz) on Apr 20 2018]
Updated patch, now that MatchesIterator#term() is gone. I also changed maxFreq() to return a float, and added a comment to SloppyPhraseMatcher to explain how the max freq is calculated.
[Legacy Jira: Alan Woodward (@romseygeek) on Apr 24 2018]
I spoke to Adrien elsewhere and have added some comments to SloppyPhraseMatcher. I've also removed the exposeOffsets boolean as it wasn't protecting anything expensive, and was confusingly named.
[Legacy Jira: Alan Woodward (@romseygeek) on May 09 2018]
+1
[Legacy Jira: Adrien Grand (@jpountz) on May 09 2018]
Commit 6705632810f700e81ab04a9516ad42feb35aaf64 in lucene-solr's branch refs/heads/branch_7x from @romseygeek https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=6705632
LUCENE-8249: Implement Matches API for phrase queries
[Legacy Jira: ASF subversion and git services on May 10 2018]
Commit 7d9059ef13df70838da18161a91ef4328a3ce82f in lucene-solr's branch refs/heads/master from @romseygeek https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=7d9059e
LUCENE-8249: Implement Matches API for phrase queries
[Legacy Jira: ASF subversion and git services on May 10 2018]
ExactPhraseScorer can be rejigged fairly easily to expose a MatchesIterator
Legacy Jira details
LUCENE-8249 by Alan Woodward (@romseygeek) on Apr 12 2018, resolved May 10 2018 Attachments: LUCENE-8249.patch (versions: 5)