Add matches to exact PhraseQuery and MultiPhraseQuery [LUCENE-8249]

mikemccand commented 6 years ago

ExactPhraseScorer can be rejigged fairly easily to expose a MatchesIterator

Legacy Jira details

LUCENE-8249 by Alan Woodward (@romseygeek) on Apr 12 2018, resolved May 10 2018 Attachments: LUCENE-8249.patch (versions: 5)

mikemccand commented 6 years ago

Here's a patch. It involves some minor refactoring of ExactPhraseScorer, so I'll benchmark things and make sure that it hasn't slowed anything down.

[Legacy Jira: Alan Woodward (@romseygeek) on Apr 12 2018]

mikemccand commented 6 years ago

Here's the benchmark. Looks like noise to me...

 TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff
                Wildcard     1714.40      (4.9%)     1622.18      (6.1%)   -5.4% ( -15% -    5%)
                  Fuzzy2       76.05     (23.2%)       72.75     (20.2%)   -4.3% ( -38% -   50%)
              HighPhrase      889.67      (6.3%)      862.68     (10.8%)   -3.0% ( -18% -   15%)
   HighTermDayOfYearSort     1068.15      (6.0%)     1037.75      (8.1%)   -2.8% ( -15% -   11%)
               OrHighLow     2594.37      (7.1%)     2524.49      (4.7%)   -2.7% ( -13% -    9%)
         MedSloppyPhrase      902.87      (3.4%)      885.25      (4.1%)   -2.0% (  -9% -    5%)
                HighTerm     4674.62      (6.2%)     4588.81      (7.0%)   -1.8% ( -14% -   12%)
              AndHighMed     1068.36      (4.8%)     1050.54      (5.4%)   -1.7% ( -11% -    8%)
                  IntNRQ     1422.34      (3.8%)     1404.61      (3.7%)   -1.2% (  -8% -    6%)
              AndHighLow     2882.77      (8.0%)     2846.86      (8.7%)   -1.2% ( -16% -   16%)
               OrHighMed     1306.57      (7.5%)     1290.96      (5.4%)   -1.2% ( -13% -   12%)
               LowPhrase     1326.18      (4.7%)     1311.07      (4.4%)   -1.1% (  -9% -    8%)
              OrHighHigh     1073.53      (4.8%)     1062.65      (3.7%)   -1.0% (  -9% -    7%)
         LowSloppyPhrase     1018.38      (2.5%)     1009.19      (4.6%)   -0.9% (  -7% -    6%)
                 Respell      366.92      (6.4%)      363.89      (9.0%)   -0.8% ( -15% -   15%)
       HighTermMonthSort     3319.32      (6.1%)     3307.84      (6.2%)   -0.3% ( -11% -   12%)
             AndHighHigh     1023.89      (5.3%)     1020.61      (5.4%)   -0.3% ( -10% -   10%)
             LowSpanNear     1309.70      (5.7%)     1305.85      (8.1%)   -0.3% ( -13% -   14%)
        HighSloppyPhrase      664.01      (3.7%)      663.67      (5.2%)   -0.1% (  -8% -    9%)
                PKLookup      374.03      (5.5%)      373.96      (6.3%)   -0.0% ( -11% -   12%)
                  Fuzzy1      386.18      (4.1%)      388.21      (5.2%)    0.5% (  -8% -   10%)
             MedSpanNear     1217.07      (4.4%)     1226.70      (5.7%)    0.8% (  -8% -   11%)
            HighSpanNear      641.59     (18.0%)      648.32     (22.7%)    1.0% ( -33% -   50%)
               MedPhrase      999.35      (3.1%)     1011.72      (5.7%)    1.2% (  -7% -   10%)
                 LowTerm     5747.36      (7.2%)     5835.87      (6.4%)    1.5% ( -11% -   16%)
                 MedTerm     5195.74      (5.2%)     5290.94      (6.4%)    1.8% (  -9% -   14%)
                 Prefix3      587.71      (8.6%)      601.34      (9.5%)    2.3% ( -14% -   22%)

[Legacy Jira: Alan Woodward (@romseygeek) on Apr 12 2018]

mikemccand commented 6 years ago

I don't think we should modify ExactPhraseScorer to expose positions. Scorers are already quite complex because of two-phase iteration or the skipping of low-scoring documents, I would prefer that we duplicate the logic between the scorer and the matches iterator rather than having a single class that handles both.

[Legacy Jira: Adrien Grand (@jpountz) on Apr 12 2018]

mikemccand commented 6 years ago

I'm not sure this makes things that much more complex? It essentially just splits the phraseFreq() method into two (reset() and nextMatch()), and then adds a method which provides a view of the Scorer as a MatchesIterator which uses the new methods directly rather than calling phraseFreq().

And duplicating the logic of ExactPhraseScorer won't be too hard, but SloppyPhraseScorer is 600 lines long...

[Legacy Jira: Alan Woodward (@romseygeek) on Apr 12 2018]

mikemccand commented 6 years ago

I don't like mixing scorers, which are about documents with matches which are about positions. This change is making the same class responsible for iterating over both documents and positions. Even if this might look ok today, I don't like that future refactorings about the scorer will have to worry about keeping this class working for positions or vice-versa. To be clear I'm not against sharing code between the scorer and the matches iterator, only having the same class doing both.

[Legacy Jira: Adrien Grand (@jpountz) on Apr 12 2018]

mikemccand commented 6 years ago

Here's an updated patch:

matching is moved out to a new abstract class PhraseMatcher, with Exact and Sloppy implementations
PhraseWeight is abstracted from PhraseQuery and MultiPhraseQuery, and handles explanations, matches and scorers (this has the nice side-effect of adding min-score handling to sloppy phrases)

I'm running benchmarks now

[Legacy Jira: Alan Woodward (@romseygeek) on Apr 16 2018]

mikemccand commented 6 years ago

Looks like a small penalty on sloppy phrases, and a slightly less small boost on exact phrases. Or possibly just noise.

TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff
        HighSloppyPhrase      589.78      (5.4%)      573.54      (7.9%)   -2.8% ( -15% -   11%)
              OrHighHigh     1174.38      (8.6%)     1146.22      (7.5%)   -2.4% ( -17% -   15%)
         MedSloppyPhrase     1328.47      (4.3%)     1302.80      (5.2%)   -1.9% ( -10% -    7%)
              AndHighLow     3138.65      (8.5%)     3087.05      (7.2%)   -1.6% ( -15% -   15%)
             LowSpanNear     1962.66      (5.3%)     1931.60      (5.8%)   -1.6% ( -12% -   10%)
                 Prefix3     1027.12      (7.8%)     1011.50      (8.0%)   -1.5% ( -16% -   15%)
                Wildcard     1842.34      (5.8%)     1821.58      (4.2%)   -1.1% ( -10% -    9%)
                PKLookup      392.44      (4.6%)      388.12      (4.6%)   -1.1% (  -9% -    8%)
   HighTermDayOfYearSort     1122.38      (6.2%)     1111.20      (7.3%)   -1.0% ( -13% -   13%)
                HighTerm     4343.88      (8.5%)     4316.70      (5.9%)   -0.6% ( -13% -   14%)
                  IntNRQ     1319.13      (2.5%)     1313.00      (2.4%)   -0.5% (  -5% -    4%)
               OrHighLow     2157.05      (4.2%)     2148.60      (4.9%)   -0.4% (  -9% -    9%)
       HighTermMonthSort     3568.59      (5.9%)     3563.38      (5.7%)   -0.1% ( -11% -   12%)
               OrHighMed     1276.34     (11.4%)     1274.61     (11.2%)   -0.1% ( -20% -   25%)
               LowPhrase     1567.69      (4.7%)     1567.03      (5.5%)   -0.0% (  -9% -   10%)
                 MedTerm     5682.98      (8.2%)     5685.03      (9.3%)    0.0% ( -16% -   19%)
             AndHighHigh     1020.12      (4.6%)     1023.48      (4.7%)    0.3% (  -8% -   10%)
         LowSloppyPhrase      885.26      (4.4%)      889.20      (5.2%)    0.4% (  -8% -   10%)
              AndHighMed     1287.27      (6.0%)     1296.46      (5.0%)    0.7% (  -9% -   12%)
                  Fuzzy1      493.78      (4.4%)      497.65      (2.9%)    0.8% (  -6% -    8%)
                  Fuzzy2       83.87     (20.0%)       85.02     (18.4%)    1.4% ( -30% -   49%)
                 Respell      391.63      (4.6%)      397.30      (4.1%)    1.4% (  -6% -   10%)
                 LowTerm     6098.16      (6.0%)     6202.87      (5.4%)    1.7% (  -9% -   13%)
            HighSpanNear      773.18     (10.9%)      786.87      (8.4%)    1.8% ( -15% -   23%)
             MedSpanNear      937.52      (6.1%)      960.49      (4.2%)    2.4% (  -7% -   13%)
              HighPhrase     1035.86      (3.8%)     1101.79      (4.9%)    6.4% (  -2% -   15%)
               MedPhrase      997.89      (7.2%)     1068.68      (5.0%)    7.1% (  -4% -   20%)

[Legacy Jira: Alan Woodward (@romseygeek) on Apr 16 2018]

mikemccand commented 6 years ago

I like this approach better. I have some questions/notes:

The implementation of MatchesIterator#term feels wrong. Should remove that method or make it return a list of terms?
I think PhraseMatcher#minFreq() should actually be called maxFreq, since it tries to give an upper bound of the frequency of this phrase?
Is the implementation of SloppyPhraseMatcher#minFreq right? A single term position can be used for two phrase positions with slops? For instance I suspect that if you search for a phrase of two synonyms with a large slop, you end up with a phrase frequency that is larger than the frequency of any synonym? Maybe we should also make this a float rather than an int, it might have potential to be greater than an int?
I'm wondering whether we can simplify ExactPhraseMatcher to not have its lead iterator being one position ahead all the time? This would eg. allow to remove the if (exposeOffset) trick to remember the start offset.
Should we try to implement freq() on top of the matchers? It's a bit weird that it consumes the iterator when called.

[Legacy Jira: Adrien Grand (@jpountz) on Apr 18 2018]

mikemccand commented 6 years ago

Here's an updated patch:

minFreq() is now maxFreq()
I fixed SloppyPhraseMatcher#maxFreq to be the sum of the freqs of each child postings, on the grounds that each term position can be the first term in at most one match. +1 on changing to a float, I'd missed that comment before uploading the new patch, will change it now.
ExactPhraseMatcher now advances its lead in sync with everything else, which simplifies things a lot.
I removed freq(), and replaced it with sloppyWeight() which returns the contribution of the current match to the total sloppy freq - suggestions for a better name are welcomed...
I originally put in MatchesIterator#term so that highlighters could do things like keep track of the number of specific terms in a fragment, for scoring. I like the idea of changing it to return a BytesRef[] though, let's do that in a followup.

[Legacy Jira: Alan Woodward (@romseygeek) on Apr 20 2018]

mikemccand commented 6 years ago

Thanks for the update, I'll have another look.

I like the idea of changing it to return a BytesRef[] though, let's do that in a followup.

Can we change the API first? I wouldn't want one of our main queries to get a hacky implementation of this API, even temporarily.

[Legacy Jira: Adrien Grand (@jpountz) on Apr 20 2018]

mikemccand commented 6 years ago

Updated patch, now that MatchesIterator#term() is gone. I also changed maxFreq() to return a float, and added a comment to SloppyPhraseMatcher to explain how the max freq is calculated.

[Legacy Jira: Alan Woodward (@romseygeek) on Apr 24 2018]

mikemccand commented 6 years ago

I spoke to Adrien elsewhere and have added some comments to SloppyPhraseMatcher. I've also removed the exposeOffsets boolean as it wasn't protecting anything expensive, and was confusingly named.

[Legacy Jira: Alan Woodward (@romseygeek) on May 09 2018]

mikemccand commented 6 years ago

+1

[Legacy Jira: Adrien Grand (@jpountz) on May 09 2018]

mikemccand commented 6 years ago

Commit 6705632810f700e81ab04a9516ad42feb35aaf64 in lucene-solr's branch refs/heads/branch_7x from @romseygeek https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=6705632

LUCENE-8249: Implement Matches API for phrase queries

[Legacy Jira: ASF subversion and git services on May 10 2018]

mikemccand commented 6 years ago

Commit 7d9059ef13df70838da18161a91ef4328a3ce82f in lucene-solr's branch refs/heads/master from @romseygeek https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=7d9059e

LUCENE-8249: Implement Matches API for phrase queries

[Legacy Jira: ASF subversion and git services on May 10 2018]

mikemccand / stargazers-migration-test

Add matches to exact PhraseQuery and MultiPhraseQuery [LUCENE-8249] #249

Legacy Jira details