mikemccand / stargazers-migration-test

Testing Lucene's Jira -> GitHub issues migration
0 stars 0 forks source link

Add matches to exact PhraseQuery and MultiPhraseQuery [LUCENE-8249] #249

Closed mikemccand closed 6 years ago

mikemccand commented 6 years ago

ExactPhraseScorer can be rejigged fairly easily to expose a MatchesIterator


Legacy Jira details

LUCENE-8249 by Alan Woodward (@romseygeek) on Apr 12 2018, resolved May 10 2018 Attachments: LUCENE-8249.patch (versions: 5)

mikemccand commented 6 years ago

Here's a patch. It involves some minor refactoring of ExactPhraseScorer, so I'll benchmark things and make sure that it hasn't slowed anything down.

[Legacy Jira: Alan Woodward (@romseygeek) on Apr 12 2018]

mikemccand commented 6 years ago

Here's the benchmark. Looks like noise to me...

 TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff
                Wildcard     1714.40      (4.9%)     1622.18      (6.1%)   -5.4% ( -15% -    5%)
                  Fuzzy2       76.05     (23.2%)       72.75     (20.2%)   -4.3% ( -38% -   50%)
              HighPhrase      889.67      (6.3%)      862.68     (10.8%)   -3.0% ( -18% -   15%)
   HighTermDayOfYearSort     1068.15      (6.0%)     1037.75      (8.1%)   -2.8% ( -15% -   11%)
               OrHighLow     2594.37      (7.1%)     2524.49      (4.7%)   -2.7% ( -13% -    9%)
         MedSloppyPhrase      902.87      (3.4%)      885.25      (4.1%)   -2.0% (  -9% -    5%)
                HighTerm     4674.62      (6.2%)     4588.81      (7.0%)   -1.8% ( -14% -   12%)
              AndHighMed     1068.36      (4.8%)     1050.54      (5.4%)   -1.7% ( -11% -    8%)
                  IntNRQ     1422.34      (3.8%)     1404.61      (3.7%)   -1.2% (  -8% -    6%)
              AndHighLow     2882.77      (8.0%)     2846.86      (8.7%)   -1.2% ( -16% -   16%)
               OrHighMed     1306.57      (7.5%)     1290.96      (5.4%)   -1.2% ( -13% -   12%)
               LowPhrase     1326.18      (4.7%)     1311.07      (4.4%)   -1.1% (  -9% -    8%)
              OrHighHigh     1073.53      (4.8%)     1062.65      (3.7%)   -1.0% (  -9% -    7%)
         LowSloppyPhrase     1018.38      (2.5%)     1009.19      (4.6%)   -0.9% (  -7% -    6%)
                 Respell      366.92      (6.4%)      363.89      (9.0%)   -0.8% ( -15% -   15%)
       HighTermMonthSort     3319.32      (6.1%)     3307.84      (6.2%)   -0.3% ( -11% -   12%)
             AndHighHigh     1023.89      (5.3%)     1020.61      (5.4%)   -0.3% ( -10% -   10%)
             LowSpanNear     1309.70      (5.7%)     1305.85      (8.1%)   -0.3% ( -13% -   14%)
        HighSloppyPhrase      664.01      (3.7%)      663.67      (5.2%)   -0.1% (  -8% -    9%)
                PKLookup      374.03      (5.5%)      373.96      (6.3%)   -0.0% ( -11% -   12%)
                  Fuzzy1      386.18      (4.1%)      388.21      (5.2%)    0.5% (  -8% -   10%)
             MedSpanNear     1217.07      (4.4%)     1226.70      (5.7%)    0.8% (  -8% -   11%)
            HighSpanNear      641.59     (18.0%)      648.32     (22.7%)    1.0% ( -33% -   50%)
               MedPhrase      999.35      (3.1%)     1011.72      (5.7%)    1.2% (  -7% -   10%)
                 LowTerm     5747.36      (7.2%)     5835.87      (6.4%)    1.5% ( -11% -   16%)
                 MedTerm     5195.74      (5.2%)     5290.94      (6.4%)    1.8% (  -9% -   14%)
                 Prefix3      587.71      (8.6%)      601.34      (9.5%)    2.3% ( -14% -   22%)

[Legacy Jira: Alan Woodward (@romseygeek) on Apr 12 2018]

mikemccand commented 6 years ago

I don't think we should modify ExactPhraseScorer to expose positions. Scorers are already quite complex because of two-phase iteration or the skipping of low-scoring documents, I would prefer that we duplicate the logic between the scorer and the matches iterator rather than having a single class that handles both.

[Legacy Jira: Adrien Grand (@jpountz) on Apr 12 2018]

mikemccand commented 6 years ago

I'm not sure this makes things that much more complex? It essentially just splits the phraseFreq() method into two (reset() and nextMatch()), and then adds a method which provides a view of the Scorer as a MatchesIterator which uses the new methods directly rather than calling phraseFreq().

And duplicating the logic of ExactPhraseScorer won't be too hard, but SloppyPhraseScorer is 600 lines long...

[Legacy Jira: Alan Woodward (@romseygeek) on Apr 12 2018]

mikemccand commented 6 years ago

I don't like mixing scorers, which are about documents with matches which are about positions. This change is making the same class responsible for iterating over both documents and positions. Even if this might look ok today, I don't like that future refactorings about the scorer will have to worry about keeping this class working for positions or vice-versa. To be clear I'm not against sharing code between the scorer and the matches iterator, only having the same class doing both.

[Legacy Jira: Adrien Grand (@jpountz) on Apr 12 2018]

mikemccand commented 6 years ago

Here's an updated patch:

I'm running benchmarks now

[Legacy Jira: Alan Woodward (@romseygeek) on Apr 16 2018]

mikemccand commented 6 years ago

Looks like a small penalty on sloppy phrases, and a slightly less small boost on exact phrases. Or possibly just noise.

TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff
        HighSloppyPhrase      589.78      (5.4%)      573.54      (7.9%)   -2.8% ( -15% -   11%)
              OrHighHigh     1174.38      (8.6%)     1146.22      (7.5%)   -2.4% ( -17% -   15%)
         MedSloppyPhrase     1328.47      (4.3%)     1302.80      (5.2%)   -1.9% ( -10% -    7%)
              AndHighLow     3138.65      (8.5%)     3087.05      (7.2%)   -1.6% ( -15% -   15%)
             LowSpanNear     1962.66      (5.3%)     1931.60      (5.8%)   -1.6% ( -12% -   10%)
                 Prefix3     1027.12      (7.8%)     1011.50      (8.0%)   -1.5% ( -16% -   15%)
                Wildcard     1842.34      (5.8%)     1821.58      (4.2%)   -1.1% ( -10% -    9%)
                PKLookup      392.44      (4.6%)      388.12      (4.6%)   -1.1% (  -9% -    8%)
   HighTermDayOfYearSort     1122.38      (6.2%)     1111.20      (7.3%)   -1.0% ( -13% -   13%)
                HighTerm     4343.88      (8.5%)     4316.70      (5.9%)   -0.6% ( -13% -   14%)
                  IntNRQ     1319.13      (2.5%)     1313.00      (2.4%)   -0.5% (  -5% -    4%)
               OrHighLow     2157.05      (4.2%)     2148.60      (4.9%)   -0.4% (  -9% -    9%)
       HighTermMonthSort     3568.59      (5.9%)     3563.38      (5.7%)   -0.1% ( -11% -   12%)
               OrHighMed     1276.34     (11.4%)     1274.61     (11.2%)   -0.1% ( -20% -   25%)
               LowPhrase     1567.69      (4.7%)     1567.03      (5.5%)   -0.0% (  -9% -   10%)
                 MedTerm     5682.98      (8.2%)     5685.03      (9.3%)    0.0% ( -16% -   19%)
             AndHighHigh     1020.12      (4.6%)     1023.48      (4.7%)    0.3% (  -8% -   10%)
         LowSloppyPhrase      885.26      (4.4%)      889.20      (5.2%)    0.4% (  -8% -   10%)
              AndHighMed     1287.27      (6.0%)     1296.46      (5.0%)    0.7% (  -9% -   12%)
                  Fuzzy1      493.78      (4.4%)      497.65      (2.9%)    0.8% (  -6% -    8%)
                  Fuzzy2       83.87     (20.0%)       85.02     (18.4%)    1.4% ( -30% -   49%)
                 Respell      391.63      (4.6%)      397.30      (4.1%)    1.4% (  -6% -   10%)
                 LowTerm     6098.16      (6.0%)     6202.87      (5.4%)    1.7% (  -9% -   13%)
            HighSpanNear      773.18     (10.9%)      786.87      (8.4%)    1.8% ( -15% -   23%)
             MedSpanNear      937.52      (6.1%)      960.49      (4.2%)    2.4% (  -7% -   13%)
              HighPhrase     1035.86      (3.8%)     1101.79      (4.9%)    6.4% (  -2% -   15%)
               MedPhrase      997.89      (7.2%)     1068.68      (5.0%)    7.1% (  -4% -   20%)

[Legacy Jira: Alan Woodward (@romseygeek) on Apr 16 2018]

mikemccand commented 6 years ago

I like this approach better. I have some questions/notes:

[Legacy Jira: Adrien Grand (@jpountz) on Apr 18 2018]

mikemccand commented 6 years ago

Here's an updated patch:

[Legacy Jira: Alan Woodward (@romseygeek) on Apr 20 2018]

mikemccand commented 6 years ago

Thanks for the update, I'll have another look.

I like the idea of changing it to return a BytesRef[] though, let's do that in a followup.

Can we change the API first? I wouldn't want one of our main queries to get a hacky implementation of this API, even temporarily.

[Legacy Jira: Adrien Grand (@jpountz) on Apr 20 2018]

mikemccand commented 6 years ago

Updated patch, now that MatchesIterator#term() is gone. I also changed maxFreq() to return a float, and added a comment to SloppyPhraseMatcher to explain how the max freq is calculated.

[Legacy Jira: Alan Woodward (@romseygeek) on Apr 24 2018]

mikemccand commented 6 years ago

I spoke to Adrien elsewhere and have added some comments to SloppyPhraseMatcher. I've also removed the exposeOffsets boolean as it wasn't protecting anything expensive, and was confusingly named.

[Legacy Jira: Alan Woodward (@romseygeek) on May 09 2018]

mikemccand commented 6 years ago

+1

[Legacy Jira: Adrien Grand (@jpountz) on May 09 2018]

mikemccand commented 6 years ago

Commit 6705632810f700e81ab04a9516ad42feb35aaf64 in lucene-solr's branch refs/heads/branch_7x from @romseygeek https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=6705632

LUCENE-8249: Implement Matches API for phrase queries

[Legacy Jira: ASF subversion and git services on May 10 2018]

mikemccand commented 6 years ago

Commit 7d9059ef13df70838da18161a91ef4328a3ce82f in lucene-solr's branch refs/heads/master from @romseygeek https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=7d9059e

LUCENE-8249: Implement Matches API for phrase queries

[Legacy Jira: ASF subversion and git services on May 10 2018]