Add tasks for multiple negated keywords and its optimized version

mikemccand / luceneutil

Various utility scripts for running Lucene performance tests

Apache License 2.0

203 stars 114 forks source link

Add tasks for multiple negated keywords and its optimized version #258

Open shubhamvishu opened 8 months ago

shubhamvishu commented 8 months ago

Added 2 new tasks for multi keyword negated queries for performance comparison.

OrNegatedHighHigh : (-A -B -C -D)
OrNegatedHighHighOpt : -(A B C D)

UPDATE : See in the comments below for new updated results

Benchmark results :

                         TaskQPS          baseline    StdDevQPS my_modified_version      StdDev        Pct diff p-value
               OrNegatedHighHigh     3700.39      (4.5%)     3704.18      (4.6%)    0.1% (  -8% -    9%) 0.943
            OrNegatedHighHighOpt     8036.98      (7.8%)     8133.66      (6.6%)    1.2% ( -12% -   16%) 0.598

Observation : OrNegatedHighHighOpt is more than 100% faster than OrNegatedHighHigh or conversely, the current implementation is >50% slower.

Full results :

                         TaskQPS          baseline    StdDevQPS my_modified_version      StdDev        Pct diff p-value
            BrowseDateSSDVFacets        1.00     (10.3%)        0.98      (5.9%)   -2.3% ( -16% -   15%) 0.385
                      AndHighLow      606.52      (2.5%)      601.62      (3.2%)   -0.8% (  -6% -    5%) 0.373
           BrowseMonthTaxoFacets        8.06     (32.7%)        8.00     (32.1%)   -0.8% ( -49% -   95%) 0.940
                         Respell       34.40      (1.7%)       34.26      (1.9%)   -0.4% (  -3% -    3%) 0.494
                      HighPhrase      141.76      (4.1%)      141.25      (4.5%)   -0.4% (  -8% -    8%) 0.788
                      AndHighMed       43.20      (2.3%)       43.07      (2.4%)   -0.3% (  -4% -    4%) 0.682
                       OrHighMed       82.32      (3.1%)       82.09      (3.2%)   -0.3% (  -6% -    6%) 0.772
                       OrHighLow      419.44      (2.5%)      418.32      (2.5%)   -0.3% (  -5% -    4%) 0.733
                       LowPhrase       17.50      (3.0%)       17.45      (2.7%)   -0.3% (  -5% -    5%) 0.773
        AndHighHighDayTaxoFacets        6.50      (2.4%)        6.49      (2.1%)   -0.2% (  -4% -    4%) 0.738
         AndHighMedDayTaxoFacets       19.27      (1.8%)       19.24      (1.4%)   -0.2% (  -3% -    3%) 0.699
                     AndHighHigh       28.91      (4.2%)       28.88      (4.4%)   -0.1% (  -8% -    8%) 0.946
                      TermDTSort       97.46      (0.8%)       97.39      (0.9%)   -0.1% (  -1% -    1%) 0.785
                          Fuzzy1       60.18      (0.9%)       60.14      (0.9%)   -0.1% (  -1% -    1%) 0.856
                    HighSpanNear        4.07      (2.1%)        4.07      (1.9%)   -0.1% (  -3% -    4%) 0.936
           HighTermDayOfYearSort      209.79      (2.4%)      209.69      (2.9%)   -0.0% (  -5% -    5%) 0.954
                        PKLookup      119.60      (2.3%)      119.60      (1.5%)   -0.0% (  -3% -    3%) 0.995
                     MedSpanNear        8.10      (2.7%)        8.10      (2.6%)    0.0% (  -5% -    5%) 0.984
                      OrHighHigh       28.27      (7.0%)       28.28      (7.0%)    0.0% ( -13% -   15%) 0.990
     BrowseRandomLabelTaxoFacets        3.25      (0.4%)        3.25      (0.6%)    0.0% (   0% -    0%) 0.836
           BrowseMonthSSDVFacets        3.25      (1.2%)        3.26      (1.0%)    0.0% (  -2% -    2%) 0.923
                          IntNRQ       42.02      (0.8%)       42.04      (0.7%)    0.0% (  -1% -    1%) 0.865
            BrowseDateTaxoFacets        3.85      (0.4%)        3.85      (0.4%)    0.1% (   0% -    0%) 0.470
       BrowseDayOfYearTaxoFacets        3.91      (0.5%)        3.91      (0.4%)    0.1% (   0% -    1%) 0.518
               OrNegatedHighHigh     3700.39      (4.5%)     3704.18      (4.6%)    0.1% (  -8% -    9%) 0.943
                       MedPhrase       27.70      (3.0%)       27.73      (3.1%)    0.1% (  -5% -    6%) 0.901
                HighSloppyPhrase        4.31      (2.8%)        4.32      (2.6%)    0.1% (  -5% -    5%) 0.881
                         Prefix3      164.53      (1.6%)      164.76      (1.4%)    0.1% (  -2% -    3%) 0.772
                     LowSpanNear       23.95      (3.7%)       23.99      (3.7%)    0.2% (  -6% -    7%) 0.890
            HighTermTitleBDVSort        7.10      (2.9%)        7.11      (2.5%)    0.2% (  -5% -    5%) 0.845
                        Wildcard       57.24      (2.5%)       57.35      (3.2%)    0.2% (  -5% -    6%) 0.823
                          Fuzzy2       27.23      (1.7%)       27.29      (1.5%)    0.2% (  -2% -    3%) 0.654
            HighIntervalsOrdered        4.08      (2.4%)        4.09      (3.0%)    0.3% (  -4% -    5%) 0.735
                 LowSloppyPhrase       50.72      (2.9%)       50.88      (2.4%)    0.3% (  -4% -    5%) 0.721
       BrowseDayOfYearSSDVFacets        3.17      (1.5%)        3.18      (0.9%)    0.3% (  -2% -    2%) 0.422
             MedIntervalsOrdered        3.05      (1.7%)        3.06      (2.0%)    0.3% (  -3% -    4%) 0.586
               HighTermTitleSort      134.59      (1.9%)      135.05      (2.4%)    0.3% (  -3% -    4%) 0.625
                 MedSloppyPhrase        5.20      (3.1%)        5.22      (2.7%)    0.4% (  -5% -    6%) 0.676
                    OrNotHighLow      390.59      (2.0%)      392.65      (2.0%)    0.5% (  -3% -    4%) 0.410
            MedTermDayTaxoFacets       17.31      (3.5%)       17.41      (3.5%)    0.6% (  -6% -    7%) 0.617
             LowIntervalsOrdered        1.92      (3.0%)        1.93      (3.6%)    0.6% (  -5% -    7%) 0.599
          OrHighMedDayTaxoFacets        3.02      (4.4%)        3.04      (4.4%)    0.6% (  -7% -    9%) 0.658
                    OrNotHighMed      232.65      (2.1%)      234.52      (3.2%)    0.8% (  -4% -    6%) 0.348
                   OrHighNotHigh      272.03      (3.0%)      274.37      (4.0%)    0.9% (  -5% -    8%) 0.442
     BrowseRandomLabelSSDVFacets        2.29      (1.0%)        2.31      (5.1%)    1.0% (  -5% -    7%) 0.387
                         LowTerm      317.60      (2.9%)      320.85      (3.0%)    1.0% (  -4% -    7%) 0.279
                        HighTerm      268.36      (5.3%)      271.18      (5.1%)    1.0% (  -8% -   12%) 0.524
               HighTermMonthSort     2405.80      (3.7%)     2432.43      (4.0%)    1.1% (  -6% -    9%) 0.366
                   OrNotHighHigh      442.00      (3.2%)      447.16      (3.5%)    1.2% (  -5% -    8%) 0.268
            OrNegatedHighHighOpt     8036.98      (7.8%)     8133.66      (6.6%)    1.2% ( -12% -   16%) 0.598
                         MedTerm      389.70      (4.3%)      394.98      (4.3%)    1.4% (  -6% -   10%) 0.321
                    OrHighNotLow      265.18      (3.1%)      268.92      (5.0%)    1.4% (  -6% -    9%) 0.281
                    OrHighNotMed      260.05      (3.6%)      265.02      (4.4%)    1.9% (  -5% -   10%) 0.132

I'm working on fixing this with BQ.rewrite. Will open a issue/PR in Lucene once I have something concrete or I'll create one straight away. We can use the added tasks in this PR to test if the rewrite change is making the performance of queries with many negated keywords comparable to its optimized version.

@mikemccand : If this change looks good, maybe we could also add negated query tasks to other tasks files other than for wikimediumall?

jpountz commented 8 months ago

I'm unclear of your goal here. Since these queries only have negated clauses, they do not match anything. Are you focusing on the rewrite() cost?

shubhamvishu commented 8 months ago

I'm unclear of your goal here. Since these queries only have negated clauses, they do not match anything. Are you focusing on the rewrite() cost?

One of our engineers in Amazon Product Search found that replacing should clause having disjunctive negated keywords eg: (-A -B -C -D) in BQ as (-(A B C D)) is much faster. I ran these benchmarks to confirm that finding and see if we can optimize the rewrite for disjunctive negated keywords in SHOULD clauses(?). Let me know if it makes sense or if I'm missing something maybe ?

Since these queries only have negated clauses, they do not match anything.

@jpountz Do you mean to update the queries to something like (-field:some *:*) (-field:text *:*) (non optimized) to (-(+field:some +field:text) *:*) (optimized) in the tasks ?

shubhamvishu commented 8 months ago

Ok, while checking on the benchmark sanity I found a mistake with the test I performed. For OrNegatedHighHigh queries, we should instead use queries like (-A) (-B) or (-A) | (-B) and not -A -B . The luceneutil benchmarks seems to be using the classic QueryParser which parses -A -B as MUST_NOT clauses whereas we expected it to be disjunction of SHOULD clauses. Here is how QueryParser#parse maps a negated queries to BQ :

QueryParser#parse("(-date) | (-he)") : (-field:date) (-field:he)
QueryParser#parse("(-date) (-he)") : (-field:date) (-field:he)
QueryParser#parse("(-date | -he)") : -field:date -field:he
QueryParser#parse("-date -he") : -field:date -field:he
QueryParser#parse("-(+date +he)") : -(+field:date +field:he)
QueryParser#parse("-(date he)") : -(field:date field:he)

Here we need to test the performance of queries 1 with 5 or 2 with 5. I'll update the query task file and post the new results shortly. Looking for thoughts if this makes sense?

shubhamvishu commented 8 months ago

I ran the benchmarks again with modified unoptimized queries (just changed -A -B -> (-A) (-B)). Benchmarks shows optimized version queries are still quite faster. All the queries had 2 disjuncted negated keywords, maybe increasing the cardinality here we should see more regression with unoptimized version?

Run 1 : ~6-8% faster

               TaskQPS  baseline      StdDevQPS my_modified_version      StdDev     Pct diff p-value
   OrNegatedHighHigh     7935.53      (7.8%)     8032.81      (8.3%)    1.2% ( -13% -   18%) 0.630
OrNegatedHighHighOpt     8459.21     (10.6%)     8680.91      (9.6%)    2.6% ( -15% -   25%) 0.413

Run 2 : ~11-14% faster

             TaskQPS    baseline      StdDevQPS my_modified_version      StdDev     Pct diff p-value
   OrNegatedHighHigh     7532.05      (7.5%)     7696.09      (9.2%)    2.2% ( -13% -   20%) 0.412
OrNegatedHighHighOpt     8812.66      (7.1%)     8712.71      (8.2%)   -1.1% ( -15% -   15%) 0.639

Run 3: ~12-13% faster

              TaskQPS    baseline      StdDevQPS my_modified_version      StdDev         Pct diff p-value
   OrNegatedHighHigh     7696.69      (6.9%)     7695.40      (7.5%)   -0.0% ( -13% -   15%) 0.994
OrNegatedHighHighOpt     8754.10      (8.2%)     8799.83      (8.9%)    0.5% ( -15% -   19%) 0.846

msokolov commented 7 months ago

Am I misreading your benchmark results? The Pct diff looks as if it in the -1 - 2% range and the p-values indicate no significant differences. For significance, we'd expect to see p-values < 0.05.

jpountz commented 7 months ago

@jpountz Do you mean to update the queries to something like (-field:some :) (-field:text :) (non optimized) to (-(+field:some +field:text) :) (optimized) in the tasks ?

I was more thinking of writing non optimized queries as *:* -field:some -field:text and optimized queries as *:* -(field:some field:text). This is the case that you are interested in benchmarking I believe: multiple negated clauses vs a single negated clause that is a disjunction?

mikemccand commented 7 months ago

Thanks for digging on this @shubhamvishu!

+1 to get a *:* (MatchAllDocsQuery) into each query, or maybe a required (conjunctive) HighTerm for a more realistic test?

I thought Lucene would not even run pure negated queries anymore, so I'm confused how the benchy is even running, on either the baseline or the candidate.

Also, note that the QPS in your latest run are absurdly high (7K or 8K QPS) and not really believable/trustworthy. I'd rather see at most maybe a few hundred QPS :)

shubhamvishu commented 7 months ago

Am I misreading your benchmark results? The Pct diff looks as if it in the -1 - 2% range and the p-values indicate no significant differences. For significance, we'd expect to see p-values < 0.05.

@msokolov Here I'm only comparing the raw QPS numbers for 2 tasks so the baseline and candidate are exactly same in one task or row (2 different representations of one query is executed by the 2 tasks). Hence, we could safely ignore p-value or pct-diff.

OrNegatedHighHigh - Non optimized query OrNegatedHighHighOpt - Optimized or rewritten query

shubhamvishu commented 7 months ago

I was more thinking of writing non optimized queries as : -field:some -field:text and optimized queries as : -(field:some field:text).

This also seem like one option and I ran benchmarks for this as well but rewriting this way into a single negated clause does not seem to help in the benchmarks. Will post the results shortly.

This is the case that you are interested in benchmarking I believe: multiple negated clauses vs a single negated clause that is a disjunction?

@jpountz I was thinking of the opposite case actually i.e. rewriting multiple NOT clauses under disjunctive clauses having only SHOULD or NOT clauses into a single negated clause that is a conjuctive. i.e.

(-A *:*) (-B *:*) -> -(+A +B) *:* (improves performance as per initial benchmarks)
-A -B *:* -> -(A B) *:* (degrades performance as per initial benchmarks)

shubhamvishu commented 7 months ago

I randomly sampled n terms (i.e. 2, 4, 6, 8) from the And* and Or* tasks so it include a good combination of all freq terms to run the benchmark against the two mentioned 2 rewrite possibilities. If req. I'll rerun it with only the HighTerm ones as @mikemccand mentioned since currently it includes all sorts of terms(low, med & high).

Results for :

(-A *:*) (-B *:*) -> -(+A +B) *:* - Performance improves drastically (I think could we trust these results?)

                    TaskQPS      baseline      StdDevQPS my_modified_version      StdDev        Pct diff p-value

             OrNegated2Terms        5.16     (15.0%)        5.16     (16.2%)   -0.1% ( -27% -   36%) 0.985
          OrNegated2TermsOpt     2680.67      (4.3%)     2658.35      (3.9%)   -0.8% (  -8% -    7%) 0.525   
             OrNegated4Terms        2.91     (16.0%)        2.90     (17.1%)   -0.2% ( -28% -   39%) 0.967
          OrNegated4TermsOpt     1766.87      (4.3%)     1783.57      (3.4%)    0.9% (  -6% -    8%) 0.437
             OrNegated6Terms        2.11     (17.3%)        2.11     (18.4%)   -0.2% ( -30% -   42%) 0.968
          OrNegated6TermsOpt     1441.13      (3.6%)     1456.27      (2.7%)    1.1% (  -5% -    7%) 0.294
             OrNegated8Terms        1.58     (17.0%)        1.58     (18.1%)   -0.3% ( -30% -   41%) 0.961
          OrNegated8TermsOpt     1087.37      (2.3%)     1088.95      (2.8%)    0.1% (  -4% -    5%) 0.857

-A -B *:* -> -(A B) *:* - Performance degrades

                        TaskQPS      baseline      StdDevQPS my_modified_version      StdDev        Pct diff p-value

                AndNegated2Terms     3133.71      (3.8%)     3163.54      (4.2%)    1.0% (  -6% -    9%) 0.457
             AndNegated2TermsOpt     2387.24      (3.3%)     2418.85      (2.3%)    1.3% (  -4% -    7%) 0.141
                AndNegated4Terms     2152.25      (2.6%)     2158.65      (3.4%)    0.3% (  -5% -    6%) 0.757
             AndNegated4TermsOpt     1749.57      (2.0%)     1785.41      (3.2%)    2.0% (  -3% -    7%) 0.014
                AndNegated6Terms     1548.31      (3.2%)     1545.62      (2.5%)   -0.2% (  -5% -    5%) 0.850
             AndNegated6TermsOpt     1385.68      (3.0%)     1411.57      (2.8%)    1.9% (  -3% -    7%) 0.039
                AndNegated8Terms     1282.64      (2.5%)     1281.78      (2.7%)   -0.1% (  -5% -    5%) 0.935
             AndNegated8TermsOpt     1184.57      (2.3%)     1189.02      (2.7%)    0.4% (  -4% -    5%) 0.636

mikemccand commented 7 months ago

I'm still not happy with the 2-3 K QPS :) Something seems amiss. The benchmark writes detailed "results" files for each iteration -- can you peek at those and confirm your two forms of the same query are in fact getting the same top N hits? That would be a quick sanity check.

It might be that BMW is kicking in, in some cases and not others? Maybe we can forcefully disable BMW if so -- we are trying to test the full iteration cost here (e.g. a query that wants to do facet counting)?

I think your case 2 above is a more "typical" comparison. Case 1 is causing MatchAllDocsQuery to run N times on the left, but once on the right (not really a fair comparison), vs case 2 which runs that query once on both baseline and competitor. But, still, 1-2K QPS results is not trustworthy -- something is up.

jpountz commented 7 months ago

I'm still not happy with the 2-3 K QPS :) Something seems amiss

I suspect it's because the only scoring query produces a constant score. So in practice, Lucene will stop processing the query as soon as it collect 1,000 hits (IndexSearcher.TOTAL_HITS_THRESHOLD).

mikemccand commented 7 months ago

I'm still not happy with the 2-3 K QPS :) Something seems amiss

I suspect it's because the only scoring query produces a constant score. So in practice, Lucene will stop processing the query as soon as it collect 1,000 hits (IndexSearcher.TOTAL_HITS_THRESHOLD).

Thanks @jpountz -- that makes sense. Since we are trying to measure whether the intersection / union cost of BooleanQuery is cheaper with a single negated sub-clause, would it make sense to disable BMW for this benchy? It'd more accurately measure that full cost ... and would be somewhat realistic for apps that are doing aggregations/faceting and need to visit all hits? Maybe we need to make this easier to do in luceneutil...

shubhamvishu commented 7 months ago

I did a grep in benchmarks logs. Here is what I got :

grep "TASK: cat=OrNegated2Terms " logs/* | uniq

logs/baseline_vs_patch.baseline.0:TASK: cat=OrNegated2Terms q=(-body:editing *:*) (-body:four *:*) s=null group=null hits=6005973+ facets=[]
logs/baseline_vs_patch.baseline.1:TASK: cat=OrNegated2Terms q=(-body:editing *:*) (-body:four *:*) s=null group=null hits=6005973+ facets=[]
""
""

grep "TASK: cat=OrNegated2TermsOpt " logs/* | uniq

logs/baseline_vs_patch.baseline.0:TASK: cat=OrNegated2TermsOpt q=-(+body:external +body:28) *:* s=null group=null hits=1001+ facets=[]
logs/baseline_vs_patch.baseline.1:TASK: cat=OrNegated2TermsOpt q=-(+body:external +body:28) *:* s=null group=null hits=1001+ facets=[]
""
""

grep "TASK: cat=AndNegated2Terms " logs/* | uniq

logs/baseline_vs_patch.baseline.0:TASK: cat=AndNegated2Terms q=-body:birth -body:john *:* s=null group=null hits=1001+ facets=[]
logs/baseline_vs_patch.baseline.1:TASK: cat=AndNegated2Terms q=-body:birth -body:john *:* s=null group=null hits=1001+ facets=[]
""
""

grep "TASK: cat=AndNegated2TermsOpt " logs/* | uniq

logs/baseline_vs_patch.baseline.0:TASK: cat=AndNegated2TermsOpt q=-(body:infantry body:title) *:* s=null group=null hits=1001+ facets=[]
logs/baseline_vs_patch.baseline.1:TASK: cat=AndNegated2TermsOpt q=-(body:infantry body:title) *:* s=null group=null hits=1001+ facets=[]
""
""

The benchmark writes detailed "results" files for each iteration -- can you peek at those and confirm your two forms of the same query are in fact getting the same top N hits? That would be a quick sanity check.

@mikemccand I see one of the queries for OrNegated2Terms had hits=6005973+ compared to a query in OrNegated2TermsOpt that had hits=1001+. So seems we are stopping early for OrNegated2TermsOpt after collecting 1000 hits as @jpountz mentioned. Also I only see one query task in the logs, so couldn't compare 2 same queries against each other. Do we log only 1 query per task in the benchmark logs?

Case 1 is causing MatchAllDocsQuery to run N times on the left, but once on the right (not really a fair comparison)

My understanding here was we are trying to rewrite the query in such a way 2 queries produces the same result but is more optimized(we could confirm the overlap numbers). So why do we still care about running MatchAllDocsQuery N times vs 1 time incase if they are give identical results? or we don't?

mikemccand commented 7 months ago

The benchmark writes detailed "results" files for each iteration -- can you peek at those and confirm your two forms of the same query are in fact getting the same top N hits? That would be a quick sanity check.

@mikemccand I see one of the queries for OrNegated2Terms had hits=6005973+ compared to a query in OrNegated2TermsOpt that had hits=1001+. So seems we are stopping early for OrNegated2TermsOpt after collecting 1000 hits as @jpountz mentioned. Also I only see one query task in the logs, so couldn't compare 2 same queries against each other. Do we log only 1 query per task in the benchmark logs?

Indeed, the 1001+ means BMW applied (see this nice blog post about it, thank you @jpountz!).

So this means your first test ((-A *:*) (-B *:*) -> -(+A +B) *:*) is also testing "BMW applies" against "BMW does not apply", which is maybe not what we want to test? Or, I suppose, rewriting a query such that BMW becomes enabled could be seen as an awesome optimization! Still, I think we should disable BMW for all runs? Let's see which query form is faster for Lucene iterating many hits?

(I do wonder why it's 1001+ and not 1000+ but probably there is a harmless ob1 somewhere, heh).

Note that it's not stopping early -- the BMW optimization uses skipping to jump through the postings blocks to find possibly competitive (might crack into the top K) hits. It is an admissible optimization, meaning it is 100% correct (unless we have exciting bugs) and will always give the same top K hits returned as if the whole index was checked.

In your case the scores are quite synthetic (just 1f from *:* query I suspect?) and so BMW likely is in fact able to stop after collecting the first 1001 hits (since we tie-break by docid). If you look in your results file you should see the scores of the K hits collected for each task.

Also I only see one query task in the logs, so couldn't compare 2 same queries against each other. Do we log only 1 query per task in the benchmark logs?

Hmm that's odd. Each specific query should have been executed N times by M threads, so you should see N * M instances of it in each iteration's results file. Can you post one of your results files for each of baseline and candidate?

Case 1 is causing MatchAllDocsQuery to run N times on the left, but once on the right (not really a fair comparison)

My understanding here was we are trying to rewrite the query in such a way 2 queries produces the same result but is more optimized(we could confirm the overlap numbers). So why do we still care about running MatchAllDocsQuery N times vs 1 time incase if they are give identical results? or we don't?

OK indeed that is the goal here (rewrite queries to faster form). I guess I feel like case 2) is the more common real-world example? "search for stuff but filter stuff out" use case.

shubhamvishu commented 7 months ago

I suppose, rewriting a query such that BMW becomes enabled could be seen as an awesome optimization! Still, I think we should disable BMW for all runs? Let's see which query form is faster for Lucene iterating many hits?

Makes sense! Is there any straightforward way to disable BMW for luceneutil benchmarks or for a query? I couldn't find any helpful documentation around it both here and in apache lucene.

Hmm that's odd. Each specific query should have been executed N times by M threads, so you should see N * M instances of it in each iteration's results file. Can you post one of your results files for each of baseline and candidate?

@mikemccand I uploaded one log file for each baseline and candidate in the recent commit(under logs dir). Note : baseline and candidate checked out code are exactly same, since we are doing what rewrite optimization would have achieved but using 2 separate query tasks here having different opt/non-opt versions of same queries.

If we grep in the log files for the text "OrNegated2Terms " or "OrNegated2TermsOpt " I see both tasks using different query though that query indeed has NM(120=20) instances in the single log file. But what about the other queries of the task?....If I'm right there are 500 queries for each task so I expected to find all of them or alteast more but looks like there is a query(random out of 500?) picked for each iteration(out of 20) and runs 20(N*M) times i.e. total 20 different queries?. This is why wasn't able to compare 1-1 for a single query. Very likely, I might be misinterpreting something here.

OK indeed that is the goal here (rewrite queries to faster form). I guess I feel like case 2) is the more common real-world example? "search for stuff but filter stuff out" use case.

I agree! 2nd case looks more common use case compare to 1st.

shubhamvishu commented 6 months ago

@mikemccand @jpountz

I ran benchmarks again and this time increased TOTAL_HITS_THRESHOLD in IndexSearcher.java to Integer.MAX_VALUE to disable BMW and not restrict counting only 1000 hits. I see performance is improved in case of optimized queries (i.e. *TermsOpt tasks). Also pasted the logs snippet to confirm the total hits are indeed similar and not restricted to 1000 like earlier run.

                         TaskQPS       baseline   StdDevQPS my_modified_version      StdDev    Pct diff p-value

                 OrNegated2Terms        1.74     (10.3%)        1.73      (9.7%)   -0.3% ( -18% -   21%) 0.917
              OrNegated2TermsOpt        5.85     (17.0%)        6.03     (17.1%)    3.0% ( -26% -   44%) 0.575
                 OrNegated4Terms        1.15     (12.0%)        1.15     (11.2%)   -0.1% ( -20% -   26%) 0.969
              OrNegated4TermsOpt        5.90     (17.4%)        6.09     (17.3%)    3.1% ( -26% -   45%) 0.576 
                 OrNegated6Terms        0.81     (11.7%)        0.81     (10.7%)   -0.1% ( -20% -   25%) 0.974
              OrNegated6TermsOpt        5.93     (17.4%)        6.12     (17.4%)    3.2% ( -26% -   46%) 0.558
                 OrNegated8Terms        0.68     (13.6%)        0.68     (12.7%)   -0.1% ( -23% -   30%) 0.980
              OrNegated8TermsOpt        5.94     (17.5%)        6.13     (17.5%)    3.2% ( -27% -   46%) 0.564

                AndNegated2Terms        4.22     (11.1%)        4.29     (11.7%)    1.6% ( -19% -   27%) 0.653
             AndNegated2TermsOpt        4.60     (12.4%)        4.67     (12.9%)    1.7% ( -21% -   30%) 0.679
                AndNegated4Terms        2.07      (4.8%)        2.09      (5.6%)    1.1% (  -8% -   12%) 0.522
             AndNegated4TermsOpt        2.51      (6.1%)        2.54      (6.8%)    1.3% ( -10% -   15%) 0.534
                AndNegated6Terms        2.03      (4.8%)        2.05      (5.6%)    1.0% (  -8% -   11%) 0.524
             AndNegated6TermsOpt        2.58      (6.3%)        2.61      (6.9%)    1.3% ( -11% -   15%) 0.524
                AndNegated8Terms        2.06      (4.7%)        2.09      (5.4%)    1.1% (  -8% -   11%) 0.485
             AndNegated8TermsOpt        2.29      (5.4%)        2.32      (6.1%)    1.2% (  -9% -   13%) 0.506

grep "TASK: cat=OrNegated2Terms " logs/* | uniq

baseline_vs_patch.baseline.0:TASK: cat=OrNegated2Terms q=(-body:eric *:*) (-body:kansas *:*) s=null group=null hits=33331640 facets=[]
baseline_vs_patch.baseline.1:TASK: cat=OrNegated2Terms q=(-body:eric *:*) (-body:kansas *:*) s=null group=null hits=33331640 facets=[]

grep "TASK: cat=OrNegated2TermsOpt " ~/luceneutil-repo/logs/* | uniq

baseline_vs_patch.baseline.0:TASK: cat=OrNegated2TermsOpt q=-(+body:middle +body:area_code) *:* s=null group=null hits=33331952 facets=[]
baseline_vs_patch.baseline.1:TASK: cat=OrNegated2TermsOpt q=-(+body:middle +body:area_code) *:* s=null group=null hits=33331952 facets=[]

grep "TASK: cat=AndNegated2Terms " ~/luceneutil-repo/logs/* | uniq

baseline_vs_patch.baseline.0:TASK: cat=AndNegated2Terms q=-body:history -body:mountain *:* s=null group=null hits=30947732 facets=[]
baseline_vs_patch.baseline.1:TASK: cat=AndNegated2Terms q=-body:history -body:mountain *:* s=null group=null hits=30947732 facets=[]

grep "TASK: cat=AndNegated2TermsOpt " ~/luceneutil-repo/logs/* | uniq

baseline_vs_patch.baseline.0:TASK: cat=AndNegated2TermsOpt q=-(body:ben body:26) *:* s=null group=null hits=31603551 facets=[]
baseline_vs_patch.baseline.1:TASK: cat=AndNegated2TermsOpt q=-(body:ben body:26) *:* s=null group=null hits=31603551 facets=[]

mikemccand commented 6 months ago

These gains indeed look correct (identical hit counts from X and XOpt tasks) and significant, especially for the N-term OR cases! Thanks @shubhamvishu.

I thought we had a Lucene issue to explore having Lucene rewrite BQ clauses e.g. -A -B -C -> -(A B C)? Can you link it here?

mikemccand commented 6 months ago

Actually, I'm not sure how the OrNegatedNTerms rewrite case would work? It's a strange query e.g. q=(-body:eric *:*) (-body:kansas *:*). I'm not sure this really happens in practice very often?

Maybe we first explore rewriting the AndNegatedNTerms case?

shubhamvishu commented 6 months ago

So I changed the new tasks to count only tasks as it was pointed out as alternative in #265 and pushed the tasks file I used and the logs file. For AndNegatedNTerms case, I had changed the tasks from example count(-most -september *:*) (non optimized) to count(-(most september)) *:* (optimized) and the I see no improvement with this rewriting. Though in the less realistic OrNegatedNTerms case improvements are constants across like in earlier runs . @mikemccand does this way makes more sense or we could stick to earlier results we saw when increasing TOTAL_HITS_THRESHOLD in IndexSearcher.java to Integer.MAX_VALUE ?

Note : I tried writing `count(-most -september :)task ascount(-(most september) :)but that seem to do what we want and results in 0 results. so I had to take the:part out ofcount()`

                  TaskQPS             baseline     StdDevQPS my_modified_version      StdDev   Pct diff p-value

                 OrNegated2Terms        1.51      (5.6%)        1.47      (5.6%)   -2.5% ( -12% -    9%) 0.160
              OrNegated2TermsOpt       80.08     (18.9%)       77.57     (21.3%)   -3.1% ( -36% -   45%) 0.622
                 OrNegated4Terms        3.35     (26.9%)        3.26     (25.6%)   -2.7% ( -43% -   68%) 0.749
              OrNegated4TermsOpt       73.31     (17.8%)       71.27     (20.3%)   -2.8% ( -34% -   42%) 0.645
                 OrNegated6Terms        1.40     (18.3%)        1.37     (16.4%)   -2.7% ( -31% -   39%) 0.622
              OrNegated6TermsOpt      326.74     (55.5%)      303.54     (54.4%)   -7.1% ( -75% -  231%) 0.683
                 OrNegated8Terms        0.60     (14.7%)        0.58     (13.6%)   -4.0% ( -28% -   28%) 0.372
              OrNegated8TermsOpt      105.69     (24.5%)      102.33     (26.7%)   -3.2% ( -43% -   63%) 0.695

                AndNegated2Terms      377.72     (59.5%)      351.59     (58.9%)   -6.9% ( -78% -  275%) 0.712
             AndNegated2TermsOpt        4.51      (4.1%)        4.50      (4.0%)   -0.3% (  -8% -    8%) 0.835
                AndNegated4Terms        3.91      (3.7%)        3.89      (3.6%)   -0.5% (  -7% -    6%) 0.649
             AndNegated4TermsOpt        4.14      (3.8%)        4.13      (3.3%)   -0.3% (  -7% -    7%) 0.789
                AndNegated6Terms        3.79      (3.5%)        3.79      (3.0%)   -0.1% (  -6% -    6%) 0.911
             AndNegated6TermsOpt        2.18      (3.0%)        2.17      (2.7%)   -0.3% (  -5% -    5%) 0.724
                AndNegated8Terms        7.06      (3.4%)        7.04      (3.3%)   -0.4% (  -6% -    6%) 0.721
             AndNegated8TermsOpt        3.08      (3.3%)        3.07      (2.9%)   -0.2% (  -6% -    6%) 0.817

grep "TASK: cat=OrNegated2Terms " logs/* | uniq

logs/baseline_vs_patch.baseline.0:TASK: cat=OrNegated2Terms q=-body:basic *:* (-body:you *:*) countOnlyCount=33127647
logs/baseline_vs_patch.baseline.1:TASK: cat=OrNegated2Terms q=-body:basic *:* (-body:you *:*) countOnlyCount=33127647

grep "TASK: cat=OrNegated2TermsOpt " ~/luceneutil-repo/logs/* | uniq

/home/cshbha/luceneutil-repo/logs/baseline_vs_patch.baseline.0:TASK: cat=OrNegated2TermsOpt q=-(+body:677 +body:him) *:* countOnlyCount=33332481
/home/cshbha/luceneutil-repo/logs/baseline_vs_patch.baseline.1:TASK: cat=OrNegated2TermsOpt q=-(+body:677 +body:him) *:* countOnlyCount=33332481

grep "TASK: cat=AndNegated2Terms " ~/luceneutil-repo/logs/* | uniq

/home/cshbha/luceneutil-repo/logs/baseline_vs_patch.baseline.0:TASK: cat=AndNegated2Terms q=-body:id -body:former *:* countOnlyCount=29975741
/home/cshbha/luceneutil-repo/logs/baseline_vs_patch.baseline.1:TASK: cat=AndNegated2Terms q=-body:id -body:former *:* countOnlyCount=29975741

grep "TASK: cat=AndNegated2TermsOpt " ~/luceneutil-repo/logs/* | uniq

/home/cshbha/luceneutil-repo/logs/baseline_vs_patch.baseline.0:TASK: cat=AndNegated2TermsOpt q=-(body:old body:26) *:* countOnlyCount=30686467
/home/cshbha/luceneutil-repo/logs/baseline_vs_patch.baseline.1:TASK: cat=AndNegated2TermsOpt q=-(body:old body:26) *:* countOnlyCount=30686467

grep "TASK: cat=AndNegated4Terms " ~/luceneutil-repo/logs/* | uniq

/home/cshbha/luceneutil-repo/logs/baseline_vs_patch.baseline.0:TASK: cat=AndNegated4Terms q=-body:dreamed -body:film -body:23 -body:may *:* countOnlyCount=27339654
/home/cshbha/luceneutil-repo/logs/baseline_vs_patch.baseline.1:TASK: cat=AndNegated4Terms q=-body:dreamed -body:film -body:23 -body:may *:* countOnlyCount=27339654

grep "TASK: cat=AndNegated4TermsOpt " ~/luceneutil-repo/logs/* | uniq

/home/cshbha/luceneutil-repo/logs/baseline_vs_patch.baseline.0:TASK: cat=AndNegated4TermsOpt q=-(body:had body:5 body:can body:budget) *:* countOnlyCount=25254461
/home/cshbha/luceneutil-repo/logs/baseline_vs_patch.baseline.1:TASK: cat=AndNegated4TermsOpt q=-(body:had body:5 body:can body:budget) *:* countOnlyCount=25254461

shubhamvishu commented 6 months ago

I thought we had a Lucene issue to explore having Lucene rewrite BQ clauses e.g. -A -B -C -> -(A B C)? Can you link it here?

I don't remember coming across any open issue in Lucene queue for this. I was thinking of doing that once we consolidate the findings here based on the benchmarks results and then pursue it in Lucene. Maybe should I open one right away?

Maybe we first explore rewriting the AndNegatedNTerms case?

If the above approach of disabling BMW using count() seems right then I wonder we are not seeing any gains in the AndNegatedNTerms case, but with the earlier approach of maxing out TOTAL_HITS_THRESHOLD in IndexSearcher.java to Integer.MAX_VALUE the results look more reliable and less noisy to me. Looking for suggestions here.

mikemccand commented 5 months ago

Note : I tried writing count(-most -september :) task as count(-(most september) :) but that seem to do what we want and results in 0 results. so I had to take the : part out of count()

Did you mean that seemed NOT to do what we want? Likely this is a bug in luceneutil's parsing of the query? count(-(most september) *:*) should produce non-zero hits. Can you try debugging the luceneutil parsing of this query -- e.g. print the resulting Query that's passed to IndexSearcher.count and see if it's correct?

But also in your grep output it seems like you did not take the *:* out of the AndNegatedNTermsOpt tasks? I'm confused...

Maybe we first explore rewriting the AndNegatedNTerms case?

If the above approach of disabling BMW using count() seems right then I wonder we are not seeing any gains in the AndNegatedNTerms case, but with the earlier approach of maxing out TOTAL_HITS_THRESHOLD in IndexSearcher.java to Integer.MAX_VALUE the results look more reliable and less noisy to me. Looking for suggestions here.

Well, count() is dangerous because Lucene has sub-linear optimizations it may do (e.g. the high QPS for AndNegated2Terms above looks like such an count opto is applying). Increasing TOTAL_HITS_THRESHOLD (temporarily for benchmarking) is more of a guarantee that Lucene must iterate all the postings, so I trust these results more.

Can you post one of your results files for each of baseline and candidate? I'm still confused why you see only one instance of each task in the benchmark output files... and could you also post you perf.by script and the command-line you're using to invoke it? Thanks.

shubhamvishu commented 4 months ago

Did you mean that seemed NOT to do what we want? Likely this is a bug in luceneutil's parsing of the query? count(-(most september) :) should produce non-zero hits. Can you try debugging the luceneutil parsing of this query -- e.g. print the resulting Query that's passed to IndexSearcher.count and see if it's correct?

@mikemccand Yes, it seems to be doing wrong with the way the task is parsed. I checked the query paseed to IndexSearcher#count for task : count(-(most september) *:*) is -(body:2001 body:left *:*) which is incorrect as the MatchAllDocsQuery clause is moved within the negated clause which is leading to zero hits.

But also in your grep output it seems like you did not take the : out of the AndNegatedNTermsOpt tasks? I'm confused...

Here is how the a task is mapped to Lucene query and its equivalent hits :

	Task	Lucene Query	Hits (`countOnlyCount` from logs)
1.	`AndNegated2Terms: count(-most -september :)`	`-body:most -body:september :`	29781756
2.	`AndNegated2TermsOpt: count(-(most september)) :`	`-(body:press body:non) :`	31064896
3.	`ZeroHitsAndNegated2TermsOpt: count(-(most september) :)`	`-(body:2001 body:left :)`	0

The issue here seems to be with the ZeroHitsAndNegated2TermsOpt as it moves the *:* within negated clause. I have added tasks for ZeroHitsAndNegated2TermsOpt case in the new commit if incase someone wants to take a look.Ideally we want the behaviour of 2nd row i.e. AndNegated2TermsOpt to test the rewrite optimization but since 3. was not working as expected I went ahead with 2. and moved the *:* out of count(). If this makes sense and 3. indeed seems wrong, maybe we should open an issue to fix that?

Increasing TOTAL_HITS_THRESHOLD (temporarily for benchmarking) is more of a guarantee that Lucene must iterate all the postings, so I trust these results more.

Great! So we could hope some QPS improvement with this rewrite change I guess. If the testing so far seems convincing enough, I could go ahead and prepare a PR for this in Apache Lucene with the benchmarking methodology of maxing out IndexSearcher.TOTAL_HITS_THRESHOLD against the tasks added in this commit (Task file : wikimedium.10M.nostopwords.tasks). Any thoughts? @mikemccand @jpountz

Can you post one of your results files for each of baseline and candidate? I'm still confused why you see only one instance of each task in the benchmark output files... and could you also post you perf.by script and the command-line you're using to invoke it? Thanks.

I have also attached both the logs file in the latest commit on this PR (its under the folder logs). Yeah, I'm not sure why I still see only one occurrence in the logs as I'm running vanilla benchmarks and not changed anything in benchmarks code.

could you also post you perf.by script and the command-line you're using to invoke it?

I'm running the benchmarks using python3 src/python/localrun.py -source wikimediumall along with the changes in this PR(i.e. the new added tasks). I'm using the vanilla localrun.py(I suppose you mean this by perf.py) and haven't changed anything in the benchmarks except the newly added tasks.

shubhamvishu commented 3 months ago

Hi @mikemccand @jpountz , I'm looking for review on the above results. Any thoughts on these results?

shubhamvishu commented 2 months ago

Looking for some thoughts here. cc - @mikemccand @jpountz