shubhamvishu opened 8 months ago
I'm unclear of your goal here. Since these queries only have negated clauses, they do not match anything. Are you focusing on the rewrite() cost?
One of our engineers in Amazon Product Search found that replacing a SHOULD clause holding disjunctive negated keywords, e.g. (-A -B -C -D), with (-(A B C D)) in the BQ is much faster. I ran these benchmarks to confirm that finding and to see if we can optimize the rewrite for disjunctive negated keywords in SHOULD clauses(?). Let me know if it makes sense or if I'm missing something.
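For concreteness, here is a minimal sketch (not code from this PR) of the two logically equivalent forms built directly with Lucene's BooleanQuery API; the "body" field and the added *:* positive clause are illustrative assumptions (the *:* is needed because a BooleanQuery holding only negated clauses matches nothing, as noted above):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class NegatedForms {
  // -A -B -C -D *:*  : one MUST_NOT clause per term
  static Query multiNegated(String... terms) {
    BooleanQuery.Builder b = new BooleanQuery.Builder();
    for (String t : terms) {
      b.add(new TermQuery(new Term("body", t)), Occur.MUST_NOT);
    }
    b.add(new MatchAllDocsQuery(), Occur.SHOULD);
    return b.build();
  }

  // -(A B C D) *:*  : a single MUST_NOT clause wrapping a disjunction
  static Query singleNegated(String... terms) {
    BooleanQuery.Builder inner = new BooleanQuery.Builder();
    for (String t : terms) {
      inner.add(new TermQuery(new Term("body", t)), Occur.SHOULD);
    }
    return new BooleanQuery.Builder()
        .add(inner.build(), Occur.MUST_NOT)
        .add(new MatchAllDocsQuery(), Occur.SHOULD)
        .build();
  }
}
```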
Since these queries only have negated clauses, they do not match anything.
@jpountz Do you mean to update the queries in the tasks from something like (-field:some *:*) (-field:text *:*) (non optimized) to (-(+field:some +field:text) *:*) (optimized)?
Ok, while checking on the benchmark's sanity I found a mistake in the test I performed. For OrNegatedHighHigh queries, we should instead use queries like (-A) (-B) or (-A) | (-B), and not -A -B. The luceneutil benchmarks seem to be using the classic QueryParser, which parses -A -B as MUST_NOT clauses, whereas we expected a disjunction of SHOULD clauses. Here is how QueryParser#parse maps negated queries to a BQ:
1. (-field:date) (-field:he)
2. (-field:date) | (-field:he)
3. -field:date -field:he
4. -field:date | -field:he
5. -(+field:date +field:he)
6. -(field:date field:he)
Here we need to test the performance of query 1 against 5, or 2 against 5. I'll update the query task file and post the new results shortly. Looking for thoughts on whether this makes sense.
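To see which BQ each spelling actually produces, a quick hedged sketch against the classic QueryParser (the analyzer choice is an assumption):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryparser.classic.QueryParser;

public class ParseDemo {
  public static void main(String[] args) throws Exception {
    QueryParser qp = new QueryParser("field", new StandardAnalyzer());
    for (String s : new String[] {"(-date) (-he)", "-date -he", "-(+date +he)", "-(date he)"}) {
      // toString shows how each spelling maps to SHOULD vs MUST_NOT clauses
      System.out.println(s + "  ->  " + qp.parse(s));
    }
  }
}
```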
I ran the benchmarks again with modified unoptimized queries (just changed -A -B -> (-A) (-B)). The benchmarks show the optimized version of the queries is still quite a bit faster. All the queries had 2 disjuncted negated keywords; maybe by increasing the cardinality here we would see more regression with the unoptimized version?
Run 1: ~6-8% faster
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value
OrNegatedHighHigh 7935.53 (7.8%) 8032.81 (8.3%) 1.2% ( -13% - 18%) 0.630
OrNegatedHighHighOpt 8459.21 (10.6%) 8680.91 (9.6%) 2.6% ( -15% - 25%) 0.413
Run 2: ~11-14% faster
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value
OrNegatedHighHigh 7532.05 (7.5%) 7696.09 (9.2%) 2.2% ( -13% - 20%) 0.412
OrNegatedHighHighOpt 8812.66 (7.1%) 8712.71 (8.2%) -1.1% ( -15% - 15%) 0.639
Run 3: ~12-13% faster
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value
OrNegatedHighHigh 7696.69 (6.9%) 7695.40 (7.5%) -0.0% ( -13% - 15%) 0.994
OrNegatedHighHighOpt 8754.10 (8.2%) 8799.83 (8.9%) 0.5% ( -15% - 19%) 0.846
Am I misreading your benchmark results? The Pct diff looks as if it is in the -1 - 2% range and the p-values indicate no significant differences. For significance, we'd expect to see p-values < 0.05.
@jpountz Do you mean to update the queries in the tasks from something like (-field:some *:*) (-field:text *:*) (non optimized) to (-(+field:some +field:text) *:*) (optimized)?
I was more thinking of writing non optimized queries as *:* -field:some -field:text and optimized queries as *:* -(field:some field:text). This is the case that you are interested in benchmarking I believe: multiple negated clauses vs a single negated clause that is a disjunction?
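In luceneutil's tasks-file syntax (Category: query, the same shape visible in the grep output further down this thread), that pair might look like the following; the category names and field are illustrative:

```
AndNegated2Terms: *:* -field:some -field:text
AndNegated2TermsOpt: *:* -(field:some field:text)
```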
Thanks for digging on this @shubhamvishu!
+1 to get a *:* (MatchAllDocsQuery) into each query, or maybe a required (conjunctive) HighTerm for a more realistic test?
I thought Lucene would not even run pure negated queries anymore, so I'm confused how the benchy is even running, on either the baseline or the candidate.
Also, note that the QPS in your latest run are absurdly high (7K or 8K QPS) and not really believable/trustworthy. I'd rather see at most maybe a few hundred QPS :)
Am I misreading your benchmark results? The Pct diff looks as if it is in the -1 - 2% range and the p-values indicate no significant differences. For significance, we'd expect to see p-values < 0.05.
@msokolov Here I'm only comparing the raw QPS numbers for the 2 tasks, so baseline and candidate are exactly the same within one task or row (the 2 tasks execute 2 different representations of the same query). Hence, we can safely ignore the p-value and pct-diff.
OrNegatedHighHigh - non optimized query
OrNegatedHighHighOpt - optimized or rewritten query
I was more thinking of writing non optimized queries as *:* -field:some -field:text and optimized queries as *:* -(field:some field:text).
This also seems like a valid option and I ran benchmarks for it as well, but rewriting this way into a single negated clause does not seem to help in the benchmarks. Will post the results shortly.
This is the case that you are interested in benchmarking I believe: multiple negated clauses vs a single negated clause that is a disjunction?
@jpountz I was thinking of the opposite case actually, i.e. rewriting multiple NOT clauses, under disjunctive clauses having only SHOULD or NOT clauses, into a single negated clause that is a conjunction. i.e.

1. (-A *:*) (-B *:*) -> -(+A +B) *:* (improves performance as per initial benchmarks)
2. -A -B *:* -> -(A B) *:* (degrades performance as per initial benchmarks)

I randomly sampled n terms (i.e. 2, 4, 6, 8) from the And* and Or* tasks so they include a good mix of terms of all frequencies to run the benchmark against the 2 rewrite possibilities mentioned above. If required, I'll rerun it with only the HighTerm ones as @mikemccand mentioned, since currently it includes all sorts of terms (low, med & high).
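Before timing the two forms, it's worth sanity-checking that each pair really is an identity (both cases are De Morgan transformations). A minimal hedged sketch, assuming an open IndexSearcher over the benchmark index; the terms and field are illustrative:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;

// De Morgan: (NOT a) OR (NOT b) == NOT (a AND b), and
//            (NOT a) AND (NOT b) == NOT (a OR b),
// so the pre- and post-rewrite queries must match exactly the same documents.
static void checkRewriteIsIdentity(IndexSearcher searcher) throws Exception {
  QueryParser parser = new QueryParser("body", new StandardAnalyzer());
  int before = searcher.count(parser.parse("(-body:most *:*) (-body:september *:*)"));
  int after = searcher.count(parser.parse("-(+body:most +body:september) *:*"));
  if (before != after) {
    throw new AssertionError("rewrite changed the hit count: " + before + " != " + after);
  }
}
```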
Results for:
(-A *:*) (-B *:*) -> -(+A +B) *:* - performance improves drastically (though could we trust these results?)
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value
OrNegated2Terms 5.16 (15.0%) 5.16 (16.2%) -0.1% ( -27% - 36%) 0.985
OrNegated2TermsOpt 2680.67 (4.3%) 2658.35 (3.9%) -0.8% ( -8% - 7%) 0.525
OrNegated4Terms 2.91 (16.0%) 2.90 (17.1%) -0.2% ( -28% - 39%) 0.967
OrNegated4TermsOpt 1766.87 (4.3%) 1783.57 (3.4%) 0.9% ( -6% - 8%) 0.437
OrNegated6Terms 2.11 (17.3%) 2.11 (18.4%) -0.2% ( -30% - 42%) 0.968
OrNegated6TermsOpt 1441.13 (3.6%) 1456.27 (2.7%) 1.1% ( -5% - 7%) 0.294
OrNegated8Terms 1.58 (17.0%) 1.58 (18.1%) -0.3% ( -30% - 41%) 0.961
OrNegated8TermsOpt 1087.37 (2.3%) 1088.95 (2.8%) 0.1% ( -4% - 5%) 0.857
-A -B *:* -> -(A B) *:* - performance degrades

TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value
AndNegated2Terms 3133.71 (3.8%) 3163.54 (4.2%) 1.0% ( -6% - 9%) 0.457
AndNegated2TermsOpt 2387.24 (3.3%) 2418.85 (2.3%) 1.3% ( -4% - 7%) 0.141
AndNegated4Terms 2152.25 (2.6%) 2158.65 (3.4%) 0.3% ( -5% - 6%) 0.757
AndNegated4TermsOpt 1749.57 (2.0%) 1785.41 (3.2%) 2.0% ( -3% - 7%) 0.014
AndNegated6Terms 1548.31 (3.2%) 1545.62 (2.5%) -0.2% ( -5% - 5%) 0.850
AndNegated6TermsOpt 1385.68 (3.0%) 1411.57 (2.8%) 1.9% ( -3% - 7%) 0.039
AndNegated8Terms 1282.64 (2.5%) 1281.78 (2.7%) -0.1% ( -5% - 5%) 0.935
AndNegated8TermsOpt 1184.57 (2.3%) 1189.02 (2.7%) 0.4% ( -4% - 5%) 0.636
I'm still not happy with the 2-3 K QPS :) Something seems amiss. The benchmark writes detailed "results" files for each iteration -- can you peek at those and confirm your two forms of the same query are in fact getting the same top N hits? That would be a quick sanity check.
It might be that BMW is kicking in, in some cases and not others? Maybe we can forcefully disable BMW if so -- we are trying to test the full iteration cost here (e.g. a query that wants to do facet counting)?
I think your case 2 above is a more "typical" comparison. Case 1 is causing MatchAllDocsQuery to run N times on the left, but once on the right (not really a fair comparison), vs case 2 which runs that query once on both baseline and competitor. But, still, 1-2K QPS results are not trustworthy -- something is up.
I'm still not happy with the 2-3 K QPS :) Something seems amiss
I suspect it's because the only scoring query produces a constant score. So in practice, Lucene will stop processing the query as soon as it collects 1,000 hits (IndexSearcher.TOTAL_HITS_THRESHOLD).
I'm still not happy with the 2-3 K QPS :) Something seems amiss
I suspect it's because the only scoring query produces a constant score. So in practice, Lucene will stop processing the query as soon as it collects 1,000 hits (IndexSearcher.TOTAL_HITS_THRESHOLD).
Thanks @jpountz -- that makes sense. Since we are trying to measure whether the intersection / union cost of BooleanQuery is cheaper with a single negated sub-clause, would it make sense to disable BMW for this benchy? It'd more accurately measure that full cost ... and would be somewhat realistic for apps that are doing aggregations/faceting and need to visit all hits? Maybe we need to make this easier to do in luceneutil ...
I did a grep in the benchmark logs. Here is what I got:
grep "TASK: cat=OrNegated2Terms " logs/* | uniq
logs/baseline_vs_patch.baseline.0:TASK: cat=OrNegated2Terms q=(-body:editing *:*) (-body:four *:*) s=null group=null hits=6005973+ facets=[]
logs/baseline_vs_patch.baseline.1:TASK: cat=OrNegated2Terms q=(-body:editing *:*) (-body:four *:*) s=null group=null hits=6005973+ facets=[]
""
""
grep "TASK: cat=OrNegated2TermsOpt " logs/* | uniq
logs/baseline_vs_patch.baseline.0:TASK: cat=OrNegated2TermsOpt q=-(+body:external +body:28) *:* s=null group=null hits=1001+ facets=[]
logs/baseline_vs_patch.baseline.1:TASK: cat=OrNegated2TermsOpt q=-(+body:external +body:28) *:* s=null group=null hits=1001+ facets=[]
""
""
grep "TASK: cat=AndNegated2Terms " logs/* | uniq
logs/baseline_vs_patch.baseline.0:TASK: cat=AndNegated2Terms q=-body:birth -body:john *:* s=null group=null hits=1001+ facets=[]
logs/baseline_vs_patch.baseline.1:TASK: cat=AndNegated2Terms q=-body:birth -body:john *:* s=null group=null hits=1001+ facets=[]
""
""
grep "TASK: cat=AndNegated2TermsOpt " logs/* | uniq
logs/baseline_vs_patch.baseline.0:TASK: cat=AndNegated2TermsOpt q=-(body:infantry body:title) *:* s=null group=null hits=1001+ facets=[]
logs/baseline_vs_patch.baseline.1:TASK: cat=AndNegated2TermsOpt q=-(body:infantry body:title) *:* s=null group=null hits=1001+ facets=[]
""
""
The benchmark writes detailed "results" files for each iteration -- can you peek at those and confirm your two forms of the same query are in fact getting the same top N hits? That would be a quick sanity check.
@mikemccand I see one of the queries for OrNegated2Terms had hits=6005973+ compared to a query in OrNegated2TermsOpt that had hits=1001+. So it seems we are stopping early for OrNegated2TermsOpt after collecting 1000 hits, as @jpountz mentioned. Also, I only see one query per task in the logs, so I couldn't compare the 2 forms of the same query against each other. Do we log only 1 query per task in the benchmark logs?
Case 1 is causing MatchAllDocsQuery to run N times on the left, but once on the right (not really a fair comparison)
My understanding here was that we are trying to rewrite the query in such a way that the 2 queries produce the same result but one is more optimized (we could confirm the overlap numbers). So why do we still care about running MatchAllDocsQuery N times vs 1 time in case they give identical results? Or don't we?
The benchmark writes detailed "results" files for each iteration -- can you peek at those and confirm your two forms of the same query are in fact getting the same top N hits? That would be a quick sanity check.
@mikemccand I see one of the queries for OrNegated2Terms had hits=6005973+ compared to a query in OrNegated2TermsOpt that had hits=1001+. So it seems we are stopping early for OrNegated2TermsOpt after collecting 1000 hits, as @jpountz mentioned. Also, I only see one query per task in the logs, so I couldn't compare the 2 forms of the same query against each other. Do we log only 1 query per task in the benchmark logs?
Indeed, the 1001+ means BMW applied (see this nice blog post about it, thank you @jpountz!).
So this means your first test ((-A *:*) (-B *:*) -> -(+A +B) *:*) is also testing "BMW applies" against "BMW does not apply", which is maybe not what we want to test? Or, I suppose, rewriting a query such that BMW becomes enabled could be seen as an awesome optimization! Still, I think we should disable BMW for all runs? Let's see which query form is faster when Lucene iterates many hits.
(I do wonder why it's 1001+ and not 1000+ but probably there is a harmless ob1 somewhere, heh).
Note that it's not stopping early -- the BMW optimization uses skipping to jump through the postings blocks to find possibly competitive (might crack into the top K) hits. It is an admissible optimization, meaning it is 100% correct (unless we have exciting bugs) and will always give the same top K hits as if the whole index was checked.
In your case the scores are quite synthetic (just 1f from the *:* query I suspect?) and so BMW likely is in fact able to stop after collecting the first 1001 hits (since we tie-break by docid). If you look in your results file you should see the scores of the K hits collected for each task.
Also I only see one query task in the logs, so couldn't compare 2 same queries against each other. Do we log only 1 query per task in the benchmark logs?
Hmm that's odd. Each specific query should have been executed N times by M threads, so you should see N * M instances of it in each iteration's results file. Can you post one of your results files for each of baseline and candidate?
Case 1 is causing MatchAllDocsQuery to run N times on the left, but once on the right (not really a fair comparison)
My understanding here was that we are trying to rewrite the query in such a way that the 2 queries produce the same result but one is more optimized (we could confirm the overlap numbers). So why do we still care about running MatchAllDocsQuery N times vs 1 time in case they give identical results? Or don't we?
OK indeed that is the goal here (rewrite queries to a faster form). I guess I feel like case 2) is the more common real-world example? The "search for stuff but filter stuff out" use case.
I suppose, rewriting a query such that BMW becomes enabled could be seen as an awesome optimization! Still, I think we should disable BMW for all runs? Let's see which query form is faster for Lucene iterating many hits?
Makes sense! Is there any straightforward way to disable BMW for luceneutil benchmarks, or for a single query? I couldn't find any helpful documentation around it, either here or in Apache Lucene.
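One per-query workaround that avoids patching Lucene itself (a hedged sketch against the Lucene 9.x collector API; luceneutil may or may not expose a hook for this) is to request an exact total hit count, which keeps BMW from skipping any documents:

```java
import java.io.IOException;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.TopScoreDocCollector;

static TopDocs searchWithoutBmw(IndexSearcher searcher, Query query) throws IOException {
  // totalHitsThreshold=Integer.MAX_VALUE forces exact counting, so scorers
  // may never skip non-competitive documents
  TopScoreDocCollector collector = TopScoreDocCollector.create(10, Integer.MAX_VALUE);
  searcher.search(query, collector);
  return collector.topDocs();
}
```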
Hmm that's odd. Each specific query should have been executed N times by M threads, so you should see N * M instances of it in each iteration's results file. Can you post one of your results files for each of baseline and candidate?
@mikemccand I uploaded one log file each for baseline and candidate in the recent commit (under the logs dir).
Note: the baseline and candidate checked-out code are exactly the same, since we are doing what the rewrite optimization would have achieved, but via 2 separate query tasks holding the opt/non-opt versions of the same queries.
If we grep the log files for the text "OrNegated2Terms " or "OrNegated2TermsOpt ", I see the two tasks using different queries, though each such query indeed has N*M (= 20) instances in a single log file. But what about the other queries of the task? ... If I'm right there are 500 queries per task, so I expected to find all of them, or at least more, but it looks like one query (random, out of the 500?) is picked per iteration (out of 20) and runs 20 (N*M) times, i.e. 20 different queries in total? This is why I wasn't able to compare a single query 1-1. Very likely I'm misinterpreting something here.
OK indeed that is the goal here (rewrite queries to faster form). I guess I feel like case 2) is the more common real-world example? "search for stuff but filter stuff out" use case.
I agree! The 2nd case looks like a more common use case compared to the 1st.
@mikemccand @jpountz I ran the benchmarks again and this time increased TOTAL_HITS_THRESHOLD in IndexSearcher.java to Integer.MAX_VALUE to disable BMW and not restrict counting to only 1000 hits. I see performance is improved in the case of the optimized queries (i.e. the *TermsOpt tasks). I also pasted the logs snippet to confirm the total hits are indeed similar and not restricted to 1000 like in the earlier run.
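For reference, the temporary hack amounts to a one-line change of the constant in Lucene's IndexSearcher.java (illustrative snippet; the exact placement may differ across versions):

```java
// In IndexSearcher.java -- benchmark-only change, not something to ship:
// before: private static final int TOTAL_HITS_THRESHOLD = 1000;
private static final int TOTAL_HITS_THRESHOLD = Integer.MAX_VALUE; // count all hits, no BMW skipping
```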
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value
OrNegated2Terms 1.74 (10.3%) 1.73 (9.7%) -0.3% ( -18% - 21%) 0.917
OrNegated2TermsOpt 5.85 (17.0%) 6.03 (17.1%) 3.0% ( -26% - 44%) 0.575
OrNegated4Terms 1.15 (12.0%) 1.15 (11.2%) -0.1% ( -20% - 26%) 0.969
OrNegated4TermsOpt 5.90 (17.4%) 6.09 (17.3%) 3.1% ( -26% - 45%) 0.576
OrNegated6Terms 0.81 (11.7%) 0.81 (10.7%) -0.1% ( -20% - 25%) 0.974
OrNegated6TermsOpt 5.93 (17.4%) 6.12 (17.4%) 3.2% ( -26% - 46%) 0.558
OrNegated8Terms 0.68 (13.6%) 0.68 (12.7%) -0.1% ( -23% - 30%) 0.980
OrNegated8TermsOpt 5.94 (17.5%) 6.13 (17.5%) 3.2% ( -27% - 46%) 0.564
AndNegated2Terms 4.22 (11.1%) 4.29 (11.7%) 1.6% ( -19% - 27%) 0.653
AndNegated2TermsOpt 4.60 (12.4%) 4.67 (12.9%) 1.7% ( -21% - 30%) 0.679
AndNegated4Terms 2.07 (4.8%) 2.09 (5.6%) 1.1% ( -8% - 12%) 0.522
AndNegated4TermsOpt 2.51 (6.1%) 2.54 (6.8%) 1.3% ( -10% - 15%) 0.534
AndNegated6Terms 2.03 (4.8%) 2.05 (5.6%) 1.0% ( -8% - 11%) 0.524
AndNegated6TermsOpt 2.58 (6.3%) 2.61 (6.9%) 1.3% ( -11% - 15%) 0.524
AndNegated8Terms 2.06 (4.7%) 2.09 (5.4%) 1.1% ( -8% - 11%) 0.485
AndNegated8TermsOpt 2.29 (5.4%) 2.32 (6.1%) 1.2% ( -9% - 13%) 0.506
grep "TASK: cat=OrNegated2Terms " logs/* | uniq
baseline_vs_patch.baseline.0:TASK: cat=OrNegated2Terms q=(-body:eric *:*) (-body:kansas *:*) s=null group=null hits=33331640 facets=[]
baseline_vs_patch.baseline.1:TASK: cat=OrNegated2Terms q=(-body:eric *:*) (-body:kansas *:*) s=null group=null hits=33331640 facets=[]
grep "TASK: cat=OrNegated2TermsOpt " ~/luceneutil-repo/logs/* | uniq
baseline_vs_patch.baseline.0:TASK: cat=OrNegated2TermsOpt q=-(+body:middle +body:area_code) *:* s=null group=null hits=33331952 facets=[]
baseline_vs_patch.baseline.1:TASK: cat=OrNegated2TermsOpt q=-(+body:middle +body:area_code) *:* s=null group=null hits=33331952 facets=[]
grep "TASK: cat=AndNegated2Terms " ~/luceneutil-repo/logs/* | uniq
baseline_vs_patch.baseline.0:TASK: cat=AndNegated2Terms q=-body:history -body:mountain *:* s=null group=null hits=30947732 facets=[]
baseline_vs_patch.baseline.1:TASK: cat=AndNegated2Terms q=-body:history -body:mountain *:* s=null group=null hits=30947732 facets=[]
grep "TASK: cat=AndNegated2TermsOpt " ~/luceneutil-repo/logs/* | uniq
baseline_vs_patch.baseline.0:TASK: cat=AndNegated2TermsOpt q=-(body:ben body:26) *:* s=null group=null hits=31603551 facets=[]
baseline_vs_patch.baseline.1:TASK: cat=AndNegated2TermsOpt q=-(body:ben body:26) *:* s=null group=null hits=31603551 facets=[]
These gains indeed look correct (identical hit counts from the X and XOpt tasks) and significant, especially for the N-term OR cases! Thanks @shubhamvishu.
I thought we had a Lucene issue to explore having Lucene rewrite BQ clauses, e.g. -A -B -C -> -(A B C)? Can you link it here?
Actually, I'm not sure how the OrNegatedNTerms rewrite case would work? It's a strange query, e.g. q=(-body:eric *:*) (-body:kansas *:*). I'm not sure this really happens in practice very often?
Maybe we first explore rewriting the AndNegatedNTerms case?
So I changed the new tasks to count-only tasks, as was pointed out as an alternative in #265, and pushed the tasks file I used and the logs file. For the AndNegatedNTerms case, I changed the tasks from e.g. count(-most -september *:*) (non optimized) to count(-(most september)) *:* (optimized), and I see no improvement with this rewriting. Though in the less realistic OrNegatedNTerms case the improvements are consistent with the earlier runs. @mikemccand does this way make more sense, or could we stick to the earlier results we saw when increasing TOTAL_HITS_THRESHOLD in IndexSearcher.java to Integer.MAX_VALUE?
Note: I tried writing the count(-most -september *:*) task as count(-(most september) *:*) but that seems to do what we want and results in 0 results, so I had to take the *:* part out of count().
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value
OrNegated2Terms 1.51 (5.6%) 1.47 (5.6%) -2.5% ( -12% - 9%) 0.160
OrNegated2TermsOpt 80.08 (18.9%) 77.57 (21.3%) -3.1% ( -36% - 45%) 0.622
OrNegated4Terms 3.35 (26.9%) 3.26 (25.6%) -2.7% ( -43% - 68%) 0.749
OrNegated4TermsOpt 73.31 (17.8%) 71.27 (20.3%) -2.8% ( -34% - 42%) 0.645
OrNegated6Terms 1.40 (18.3%) 1.37 (16.4%) -2.7% ( -31% - 39%) 0.622
OrNegated6TermsOpt 326.74 (55.5%) 303.54 (54.4%) -7.1% ( -75% - 231%) 0.683
OrNegated8Terms 0.60 (14.7%) 0.58 (13.6%) -4.0% ( -28% - 28%) 0.372
OrNegated8TermsOpt 105.69 (24.5%) 102.33 (26.7%) -3.2% ( -43% - 63%) 0.695
AndNegated2Terms 377.72 (59.5%) 351.59 (58.9%) -6.9% ( -78% - 275%) 0.712
AndNegated2TermsOpt 4.51 (4.1%) 4.50 (4.0%) -0.3% ( -8% - 8%) 0.835
AndNegated4Terms 3.91 (3.7%) 3.89 (3.6%) -0.5% ( -7% - 6%) 0.649
AndNegated4TermsOpt 4.14 (3.8%) 4.13 (3.3%) -0.3% ( -7% - 7%) 0.789
AndNegated6Terms 3.79 (3.5%) 3.79 (3.0%) -0.1% ( -6% - 6%) 0.911
AndNegated6TermsOpt 2.18 (3.0%) 2.17 (2.7%) -0.3% ( -5% - 5%) 0.724
AndNegated8Terms 7.06 (3.4%) 7.04 (3.3%) -0.4% ( -6% - 6%) 0.721
AndNegated8TermsOpt 3.08 (3.3%) 3.07 (2.9%) -0.2% ( -6% - 6%) 0.817
grep "TASK: cat=OrNegated2Terms " logs/* | uniq
logs/baseline_vs_patch.baseline.0:TASK: cat=OrNegated2Terms q=-body:basic *:* (-body:you *:*) countOnlyCount=33127647
logs/baseline_vs_patch.baseline.1:TASK: cat=OrNegated2Terms q=-body:basic *:* (-body:you *:*) countOnlyCount=33127647
grep "TASK: cat=OrNegated2TermsOpt " ~/luceneutil-repo/logs/* | uniq
/home/cshbha/luceneutil-repo/logs/baseline_vs_patch.baseline.0:TASK: cat=OrNegated2TermsOpt q=-(+body:677 +body:him) *:* countOnlyCount=33332481
/home/cshbha/luceneutil-repo/logs/baseline_vs_patch.baseline.1:TASK: cat=OrNegated2TermsOpt q=-(+body:677 +body:him) *:* countOnlyCount=33332481
grep "TASK: cat=AndNegated2Terms " ~/luceneutil-repo/logs/* | uniq
/home/cshbha/luceneutil-repo/logs/baseline_vs_patch.baseline.0:TASK: cat=AndNegated2Terms q=-body:id -body:former *:* countOnlyCount=29975741
/home/cshbha/luceneutil-repo/logs/baseline_vs_patch.baseline.1:TASK: cat=AndNegated2Terms q=-body:id -body:former *:* countOnlyCount=29975741
grep "TASK: cat=AndNegated2TermsOpt " ~/luceneutil-repo/logs/* | uniq
/home/cshbha/luceneutil-repo/logs/baseline_vs_patch.baseline.0:TASK: cat=AndNegated2TermsOpt q=-(body:old body:26) *:* countOnlyCount=30686467
/home/cshbha/luceneutil-repo/logs/baseline_vs_patch.baseline.1:TASK: cat=AndNegated2TermsOpt q=-(body:old body:26) *:* countOnlyCount=30686467
grep "TASK: cat=AndNegated4Terms " ~/luceneutil-repo/logs/* | uniq
/home/cshbha/luceneutil-repo/logs/baseline_vs_patch.baseline.0:TASK: cat=AndNegated4Terms q=-body:dreamed -body:film -body:23 -body:may *:* countOnlyCount=27339654
/home/cshbha/luceneutil-repo/logs/baseline_vs_patch.baseline.1:TASK: cat=AndNegated4Terms q=-body:dreamed -body:film -body:23 -body:may *:* countOnlyCount=27339654
grep "TASK: cat=AndNegated4TermsOpt " ~/luceneutil-repo/logs/* | uniq
/home/cshbha/luceneutil-repo/logs/baseline_vs_patch.baseline.0:TASK: cat=AndNegated4TermsOpt q=-(body:had body:5 body:can body:budget) *:* countOnlyCount=25254461
/home/cshbha/luceneutil-repo/logs/baseline_vs_patch.baseline.1:TASK: cat=AndNegated4TermsOpt q=-(body:had body:5 body:can body:budget) *:* countOnlyCount=25254461
I thought we had a Lucene issue to explore having Lucene rewrite BQ clauses e.g. -A -B -C -> -(A B C)? Can you link it here?
I don't remember coming across any open issue in the Lucene queue for this. I was thinking of doing that once we consolidate the findings here based on the benchmark results, and then pursue it in Lucene. Maybe I should open one right away?
Maybe we first explore rewriting the AndNegatedNTerms case?
If the above approach of disabling BMW using count() seems right, then I wonder why we are not seeing any gains in the AndNegatedNTerms case; with the earlier approach of maxing out TOTAL_HITS_THRESHOLD in IndexSearcher.java to Integer.MAX_VALUE the results look more reliable and less noisy to me. Looking for suggestions here.
Note: I tried writing the count(-most -september *:*) task as count(-(most september) *:*) but that seems to do what we want and results in 0 results, so I had to take the *:* part out of count().
Did you mean that seemed NOT to do what we want? Likely this is a bug in luceneutil's parsing of the query? count(-(most september) *:*) should produce non-zero hits. Can you try debugging the luceneutil parsing of this query -- e.g. print the resulting Query that's passed to IndexSearcher.count and see if it's correct?
But also in your grep output it seems like you did not take the *:* out of the AndNegatedNTermsOpt tasks? I'm confused...
Maybe we first explore rewriting the AndNegatedNTerms case?
If the above approach of disabling BMW using count() seems right, then I wonder why we are not seeing any gains in the AndNegatedNTerms case; with the earlier approach of maxing out TOTAL_HITS_THRESHOLD in IndexSearcher.java to Integer.MAX_VALUE the results look more reliable and less noisy to me. Looking for suggestions here.
Well, count() is dangerous because Lucene has sub-linear optimizations it may apply (e.g. the high QPS for AndNegated2Terms above looks like such a count opto is applying). Increasing TOTAL_HITS_THRESHOLD (temporarily, for benchmarking) is more of a guarantee that Lucene must iterate all the postings, so I trust these results more.
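As a hedged illustration of the kind of sub-linear shortcut being referred to (not the exact code path): for a single term on an index without deletions, IndexSearcher#count can be answered from index statistics instead of iterating postings:

```java
import java.io.IOException;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;

static void showSubLinearCount(IndexSearcher searcher) throws IOException {
  Term t = new Term("body", "history"); // term and field are made-up examples
  int viaCount = searcher.count(new TermQuery(t)); // may be answered from stats
  int viaStats = searcher.getIndexReader().docFreq(t); // the statistic itself
  System.out.println(viaCount + " vs docFreq=" + viaStats); // equal when there are no deletions
}
```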
Can you post one of your results files for each of baseline and candidate? I'm still confused why you see only one instance of each task in the benchmark output files ... and could you also post your perf.py script and the command-line you're using to invoke it? Thanks.
Did you mean that seemed NOT to do what we want? Likely this is a bug in luceneutil's parsing of the query? count(-(most september) *:*) should produce non-zero hits. Can you try debugging the luceneutil parsing of this query -- e.g. print the resulting Query that's passed to IndexSearcher.count and see if it's correct?
@mikemccand Yes, something seems to be going wrong with the way the task is parsed. I checked: the query passed to IndexSearcher#count for the task count(-(most september) *:*) is -(body:2001 body:left *:*), which is incorrect, as the MatchAllDocsQuery clause is moved within the negated clause, which leads to zero hits.
But also in your grep output it seems like you did not take the *:* out of the AndNegatedNTermsOpt tasks? I'm confused...
Here is how a task is mapped to a Lucene query and its hit count:

| # | Task | Lucene Query | Hits (countOnlyCount from logs) |
|---|------|--------------|---------------------------------|
| 1 | AndNegated2Terms: count(-most -september *:*) | -body:most -body:september *:* | 29781756 |
| 2 | AndNegated2TermsOpt: count(-(most september)) *:* | -(body:press body:non) *:* | 31064896 |
| 3 | ZeroHitsAndNegated2TermsOpt: count(-(most september) *:*) | -(body:2001 body:left *:*) | 0 |
The issue here seems to be with ZeroHitsAndNegated2TermsOpt, as it moves the *:* within the negated clause. I have added tasks for the ZeroHitsAndNegated2TermsOpt case in the new commit in case someone wants to take a look. Ideally we want the behaviour of the 2nd row, i.e. AndNegated2TermsOpt, to test the rewrite optimization, but since 3 was not working as expected I went ahead with 2 and moved the *:* out of count(). If this makes sense and 3 indeed seems wrong, maybe we should open an issue to fix it?
Increasing TOTAL_HITS_THRESHOLD (temporarily for benchmarking) is more of a guarantee that Lucene must iterate all the postings, so I trust these results more.
Great! So we could hope for some QPS improvement with this rewrite change, I guess. If the testing so far seems convincing enough, I could go ahead and prepare a PR for this in Apache Lucene, with the benchmarking methodology of maxing out IndexSearcher.TOTAL_HITS_THRESHOLD against the tasks added in this commit (task file: wikimedium.10M.nostopwords.tasks). Any thoughts? @mikemccand @jpountz
Can you post one of your results files for each of baseline and candidate? I'm still confused why you see only one instance of each task in the benchmark output files... and could you also post you perf.by script and the command-line you're using to invoke it? Thanks.
I have also attached both log files in the latest commit on this PR (under the logs folder). Yeah, I'm not sure why I still see only one occurrence in the logs, as I'm running vanilla benchmarks and haven't changed anything in the benchmarks code.
could you also post your perf.py script and the command-line you're using to invoke it?
I'm running the benchmarks using python3 src/python/localrun.py -source wikimediumall along with the changes in this PR (i.e. the newly added tasks). I'm using the vanilla localrun.py (I suppose you mean this by perf.py) and haven't changed anything in the benchmarks except the newly added tasks.
Hi @mikemccand @jpountz, I'm looking for a review of the above results. Any thoughts?
Looking for some thoughts here. cc - @mikemccand @jpountz
Added 2 new tasks for multi-keyword negated queries for performance comparison:

1. (-A -B -C -D)
2. -(A B C D)

UPDATE: See the comments below for new, updated results.

Benchmark results:
Observation: OrNegatedHighHighOpt is more than 100% faster than OrNegatedHighHigh, or conversely, the current implementation is >50% slower.
Full results:
I'm working on fixing this with BQ.rewrite. Will open an issue/PR in Lucene once I have something concrete, or I'll create one straight away. We can use the tasks added in this PR to test whether the rewrite change makes the performance of queries with many negated keywords comparable to the optimized version.
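A minimal sketch of the direction such a BQ.rewrite change could take (this is not an actual Lucene patch; accessor names assume the Lucene 9.x BooleanClause API, and a real rewrite would need more care around edge cases):

```java
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;

// Merge N MUST_NOT clauses into one negated disjunction:
// -A -B -C ... -> -(A B C ...), leaving all positive clauses untouched.
// Score-preserving because MUST_NOT clauses never contribute to the score.
static Query mergeMustNotClauses(BooleanQuery bq) {
  BooleanQuery.Builder rewritten = new BooleanQuery.Builder();
  BooleanQuery.Builder negatedDisjunction = new BooleanQuery.Builder();
  int mustNotCount = 0;
  for (BooleanClause clause : bq) {
    if (clause.getOccur() == Occur.MUST_NOT) {
      negatedDisjunction.add(clause.getQuery(), Occur.SHOULD);
      mustNotCount++;
    } else {
      rewritten.add(clause.getQuery(), clause.getOccur());
    }
  }
  if (mustNotCount <= 1) {
    return bq; // nothing to merge
  }
  rewritten.setMinimumNumberShouldMatch(bq.getMinimumNumberShouldMatch());
  rewritten.add(negatedDisjunction.build(), Occur.MUST_NOT);
  return rewritten.build();
}
```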
@mikemccand: If this change looks good, maybe we could also add negated query tasks to task files other than wikimediumall?