mikemccand / luceneutil

Various utility scripts for running Lucene performance tests
Apache License 2.0
Add an option to disable BMW optimization for benchmarks #265

Open shubhamvishu opened 5 months ago

shubhamvishu commented 5 months ago

Description

Looking for more ideas on this!

jpountz commented 5 months ago

Could you use tasks where dynamic pruning doesn't apply instead of disabling it? E.g. use counting tasks?

mikemccand commented 5 months ago

Could you use tasks where dynamic pruning doesn't apply instead of disabling it? E.g. use counting tasks?

+1, that's a nice approach. Though I think even Lucene's count() API has some nice sub-linear optimizations that bypass visiting all postings?

jpountz commented 5 months ago

Indeed IndexSearcher#count has some optimizations to bypass postings. But it was mostly an example, some cheap faceting should work too?

shubhamvishu commented 5 months ago

Could you use tasks where dynamic pruning doesn't apply instead of disabling it? E.g. use counting tasks?

Do you mean wrapping the clauses with "count( )", e.g. as in https://github.com/mikemccand/luceneutil/blob/master/tasks/countOnly.tasks, so that we check the performance but avoid BMW? I like this idea if I understand it correctly, but I'm not sure we could straightforwardly make it an option for the benchmarks.

Indeed IndexSearcher#count has some optimizations to bypass postings. But it was mostly an example, some cheap faceting should work too?

I'm not sure what you mean by using some cheap faceting here. Maybe you could elaborate on this idea? Also, since we want to enable it via benchmarks, does this also fit well in that picture?

mikemccand commented 5 months ago

Indeed IndexSearcher#count has some optimizations to bypass postings. But it was mostly an example, some cheap faceting should work too?

I'm not sure what you mean by using some cheap faceting here. Maybe you could elaborate on this idea? Also, since we want to enable it via benchmarks, does this also fit well in that picture?

I think @jpountz is referring to enabling faceting on each task. luceneutil's TaskParser supports this with e.g. +facets:Date.sortedset. Because facets require counting all hits, it forces Lucene to disable BMW. The problem is, it also adds some cost (I think that's why @jpountz suggested finding a "cheap" one heh), which is not great because it dilutes what you are trying to measure (a change in postings decode / visit time).

Could you use tasks where dynamic pruning doesn't apply instead of disabling it? E.g. use counting tasks?

Do you mean wrapping the clauses with "count( )", e.g. as in https://github.com/mikemccand/luceneutil/blob/master/tasks/countOnly.tasks, so that we check the performance but avoid BMW? I like this idea if I understand it correctly, but I'm not sure we could straightforwardly make it an option for the benchmarks.

luceneutil supports count tasks with syntax like count(+a +b). This is parsed to use IndexSearcher's count API. I think that may be a quick workaround for benchmarking https://github.com/mikemccand/luceneutil/pull/258
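As a rough illustration of the count(...) workaround, here is a minimal Python sketch that rewrites ordinary task lines into count tasks. It assumes each task line looks like `Name: query # comment`; that layout, the `Count` name prefix, and the helper name `to_count_task` are assumptions made for this example, not something taken from luceneutil itself.

```python
def to_count_task(line: str) -> str:
    """Rewrite 'Name: query # comment' as 'CountName: count(query) # comment'.

    Assumed task-line layout; adjust if the real tasks file differs.
    """
    stripped = line.strip()
    # Leave blank lines and comment-only lines untouched.
    if not stripped or stripped.startswith("#"):
        return stripped
    # Split on the first ':' only, so field:term queries stay intact.
    name, sep, rest = stripped.partition(":")
    if not sep:
        return stripped
    query, hash_sep, comment = rest.partition("#")
    query = query.strip()
    # Do not double-wrap lines that are already count tasks.
    if query.startswith("count("):
        return stripped
    out = f"Count{name.strip()}: count({query})"
    if hash_sep:
        out += f" # {comment.strip()}"
    return out


if __name__ == "__main__":
    for task in ["AndHighHigh: +a +b # freq=10", "Term: body:lucene"]:
        print(to_count_task(task))
```

Something like this could generate a count-only variant of an existing tasks file, so the same queries run through the count API path instead of top-k collection.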

shubhamvishu commented 5 months ago

Thanks for the explanation, Mike! I'll try benchmarking the change using count tasks and share the results. Btw, if the above-mentioned approach of maxing out IndexSearcher.TOTAL_HITS_THRESHOLD also makes sense, then I had already shared the results for it over here.
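To make the threshold idea concrete, here is a toy Python model of a top-k collector with a total-hits threshold. It is only a sketch under simplifying assumptions: each "doc" is just a score, skipping happens one doc at a time rather than per block, and `collect_top_k` is a hypothetical name, not Lucene's collector code. The point is that pruning can only start once the threshold of exactly counted hits is reached, so a very large threshold means every hit is scored.

```python
import heapq


def collect_top_k(scores, k, total_hits_threshold):
    """Return (top_k_scores, docs_scored) for a toy top-k collector.

    Toy model only: a doc is skipped when pruning is allowed and its
    score cannot beat the current k-th best score.
    """
    heap = []          # min-heap holding the k best scores seen so far
    hits_counted = 0   # hits counted exactly before pruning may start
    docs_scored = 0    # docs that were collected rather than skipped
    for score in scores:
        pruning_allowed = hits_counted >= total_hits_threshold and len(heap) == k
        if pruning_allowed and score <= heap[0]:
            continue  # non-competitive doc; a real engine would skip blocks
        docs_scored += 1
        hits_counted += 1
        if len(heap) < k:
            heapq.heappush(heap, score)
        elif score > heap[0]:
            heapq.heapreplace(heap, score)
    return sorted(heap, reverse=True), docs_scored


if __name__ == "__main__":
    scores = [5, 1, 4, 2, 9, 3, 1, 8, 2, 7]
    # Small threshold: pruning kicks in early and some docs are skipped.
    print(collect_top_k(scores, k=2, total_hits_threshold=3))
    # Huge threshold (mimicking a maxed-out setting): nothing is skipped.
    print(collect_top_k(scores, k=2, total_hits_threshold=2**31 - 1))
```

Both calls return the same top scores; only the number of docs scored differs, which is the cost a benchmark without dynamic pruning would be measuring.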