IndexOrDocValuesQuery can take a bad decision for range queries if field has many values per document [LUCENE-8990]

IndexOrDocValuesQuery can take a bad decision for range queries if field has many values per document [LUCENE-8990] #987

Closed mikemccand closed 5 years ago

mikemccand commented 5 years ago

The heuristics of IndexOrDocValuesQuery are somewhat inconsistent for range queries. The leadCost that is provided is based on the number of documents, whereas the cost() of a range query is based on the number of points that potentially match the query.

As a result, a BKD tree may hold millions of points that correspond to just a few documents. We may then decide to execute the query using docValues while in fact scanning almost all the points.

Maybe the cost() function for range queries needs to take into account the average number of points per document in the tree and adjust the value accordingly.

Legacy Jira details

LUCENE-8990 by Ignacio Vera (@iverase) on Sep 26 2019, resolved Oct 04 2019
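
To make the unit mismatch described above concrete, here is a minimal sketch in plain Java. It does not use the Lucene API: the class name, the method `useIndex`, and the factor of 8 are assumptions made for illustration only, and the real decision logic in IndexOrDocValuesQuery may differ.

```java
// Hypothetical model of the index-vs-doc-values choice; not Lucene code.
final class LeadCostSketch {

    // Assumed rule for illustration: prefer the points index unless the range
    // query's estimated cost is much larger than the lead cost.
    // The factor 8 is an assumption, not a value taken from this issue.
    static boolean useIndex(long rangeCost, long leadCost) {
        return rangeCost / 8 <= leadCost;
    }
}
```

Suppose a field has about 1,000 points per document, the range matches roughly 2,000,000 points that belong to about 2,000 documents, and the lead cost is 10,000 documents. With a point-based cost, `useIndex(2_000_000L, 10_000L)` is false, so the sketch falls back to doc values. With a document-based cost, `useIndex(2_000L, 10_000L)` is true. The same query flips its decision depending only on which unit the cost is expressed in.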

mikemccand commented 5 years ago

+1, I think that is a good heuristic – strangely enough, I was thinking of this limitation for a similar problem.

Would it suffice if we just made PointRangeQuery also consider the BKDReader's docCount, in addition to pointCount? e.g. (cost = values.estimatePointCount() / values.estimateDocCount())?

[Legacy Jira: Atri Sharma (@atris) on Sep 26 2019]

mikemccand commented 5 years ago

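I was thinking of something more like:

// Average number of points per document; the cast avoids integer division.
double pointsPerDoc = (double) values.size() / values.getDocCount();
// Scale the estimated point count down to an estimated document count.
long estimatedDocCount = (long) (values.estimatePointCount(visitor) / pointsPerDoc);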

Maybe that can be abstracted out as a new method in PointValues like estimateDocCount().

[Legacy Jira: Ignacio Vera (@iverase) on Sep 26 2019]
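
The adjustment proposed in the comment above can be sketched as a standalone function. This is a sketch under stated assumptions, not necessarily what the eventual commit implements: the class name, the parameter names, the zero guard, and the cap at the number of documents are all additions for illustration. The three inputs stand in for `values.estimatePointCount(visitor)`, `values.size()`, and `values.getDocCount()`.

```java
// Hypothetical helper; plain Java, independent of the Lucene API.
final class DocCountEstimateSketch {

    static long estimateDocCount(long estimatedPointCount, long totalPoints, long docsWithField) {
        // Assumed guard: an empty field cannot match any document.
        if (docsWithField <= 0 || totalPoints <= 0) {
            return 0L;
        }
        // Average number of points per document (floating point on purpose).
        double pointsPerDoc = (double) totalPoints / docsWithField;
        // Scale the point estimate down to documents, rounding up.
        long estimate = (long) Math.ceil(estimatedPointCount / pointsPerDoc);
        // Assumed cap: never report more documents than have the field.
        return Math.min(estimate, docsWithField);
    }
}
```

For example, with 10,000,000 points spread over 10,000 documents, an estimate of 2,000,000 matching points scales down to 2,000 documents, which is in the same unit as leadCost.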

mikemccand commented 5 years ago

I am happy to take a crack at this if you are not planning to do so.

[Legacy Jira: Atri Sharma (@atris) on Sep 26 2019]

mikemccand commented 5 years ago

Thanks Atri!

[Legacy Jira: Ignacio Vera (@iverase) on Sep 26 2019]

mikemccand commented 5 years ago

Hey @atris,

In the end I got carried away and I am opening a PR for this issue. Let me know what you think!

[Legacy Jira: Ignacio Vera (@iverase) on Sep 27 2019]

mikemccand commented 5 years ago

Commit 9942544a7fc9f1abfb70d70e7ebfe275134222f4 in lucene-solr's branch refs/heads/master from Ignacio Vera https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=9942544

LUCENE-8990: Add estimateDocCount(visitor) method to PointValues (#905)

[Legacy Jira: ASF subversion and git services on Oct 04 2019]

mikemccand commented 5 years ago

Commit e4ceb9763f1ca64d0603747add87646eec78c368 in lucene-solr's branch refs/heads/branch_8x from Ignacio Vera https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=e4ceb97

LUCENE-8990: Add estimateDocCount(visitor) method to PointValues (#905)

[Legacy Jira: ASF subversion and git services on Oct 04 2019]

mikemccand commented 5 years ago

Thanks @jpountz and @colings for the help!

[Legacy Jira: Ignacio Vera (@iverase) on Oct 04 2019]