Closed by mikemccand 5 years ago
+1, I think that is a good heuristic – strangely enough, I was thinking of this limitation for a similar problem.
Would it suffice if we just made PointRangeQuery also consider the BKDReader's docCount, in addition to pointCount? e.g. (cost = values.estimatePointCount() / values.estimateDocCount())?
[Legacy Jira: Atri Sharma (@atris) on Sep 26 2019]
I was thinking of something more like:
double pointsPerDoc = (double) values.size() / values.getDocCount();  // cast avoids integer division
double docEstimate = values.estimatePointCount(visitor) / pointsPerDoc;
Maybe that can be abstracted out as a new method in PointValues, like estimateDocCount().
[Legacy Jira: Ignacio Vera (@iverase) on Sep 26 2019]
+1
I am happy to take a crack at this if you are not planning to do so.
[Legacy Jira: Atri Sharma (@atris) on Sep 26 2019]
Thanks Atri!
[Legacy Jira: Ignacio Vera (@iverase) on Sep 26 2019]
Hey @atris,
In the end I got carried away and I am opening a PR for this issue. Let me know what you think!
[Legacy Jira: Ignacio Vera (@iverase) on Sep 27 2019]
Commit 9942544a7fc9f1abfb70d70e7ebfe275134222f4 in lucene-solr's branch refs/heads/master from Ignacio Vera https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=9942544
LUCENE-8990: Add estimateDocCount(visitor) method to PointValues (#905)
[Legacy Jira: ASF subversion and git services on Oct 04 2019]
Commit e4ceb9763f1ca64d0603747add87646eec78c368 in lucene-solr's branch refs/heads/branch_8x from Ignacio Vera https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=e4ceb97
LUCENE-8990: Add estimateDocCount(visitor) method to PointValues (#905)
[Legacy Jira: ASF subversion and git services on Oct 04 2019]
Thanks @jpountz and @colings for the help!
[Legacy Jira: Ignacio Vera (@iverase) on Oct 04 2019]
The heuristics of IndexOrDocValuesQuery are somewhat inconsistent for range queries. The leadCost that is provided is based on the number of documents, whereas the cost() of a range query is based on the number of points that potentially match the query.
It can therefore happen that a BKD tree has millions of points, but these points correspond to just a few documents. In that case we may decide to execute the query using docValues while in fact we are scanning almost all the points.
Maybe the cost() function for range queries needs to take into account the average number of points per document in the tree and adjust the value accordingly.
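To make the mismatch concrete, here is a minimal, self-contained sketch of the adjustment suggested above: divide the estimated number of matching points by the average number of points per document. The class name, method name, and numbers are hypothetical and chosen only for illustration; this does not use the Lucene API and is not necessarily the formula that was eventually committed.

```java
public final class PointCostSketch {

    /**
     * Hypothetical helper: scales an estimated matching-point count down to an
     * estimated matching-document count, assuming points are spread evenly
     * across documents. The result is capped at the total number of documents.
     */
    public static long estimateDocCount(long estimatedPointCount, long totalPoints, int docCount) {
        if (docCount <= 0 || totalPoints <= 0) {
            return 0L; // empty field: nothing can match
        }
        // Cast to double so the average is not truncated by integer division.
        double pointsPerDoc = (double) totalPoints / docCount;
        long estimate = (long) Math.ceil(estimatedPointCount / pointsPerDoc);
        return Math.min(estimate, docCount);
    }

    public static void main(String[] args) {
        // Illustrative numbers: 2 million points that belong to only 10 documents.
        long totalPoints = 2_000_000L;
        int docCount = 10;
        long matchingPoints = 1_000_000L; // point-based cost of the range query

        long docEstimate = estimateDocCount(matchingPoints, totalPoints, docCount);

        System.out.println("point-based cost: " + matchingPoints);
        System.out.println("doc-based cost:   " + docEstimate);
    }
}
```

With these numbers the point-based cost is a million while the document-based estimate is a handful of documents, so only the latter is in the same unit as a leadCost counted in documents.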
Legacy Jira details
LUCENE-8990 by Ignacio Vera (@iverase) on Sep 26 2019, resolved Oct 04 2019