Open mikemccand opened 5 years ago
I confess I do not have a very clean idea as to how this can be implemented: the typical usages of FieldComparator mandate that the user maintain a list of slots into the FieldComparator, which can implicitly be as bad in terms of size as the queue itself. FieldComparator provides a convenient API to allow comparisons between two values of the type maintained in the queue, which can form the basis of this observation.
Here is the first cut of proposal that I have in mind:
1) Deprecate compare(slot, slot) so that new implementations do not depend on this method, but rather use compare(T val, T val).
2) Start with some comparators (Numeric comparators?), get rid of the implicit priority queue and make the user maintain those values.
3) Make Numeric comparators track only the top and bottom values, as needed.
Note that I am treating NumericComparators as the starting point/example, but the approach should extend for other comparators as well.
With https://github.com/apache/lucene-solr/pull/831, getting values out of leaf comparators should be easy, so the logical step after this PR is to depend on compare (val, val) more than we rely on compare (slot, slot).
Happy to receive feedback and alternate proposals
[Legacy Jira: Atri Sharma (@atris) on Aug 14 2019]
This looks like a duplicate of LUCENE-8878?
I think all of us agree on the fact that it would be nice to have a simpler FieldComparator API. The challenge is that we don't want to trade too much efficiency. For instance the API you are proposing wouldn't work well with geo-distance sorting since it would require computing the actual distance for every new document, while the current implementation tries to be smart to first check a bounding box, and then compute a sort key that compares like the actual distance but is much cheaper to compute (see discussion on LUCENE-8878 for more details).
[Legacy Jira: Adrien Grand (@jpountz) on Aug 14 2019]
This looks like a duplicate of LUCENE-8878?
Not necessarily – 8878 targets refactoring the API to be simpler, whereas this Jira only targets removing the necessary condition that FieldComparators maintain their own priority queues. I believe this Jira compliments 8878.
I think all of us agree on the fact that it would be nice to have a simpler FieldComparator API. The challenge is that we don't want to trade too much efficiency. For instance the API you are proposing wouldn't work well with geo-distance sorting since it would require computing the actual distance for every new document, while the current implementation tries to be smart to first check a bounding box, and then compute a sort key that compares like the actual distance but is much cheaper to compute
Agreed, that is precisely why I suggested deprecating compare (slot, slot) instead of removing it completely. The idea is that comparators that require access to an internal PQ for whatever reasons are free to do so, but it should not be mandatory, and future comparators should not take on this dependency without understanding the tradeoffs
[Legacy Jira: Atri Sharma (@atris) on Aug 14 2019]
OK I got confused with the notion of deprecation, which suggests to me that we need to find a replacement, but reading your last message, my understanding is that you would like to introduce a sub class of FieldComparator that hides the fact that it maintains an implicit PQ, and make simple comparators extend this sub class instead of FieldComparator directly? I would be +1 to giving it a try and making the change if the perf hit is negligible.
[Legacy Jira: Adrien Grand (@jpountz) on Aug 14 2019]
you would like to introduce a sub class of FieldComparator that hides the fact that it maintains an implicit PQ, and make simple comparators extend this sub class instead of FieldComparator directly?
Yes, exactly.
Thanks for validating – I will work on a PR now.
[Legacy Jira: Atri Sharma (@atris) on Aug 14 2019]
While doing some perf tests, I realised that FieldComparators inherently maintain implicit priority queues for maintaining the sorted order of documents for the given sort order. This is wasteful especially in the case of a multi feature sort order and a large number of hits requested.
We should change this to have FieldComparators maintain only the top and bottom values, and use them as barriers to compare
Legacy Jira details
LUCENE-8950 by Atri Sharma (@atris) on Aug 14 2019