mikemccand / stargazers-migration-test

Testing Lucene's Jira -> GitHub issues migration
0 stars 0 forks source link

Can we simplify conjunctions of range queries automatically? [LUCENE-8708] #707

Open mikemccand opened 5 years ago

mikemccand commented 5 years ago

BooleanQuery#rewrite already has some logic to make queries more efficient, such as deduplicating filters or rewriting boolean queries that wrap a single positive clause to that clause.

It would be nice to also simplify conjunctions of range queries, so that eg. foo: [5 TO *] AND foo:[* TO 20] would be rewritten to foo:[5 TO 20]. When constructing queries manually or via the classic query parser, it feels unnecessary as this is something that the user can fix easily. However if you want to implement a query parser that only allows specifying one bound at once, such as Gmail (after:2018-12-31 https://support.google.com/mail/answer/7190?hl=en) or GitHub (updated:>=2018-12-31 https://help.github.com/en/articles/searching-issues-and-pull-requests#search-by-when-an-issue-or-pull-request-was-created-or-last-updated) then you might end up with inefficient queries if the end user specifies both an upper and a lower bound. It would be nice if we optimized those automatically.


Legacy Jira details

LUCENE-8708 by Adrien Grand (@jpountz) on Feb 25 2019, updated May 01 2019 Attachments: interval_range_clauses_merging0704.patch

mikemccand commented 5 years ago

We could extend this approach to identify overlapping ranges ([5, 20], [15, 35] can be converted to 5 to 35).

 

I can take a crack at this one, if you are not planning to actively work on it

[Legacy Jira: Atri Sharma (@atris) on Feb 25 2019]

mikemccand commented 5 years ago

Agreed it'd be nice to have this sort of case optimized as well. I have seen it happen with automatically-generated queries sometimes.

I'm not going to work actively on it in the near future. Feel free to give it a try to see how this could look like.

[Legacy Jira: Adrien Grand (@jpountz) on Feb 25 2019]

mikemccand commented 5 years ago

Attached is a WIP patch for the same. There is one existing test which needs to be refactored to comply with the new API, which I will do before the final commit. The intent of this patch is to get early feedback and potential blockers.

This commit introduces the concept of ToString interface. While a bit of a controversial change, ToString interface is necessary to allow creation of new Range Queries of a given type post the merge. I am happy to replace it with any other alternatives that seem more sane.

interval_range_clauses_merging0704.patch

[Legacy Jira: Atri Sharma (@atris) on Apr 07 2019]

mikemccand commented 5 years ago

Thanks Atri for giving it a try! This change is a bit too invasive to my taste given that this is only a nice feature to have. That said I don't really have ideas how to make it better... 

[Legacy Jira: Adrien Grand (@jpountz) on Apr 09 2019]

mikemccand commented 5 years ago

Just an idea maybe bias for my background.

One of the issues here is that we visit the tree for each range and this is what we are trying to improve. Maybe adding a query that can accept more than one range with a logical relationship ('AND', 'OR',...) might be less invasive and encapsulates the logic.

[Legacy Jira: Ignacio Vera (@iverase) on Apr 09 2019]

mikemccand commented 5 years ago

@ivera Thanks, that makes sense. I have created an issue for the same: https://issues.apache.org/jira/browse/LUCENE-8769

 

However, I think that we should still optimize overlapping ranges as this issue proposes so that existing users also get the performance advantage.

 

@jpountz Any thoughts on how we could simplify the patch?

[Legacy Jira: Atri Sharma (@atris) on Apr 18 2019]

mikemccand commented 5 years ago

Hmm why do we need the PointRangeQuery.ToStringInterface?  Also, why did we need to comment on that one test case – testInvalidPointLength?

[Legacy Jira: Michael McCandless (@mikemccand) on Apr 30 2019]

mikemccand commented 5 years ago

ToStringInteface is a necessary evil required to create new PointRangeQuery instances post the merging of the interval. In the current state of affairs, BooleanQuery has no visibility into the type of the range query that it is dealing with (which is a great thing). However, that limits the ability to create new range queries of the parent type directly. ToStringInterface allows a polymorphic way to allow new range queries to be created in BooleanQuery.rewrite.

 

Regarding testInvalidPointLength, that test needs to be refactored to work with ToStringInterface. However, since that is not a blocker to the actual functionality, I chose not to spend time on it until we have more clarity on the direction in which we wish to head.

[Legacy Jira: Atri Sharma (@atris) on May 01 2019]