Closed mikemccand closed 5 years ago
+1 the patch looks good. Can you add an assert in the additional test that checks that the number of segments in the reader is 1 ?This is required since the test makes some assumptions regarding the distribution of the terms and this could be broken easily if we change the setUp for the entire test suite later.
[Legacy Jira: Jim Ferenczi (@jimczi) on Aug 01 2019]
It'd be clearer if TermsEnumDisjunctionMatchesIterator had a line of documentation somewhere that more explicitly points out that it is merely lazily wrapping the MatchesIterator because it may not be needed. Also maybe pulling out the initialization code to a method named as such, like init() would overall be clearer.
It's a shame FilterMatchesIterator cannot be used here, given all but one method simply delegates. We can't because the input is declared to be final. Do you think it's worth loosening that so that we can use it?
I confess I don't see how the test here validates the laziness. I anticipated you were going to create a boolean AND query including the MTQ and some other simple term query that sometimes doesn't match.
[Legacy Jira: David Smiley (@dsmiley) on Aug 01 2019]
Can you add an assert in the additional test that checks that the number of segments in the reader is 1?
getOnlyLeafReader()
already covers this, so I think we're OK.
It'd be clearer if TermsEnumDisjunctionMatchesIterator had a line of documentation somewhere that more explicitly points out that it is merely lazily wrapping the MatchesIterator because it may not be needed. Also maybe pulling out the initialization code to a method named as such, like init() would overall be clearer.
+1
It's a shame FilterMatchesIterator cannot be used here, given all but one method simply delegates. We can't because the input is declared to be final. Do you think it's worth loosening that so that we can use it?
This is a one-off, so I don't think it's worth loosening the requirements in FilterMatchesIterator yet. Maybe if we end up making more lazy iterators?
I confess I don't see how the test here validates the laziness. I anticipated you were going to create a boolean AND query including the MTQ and some other simple term query that sometimes doesn't match.
I'll add a comment to the test - the point is that documents 0 to 3 match several terms, and previously we would do multiple seeks when creating the Matches object for each document to load all of the matching sub-iterators; now with the laziness, we only seek to the first term and then short-cut return immediately, delaying the other seeks until a MatchesIterator is pulled. Docs 4 and 5 still require multiple seeks, because they either match on a later term (w5 as opposed to w1), or don't match at all, in which case we need to check every possible term to ensure that there's no match.
[Legacy Jira: Alan Woodward (@romseygeek) on Aug 05 2019]
Commit b5b78e0adeb9db9345b69abedabd8c5cd684df7b in lucene-solr's branch refs/heads/branch_8x from Alan Woodward https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=b5b78e0
LUCENE-8941: Build wildcard matches lazily
[Legacy Jira: ASF subversion and git services on Aug 07 2019]
Commit fa72da1c7112a2e5c259f1f0181e6b27766ed4ad in lucene-solr's branch refs/heads/master from Alan Woodward https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=fa72da1
LUCENE-8941: Build wildcard matches lazily
[Legacy Jira: ASF subversion and git services on Aug 07 2019]
When retrieving a Matches object from a multi-term query, such as an AutomatonQuery or TermInSetQuery, we currently find all matching term iterators up-front, to return a disjunction over all of them. This can be inefficient if we're only interested in finding out if anything matched, and are iterating over a different field to retrieve offsets.
We can improve this by returning immediately when the first matching term is found, and only collecting other matching terms when we start iterating.
Legacy Jira details
LUCENE-8941 by Alan Woodward (@romseygeek) on Jul 30 2019, resolved Aug 07 2019 Attachments: LUCENE-8941.patch (versions: 2)