mikemccand / stargazers-migration-test

Testing Lucene's Jira -> GitHub issues migration

PhraseWildcardQuery - new query to control and optimize wildcard expansions in phrase [LUCENE-8983] #980

Closed mikemccand closed 5 years ago

mikemccand commented 5 years ago

A generalized version of PhraseQuery, built with one or more MultiTermQuery instances that provide term expansions for multi-terms (at least one of the expanded terms must match).

Its main advantage is to control the total number of expansions across all MultiTermQuery and across all segments.

This query is similar to MultiPhraseQuery, but it handles, controls and optimizes the multi-term expansions.

This query is equivalent to building an ordered SpanNearQuery with a list of SpanTermQuery and SpanMultiTermQueryWrapper, but it optimizes the multi-term expansions and the segment accesses. It first resolves the single terms so that it can stop early if any of them does not match. Then it expands each multi-term sequentially, stopping immediately if one does not match. It detects the segments that do not match and skips them for the next expansions. This often avoids expanding the other multi-terms on some or even all segments. Finally, it controls the total number of expansions.
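To make the strategy described above easier to picture, here is a rough toy model in plain Java. It is only a sketch under assumptions, not the code from the PR: the class `PhraseExpansionSketch`, its `Result` holder and the `resolve` method are hypothetical names, each segment is reduced to a sorted set of terms, multi-terms are modeled as simple prefixes, and positions and slop are ignored entirely. The way the budget is shared between multi-terms is also a simplification (a single global counter).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Toy model of the expansion strategy; hypothetical, not Lucene code. */
class PhraseExpansionSketch {

    /** Hypothetical result holder: segments still candidate, and expansions consumed. */
    static final class Result {
        final List<Integer> candidateSegments;
        final int expansionsUsed;

        Result(List<Integer> candidateSegments, int expansionsUsed) {
            this.candidateSegments = candidateSegments;
            this.expansionsUsed = expansionsUsed;
        }
    }

    /**
     * segments: one sorted term dictionary per segment (assumption: no positions).
     * singleTerms: exact terms of the phrase; prefixes: stand-ins for MultiTermQuery.
     * maxExpansions: global cap on expanded terms across multi-terms and segments.
     */
    static Result resolve(List<TreeSet<String>> segments, List<String> singleTerms,
                          List<String> prefixes, int maxExpansions) {
        // Step 1: resolve single terms first; a segment missing one is dropped.
        List<Integer> candidates = new ArrayList<>();
        for (int i = 0; i < segments.size(); i++) {
            if (segments.get(i).containsAll(singleTerms)) {
                candidates.add(i);
            }
        }
        int used = 0;
        // Step 2: expand each multi-term in turn, only on segments still candidate.
        for (String prefix : prefixes) {
            if (candidates.isEmpty()) {
                break; // nothing can match any more: stop before expanding further
            }
            List<Integer> stillMatching = new ArrayList<>();
            for (int seg : candidates) {
                int found = 0;
                // Terms sharing the prefix are contiguous starting at tailSet(prefix).
                for (String term : segments.get(seg).tailSet(prefix)) {
                    if (!term.startsWith(prefix) || used >= maxExpansions) {
                        break;
                    }
                    used++;
                    found++;
                }
                if (found > 0) {
                    stillMatching.add(seg); // segment stays candidate for the next multi-term
                }
            }
            candidates = stillMatching;
        }
        return new Result(candidates, used);
    }

    public static void main(String[] args) {
        List<TreeSet<String>> segments = List.of(
            new TreeSet<>(List.of("brown", "fox", "foxes", "quick")),
            new TreeSet<>(List.of("brown", "quick")),
            new TreeSet<>(List.of("fox", "lazy")));
        Result r = resolve(segments, List.of("quick"), List.of("br", "fo"), 10);
        System.out.println(r.candidateSegments + " " + r.expansionsUsed);
    }
}
```

In this toy run, the third segment is dropped by the single-term check before any expansion happens, and the second segment is dropped once the `fo` prefix finds nothing there, so later multi-terms would never be expanded on either of them.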


Legacy Jira details

LUCENE-8983 by Bruno Roustant (@bruno-roustant) on Sep 18 2019, resolved Nov 28 2019 Linked issues:

mikemccand commented 5 years ago

@klaporte did you try this PhraseWildcardQuery? Do you have any feedback about it?

We will probably move it to lucene/sandbox.

[Legacy Jira: Bruno Roustant (@bruno-roustant) on Nov 14 2019]

mikemccand commented 5 years ago

Hi @bruno-roustant. I haven't yet. The team we're working with is reluctant to make modifications to the software at this point, as they have released to their beta clients. At present, we've shifted to testing this internally in the hope of making progress there.

[Legacy Jira: Ken LaPorte on Nov 14 2019]

mikemccand commented 5 years ago

I updated the PR.

Summary of the decision taken in the PR (see there for explanations):

[Legacy Jira: Bruno Roustant (@bruno-roustant) on Nov 21 2019]

mikemccand commented 5 years ago

I'll merge this PR within 2 days if there is no objection.

[Legacy Jira: Bruno Roustant (@bruno-roustant) on Nov 25 2019]

mikemccand commented 5 years ago

Commit 8485b5a939c5ffc4982dd338d59cdf090c5e1e58 in lucene-solr's branch refs/heads/master from Bruno Roustant https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=8485b5a

LUCENE-8983: Add PhraseWildcardQuery to control multi-terms expansions in phrase.

[Legacy Jira: ASF subversion and git services on Nov 27 2019]

mikemccand commented 5 years ago

Commit d764bf345e2789589fbead7df5838dc20247c577 in lucene-solr's branch refs/heads/branch_8x from Bruno Roustant https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=d764bf3

LUCENE-8983: Add PhraseWildcardQuery to control multi-terms expansions in phrase.

[Legacy Jira: ASF subversion and git services on Nov 27 2019]

mikemccand commented 5 years ago

This change seems to be causing test failures, e.g. https://jenkins.thetaphi.de/job/Lucene-Solr-8.x-Solaris/431/

[Legacy Jira: Adrien Grand (@jpountz) on Nov 28 2019]

mikemccand commented 5 years ago

"This change seems to be causing test failures"

Looking into this.

[Legacy Jira: Bruno Roustant (@bruno-roustant) on Nov 28 2019]

mikemccand commented 5 years ago

The randomization produced only one segment, although I thought I had ensured 2 segments even with randomization. To make the test more robust, I changed it to skip the segment-specific test counters and check only the query results and scores when there are not exactly 2 segments.

[Legacy Jira: Bruno Roustant (@bruno-roustant) on Nov 28 2019]

mikemccand commented 5 years ago

Commit 8bd5d7dd2edacc096805e9519656504f29ebd04e in lucene-solr's branch refs/heads/master from Bruno Roustant https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=8bd5d7d

LUCENE-8983: TestPhraseWildcardQuery more robust wrt randomization.

[Legacy Jira: ASF subversion and git services on Nov 28 2019]

mikemccand commented 5 years ago

Commit e35de979916774937d854009387208c200f35584 in lucene-solr's branch refs/heads/branch_8x from Bruno Roustant https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=e35de97

LUCENE-8983: TestPhraseWildcardQuery more robust wrt randomization.

[Legacy Jira: ASF subversion and git services on Nov 28 2019]

mikemccand commented 5 years ago

Thanks for looking so quickly @bruno-roustant.

[Legacy Jira: Adrien Grand (@jpountz) on Nov 28 2019]

mikemccand commented 4 years ago

Closing after the 8.4.0 release.

[Legacy Jira: Adrien Grand (@jpountz) on Dec 29 2019]