UnifiedHighlighter should use single OffsetEnum rather than List<OffsetEnum> [LUCENE-8145]

mikemccand commented 6 years ago

The UnifiedHighlighter deals with several different aspects of highlighting: finding highlight offsets, breaking content up into snippets, and passage scoring. It would be nice to split this up so that consumers can use them separately.

As a first step, I'd like to change the API of FieldOffsetStrategy to return a single unified OffsetsEnum, rather than a collection of them. This will make it easier to expose the OffsetsEnum of a document directly from the highlighter, bypassing snippet extraction and scoring.

Legacy Jira details

LUCENE-8145 by Alan Woodward (@romseygeek) on Jan 31 2018, resolved Feb 02 2018 Attachments: LUCENE-8145.patch Linked issues:

LUCENE-8283

Pull requests: https://github.com/apache/lucene-solr/pull/317

mikemccand commented 6 years ago

This patch renames FieldOffsetStrategy#getOffsetsEnums() to FieldOffsetStrategy#getOffsetsEnum, and changes the return value from List<OffsetsEnum> to OffsetsEnum directly.

FieldHighlighter is simplified a bit, particularly in terms of handling OffsetsEnum as a closeable resource. Scoring is delegated to the Passage itself, which now keeps track of the within-passage frequencies of its highlighted terms and phrases. A new MultiOffsetsEnum class deals with combining multiple OffsetsEnums using a priority queue. Because all offsets are iterated in order, Passage no longer needs to worry about sorting its internal hits.

The APIs for FieldOffsetStrategy, Passage and OffsetEnum have all changed slightly, but they're all pretty expert so I think this could be targeted at 7.3?

cc @dsmiley @jimczi

[Legacy Jira: Alan Woodward (@romseygeek) on Jan 31 2018]

mikemccand commented 6 years ago

I should add that this is just a refactoring, and in particular passage scores are not changed.

[Legacy Jira: Alan Woodward (@romseygeek) on Jan 31 2018]

mikemccand commented 6 years ago

This is a really nice improvement Alan! I once thought about replacing List<OffsetsEnum> with a PriorityQueue based one as you did here but I didn't follow through because I didn't know how the OffsetsEnum.weight would accommodate that. Here you've shifted that tracking to the Passage.

We can go further with your refactoring – making CompositeOffsetsPostingsEnum obsolete. Delete it and then change FieldOffsetStrategy.createOffsetsEnumsForAutomata so that it's final loop looks like this:

    for (int i = 0; i < automata.length; i++) {
      CharacterRunAutomaton automaton = automata[i];
      List<PostingsEnum> postingsEnums = automataPostings.get🛈;
      if (postingsEnums.isEmpty()) {
        continue;
      }
      // Build one OffsetsEnum exposing the automata.toString as the term, and the sum of freq
      BytesRef wildcardTerm = new BytesRef(automaton.toString());
      int sumFreq = 0;
      for (PostingsEnum postingsEnum : postingsEnums) {
        sumFreq += postingsEnum.freq();
      }
      for (PostingsEnum postingsEnum : postingsEnums) {
        results.add(new OffsetsEnum.OfPostings(wildcardTerm, sumFreq, postingsEnum));
      }
    }

And then overload the OfPostings constructor to take a manual "freq". Or Subclass OfPostings in-kind (e.g. OfPostingsWithFreqc ould be inner class even (I played with this a little; it was annoying that sumFreq couldn't be final but no biggie).

Unfortunately it seems we have no test that ensures the "freq" is correct for a highlighted wildcard query whose wildcard matches multiple terms of varying frequencies each in the document. We should add such a test.

I definitely like how FieldHighlighter is simpler. You can go further and remove FieldHighlighter.EMPTY too – Rob had used that to simplify the queue initial state logic that is now obsoleted with your change (you chose a boolean "started" flag instead).

It's a shame Passage is now fairly heavy with a BytesRefHash on it. I want to think about that a bit more.

The first place maybeAddPassage is called, it's guarded by an if condition. But that if condition can be removed as it is redundant with the same logic that maybeAddPassage starts with. You should copy along the comment that explains what -1 means.

Nitpicks:

FieldHighlighter:

highlightOffsetsEnums: Add braces to the "if" block that returns early. The "while" of do-while should be on the same line as the close bracket.
maybeAddPassage: passage ought to be the first parameter IMO. And add braces to the "if" block.

FieldOffsetStrategy: the javadocs on getOffsetsEnum should not say "remember to close them all" since it just returns one. so maybe "remember to close it"

MultiOffsetsEnum.close: I see it calls close on all the remaining OffsetsEnums on the queue... but at this point it's likely empty. Based on our implementations of OffsetsEnum this won't be an issue but I think it's bad to leave it this way. I think nextPosition could be modified to close the "top" item when it reaches the end. close would then have a comment to mentioned the others have been closed already in nextPosition.

TestUnifiedHighlighterExtensibility: you removed calling p.setScore but I think we want to ensure all such methods are exposed thus enabling someone to fully use this Passage.

[Legacy Jira: David Smiley (@dsmiley) on Jan 31 2018]

mikemccand commented 6 years ago

Thanks for the review David! I'll put up another patch shortly with your suggestions.

re Automata - I agree that we can replace CompositeOffsetsPostingsEnum, but I wonder if we need to bother with the frequency summing? It would make more sense I think to preserve the freqs of the individual term matches, so that a rarer term is more relevant than a more frequent one. We don't do this with wildcard queries in general because of performance, but that's not an issue here.

Passage is heavier now, but the objects are re-used, and only n-fragments + 1 are build for each highlighted doc, so I'm not too concerned.

[Legacy Jira: Alan Woodward (@romseygeek) on Jan 31 2018]

mikemccand commented 6 years ago

but I wonder if we need to bother with the frequency summing?

It's debatable. Consider an aggressive MTQ like st* that hypothetically matches a lot of terms that each occur one time. Passages with those terms will be scored higher than a term query that matched twice.

It would be cool if we could further affect the passage score by a term's string-distance to the automata string. For example if "st" would have it's score dampened quite a bit if it matches "strangelyLongWord" but say only a small dampening for "stir". Artificially increasing the frequency would be one way, albeit less flexible than some other hook. If we had something like this, it'd probably matter less how accurate the frequency is since I think people would want to dampen the score for any MTQ.

Hmmm. With if Passage.setScore remains a simple setter, but we add PassageScorer.computeScore(Passage, int contentLength)? We'd need to expose more data from Passage that you added, granted, but it sure adds some flexibility!

CC @Timothy055

[Legacy Jira: David Smiley (@dsmiley) on Jan 31 2018]

mikemccand commented 6 years ago

Thanks for the CC @dsmiley.

@romseygeek really nice change! Definitely simplifies things quite a bit and conceptually one meta OffsetEnum over the field makes more sense than the list from previous.

I'm in favor of keeping the summed frequency on MTQ or at least preserving a mechanism to keep it on. The extra occurrences may not always seem spurious in all cases. For example, consider "expert" systems where users are accustomed to using wildcards for stemming-like expressions. E.g. purchas* for getting variants of the word purchase. In those cases, the extra frequency counts would hopefully select a better passage.

I'm not so sure about setScore being passed in a scorer and content length to set the score though. That feels awkward to me. If we were to keep it this way, I'd argue a Passage should receive the PassageScorer and content length at construction instead of via the setScore method. If we did that, I think we could incrementally build the score instead of tracking terms and frequencies for a later score calculation? Another choice is to move a lot of scoring behavior and perhaps introduce another class that's tracking the terms and score in a passage analagous to Weight?

[Legacy Jira: Timothy M. Rodriguez on Feb 01 2018]

mikemccand commented 6 years ago

I've opened a pull request on github: https://github.com/apache/lucene-solr/pull/317

Much easier to review than a series of patches...

[Legacy Jira: Alan Woodward (@romseygeek) on Feb 01 2018]

mikemccand commented 6 years ago

Thanks Alan, I left some comments in the pull request.

[Legacy Jira: Jim Ferenczi (@jimczi) on Feb 01 2018]

mikemccand commented 6 years ago

Commit aa5c3a5c222aef4cab7a4fe0c2f95a3eef7f4bd7 in lucene-solr's branch refs/heads/branch_7x from @romseygeek https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=aa5c3a5

LUCENE-8145: FieldOffsetStrategy.getOffsetEnum() now returns a single MultiOffsetsEnum

Closes #317

[Legacy Jira: ASF subversion and git services on Feb 02 2018]

mikemccand commented 6 years ago

Commit 9422609a5d4dc2838256c743d480f472740dfd25 in lucene-solr's branch refs/heads/master from @romseygeek https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=9422609

LUCENE-8145: FieldOffsetStrategy.getOffsetEnum() now returns a single MultiOffsetsEnum

Closes #317

[Legacy Jira: ASF subversion and git services on Feb 02 2018]

mikemccand commented 6 years ago

Thanks for the comments and reviews everyone!

[Legacy Jira: Alan Woodward (@romseygeek) on Feb 02 2018]

mikemccand / stargazers-migration-test

UnifiedHighlighter should use single OffsetEnum rather than List<OffsetEnum> [LUCENE-8145] #146

Legacy Jira details