mikemccand / stargazers-migration-test

Testing Lucene's Jira -> GitHub issues migration
0 stars 0 forks source link

Expose OffsetsEnum directly from UnifiedHighlighter [LUCENE-8158] #159

Open mikemccand opened 6 years ago

mikemccand commented 6 years ago

Now that OffestsEnum is unitary, we can expose it directly from the UnifiedHighlighter, allowing clients to use them in place of snippets.


Legacy Jira details

LUCENE-8158 by Alan Woodward (@romseygeek) on Feb 05 2018 Attachments: LUCENE-8158.patch, LUCENE-8158-alt.patch

mikemccand commented 6 years ago

This patch adds an OffsetProducer abstraction which handles offsets for an individual Query.  Clients can call UnifiedHighlighter.getOffsetProducer(Query) to get one, and then call .getFieldOffsets() to get the offsets for a particular document and field.

I've tried to keep the API change as minimal as possible, which means keeping a reference to the UnifiedHighlighter and calling back to overrideable methods to create Automata lists and PhraseHelpers, but ideally we could pull all this into the OffsetProducer as well, and have the overrideable methods there.

[Legacy Jira: Alan Woodward (@romseygeek) on Feb 05 2018]

mikemccand commented 6 years ago

Here's an alternative, considerably more invasive patch, which moves all Offsets-related code into OffsetProducer.

[Legacy Jira: Alan Woodward (@romseygeek) on Feb 05 2018]

mikemccand commented 6 years ago

This looks interesting Alan.  I looked at both patches.  Lets consider the more invasive one. I like that UnifiedHighlighter is shorter as it was getting quite long, and this looks like a reasonable way to break it up.

BTW I've been thinking of a refactoring similar to your proposal here. It would be nice if there was a class that held the Query and the things we extract from it: position-insensitive terms (or all terms), MTQ automata, position-sensitive queries. It's job would be to extract all this information from the Query, and that's it. Today it has these aforementioned things but it would be a nice option to extract the term->boost as well so that we can customize the relevancy based on boosts in the query (like the original Highlighter & FVH can). Eventually it would be nice to "visit" the query tree once directly in this proposed class instead of several times (weight.extractTerms, MultiTermHighlighting, PhraseHelper). That would be the extent of the purpose of this thingamajig – perhaps called QueryExtractor. An instance of this class would go on each FieldOffsetStrategy and would obviate the need to pass so many arguments to these classes and for them to have as many fields as they do (not that there's that much but still). OffsetProducer here slightly overlaps with this idea but the ideas could be combined. OP could accept a QueryExtractor and would then have some less stuff to deal with internally since QueryExtractor extracts terms, automata, PhraseHelper, etc. not OP.

[Legacy Jira: David Smiley (@dsmiley) on Feb 05 2018]