Open mikemccand opened 6 years ago
Here's a pull request with my first sketch: https://github.com/apache/lucene-solr/pull/397
It's very minimal, needs lots of javadocs and testing, and doesn't score passages yet, but it should give an idea of what I'm trying to do.
[Legacy Jira: Alan Woodward (@romseygeek) on Jun 06 2018]
cc @jimczi @dsmiley
[Legacy Jira: Alan Woodward (@romseygeek) on Jun 06 2018]
@rcmuir do you have a comment on highlighting by field then doc vs. doc then field? I believe you chose this arrangement in the PostingsHighlighter (the ancestor of the UH), and AFAICT it is optimized for offsets in postings. I'm not sure how much it matters. I'm also surprised that the Matches API would have any impact on the distinction (as Alan implies it would), but I haven't looked closely at this patch yet to see.
I'll look at your PR, Alan. This is lighting a fire under my butt to continue LUCENE-8286 – battle of the highlighters ;-)
[Legacy Jira: David Smiley (@dsmiley) on Jun 06 2018]
This highlighter is impressive for not a lot of code! Great work @romseygeek! Some observations:
BTW some complexity in the UH that I don't see here is related to query tree visiting, such as for MultiTermQueries and also for getting all the terms (granted the latter is easy and not much code). This information is put to good use by building a MemoryIndex collecting only the pertinent terms and not bothering with the rest.
If this highlighter moves forward, I figure at some point you're going to have to address visiting/walking queries (e.g. to look for MTQs) and/or perhaps rewriting them. Consider these related issues: LUCENE-8184, LUCENE-8160, LUCENE-3041.
[Legacy Jira: David Smiley (@dsmiley) on Jun 21 2018]
I started trying to integrate the Matches API into the UnifiedHighlighter, but there's a fairly heavy impedance mismatch between the way the two work (e.g. Matches doesn't give you freqs and is entirely lazy, while the UH tries to do things by field rather than by doc). So instead, I thought I'd try to write a new highlighter based around Matches and see what it looks like.
Legacy Jira details
LUCENE-8349 by Alan Woodward (@romseygeek) on Jun 06 2018, updated Jun 21 2018