anarchivist closed 6 years ago
In that case, how is a multiterm query any different than a phrase query?
In a phrase query, the terms are found in sequence and each sequence counts as a single hit. If individual terms from that phrase appear alone elsewhere on the page, they are not highlighted and don't count as hits.
ANDed terms don't have to be in sequence or even near each other on the page, but they must all be present on the page.
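To make the distinction concrete, here is a minimal Python sketch of the two behaviours described above. It is only an illustration over plain page text, not how the search service is implemented; the helper names (`tokenize`, `phrase_hits`, `and_match`) and the sample sentences are made up for this example.

```python
def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    cleaned = (tok.strip('.,;:"()').lower() for tok in text.split())
    return [tok for tok in cleaned if tok]


def phrase_hits(page_text, phrase):
    """Return token offsets where the phrase terms occur in sequence."""
    tokens, terms = tokenize(page_text), tokenize(phrase)
    return [
        i
        for i in range(len(tokens) - len(terms) + 1)
        if tokens[i:i + len(terms)] == terms
    ]


def and_match(page_text, query):
    """True when every query term occurs somewhere on the page, in any order."""
    tokens = set(tokenize(page_text))
    return all(term in tokens for term in tokenize(query))


# Hypothetical page text: all three terms are present, but never in sequence.
scattered = "The crimes to which the court refers were committed against all of humanity."
# Hypothetical page text: the terms appear once as a phrase.
sequential = "crimes against humanity, and other crimes"

print(phrase_hits(scattered, "crimes against humanity"))   # []
print(and_match(scattered, "crimes against humanity"))     # True
print(phrase_hits(sequential, "crimes against humanity"))  # [0]
```

In the phrase case the trailing "crimes" on the second page is not part of any hit, which mirrors the point that stray terms from the phrase are not highlighted.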
Ref #56 and #42
I just posted an example of this in #55 today.
{
"@type": "search:Hit",
"annotations": [
"https://purl.stanford.edu/cc842mn9348/iiif/canvas/cc842mn9348_40/text/at/1660.33,1094.11,140.42,33.89",
"https://purl.stanford.edu/cc842mn9348/iiif/canvas/cc842mn9348_40/text/at/1824.15,1094.11,163.82,33.89"
],
"before": "11,46.81,33.89",
"after": "humanity,"
},
{
"@type": "search:Hit",
"annotations": [
"https://purl.stanford.edu/cc842mn9348/iiif/canvas/cc842mn9348_40/text/at/1660.33,1094.11,140.42,33.89",
"https://purl.stanford.edu/cc842mn9348/iiif/canvas/cc842mn9348_40/text/at/1824.15,1094.11,163.82,33.89",
"https://purl.stanford.edu/cc842mn9348/iiif/canvas/cc842mn9348_40/text/at/2011.38,1094.11,210.62,33.89"
],
"before": "",
"after": ""
},
"ocrtext_en": [
"11,46.81,33.89 <em>crimes☞1660.33,1094.11,140.42,33.89</em> <em>against☞1824.15,1094.11,163.82,33.89</em> humanity,☞2011",
"89,44.90,32.48 <em>crimes☞944.31,1334.89,134.70,32.48</em> <em>against☞1123.92,1334.89,157.15,32.48</em> humanity,☞1303",
"78,44.90,32.48 <em>crimes☞1977.04,1529.78,134.70,32.48</em>\n<em>against☞674.90,1594.74,157.15,32.48</em> humanity,☞854.51",
"67,44.90,32.48 <em>crimes☞1505.58,1724.67,134.70,32.48</em> <em>against☞1662.73,1724.67,157.15,32.48</em> humanity,☞1842",
"56,44.90,32.48 <em>crime☞1595.38,1919.56,112.25,32.48</em> <em>against☞1730.08,1919.56,157.15,32.48</em> humanity,☞1909",
"00 the☞1058.94,573.00,65.32,33.00 <em>crimes☞1146.04,573.00,130.65,33.00</em> to☞1298.46,573.00,43.55,33.00 which☞1363"
],
"ocrtext_en": [
"<em>crimes☞1660.33,1094.11,140.42,33.89 against☞1824.15,1094.11,163.82,33.89 humanity,☞2011.38,1094.11,210.62,33.89</em>",
"<em>crimes☞944.31,1334.89,134.70,32.48 against☞1123.92,1334.89,157.15,32.48 humanity,☞1303.52,1334.89,202.06,32.48</em>",
"<em>crimes☞1977.04,1529.78,134.70,32.48\nagainst☞674.90,1594.74,157.15,32.48 humanity,☞854.51,1594.74,202.06,32.48</em>",
"<em>crimes☞1505.58,1724.67,134.70,32.48 against☞1662.73,1724.67,157.15,32.48 humanity,☞1842.34,1724.67,202.06,32.48</em>",
"<em>crime☞1595.38,1919.56,112.25,32.48 against☞1730.08,1919.56,157.15,32.48 humanity,☞1909.69,1919.56,202.06,32.48</em>"
],
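For anyone comparing the two outputs, here is a rough Python sketch of how the `<em>` spans in the `ocrtext_en` snippets appear to line up with the annotation URLs in the hits. The mapping is inferred from the pasted output only; the function names are hypothetical and this is not the service's actual code.

```python
import re

# Canvas URL taken from the annotation URLs pasted above.
CANVAS = "https://purl.stanford.edu/cc842mn9348/iiif/canvas/cc842mn9348_40"


def highlight_groups(snippet):
    """Return one list of x,y,w,h boxes per <em>...</em> span in a snippet."""
    groups = []
    for span in re.findall(r"<em>(.*?)</em>", snippet, flags=re.DOTALL):
        # Each token looks like word☞x,y,w,h; keep only the box part.
        boxes = [tok.split("☞", 1)[1] for tok in span.split() if "☞" in tok]
        groups.append(boxes)
    return groups


def annotation_urls(snippet, canvas=CANVAS):
    """Build text annotation URLs for every highlighted box (assumed URL pattern)."""
    return [
        f"{canvas}/text/at/{box}"
        for group in highlight_groups(snippet)
        for box in group
    ]


# First snippet from each of the two ocrtext_en arrays above.
unquoted = (
    "11,46.81,33.89 <em>crimes☞1660.33,1094.11,140.42,33.89</em> "
    "<em>against☞1824.15,1094.11,163.82,33.89</em> humanity,☞2011"
)
quoted = (
    "<em>crimes☞1660.33,1094.11,140.42,33.89 against☞1824.15,1094.11,163.82,33.89 "
    "humanity,☞2011.38,1094.11,210.62,33.89</em>"
)

print(len(annotation_urls(unquoted)))  # 2
print(len(annotation_urls(quoted)))    # 3
```

Read this way, the unquoted snippet only wraps "crimes" and "against" in `<em>`, so only two boxes come out, while the quoted snippet wraps all three words in a single `<em>`.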
Question: Why do we only get two highlights when searching without quotes?
It seems like the behavior is slightly different with the UnifiedHighlighter referenced on #55 ...
@jkeck asked me for clarification around expected outcomes or acceptance criteria for this ticket. First, I think this is (for now) analysis to confirm behavior. We should look into the following:
MUST appear within annos on a given canvas. For example, if you have the query UNTAET Regulation 2000/15, all three of those terms MUST appear as annos on that canvas, but they don't need to appear as a phrase.
humanity not being highlighted for crimes against humanity). My comment above indicates that I think the UnifiedHighlighter will change this, but I'm not certain.
I believe that in order to accomplish "all unquoted terms must exist in a single canvas in order for a result to appear" we may need to set the mm to 100% (or rather unset our current mm of 2<-1 5<-2 6<90%).
The default value of mm is 100% (meaning that all clauses must match).
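As a worked example of what the current mm means in practice, here is a small Python sketch that evaluates an mm spec for a given number of optional clauses. It follows the documented minimum-should-match rules as I understand them (conditional `n<value` parts, negative numbers meaning "all but", percentages rounded down); it is not Solr's code, the function name is made up, and it assumes no spaces around `<`.

```python
def min_should_match(clause_count, spec):
    """Number of optional clauses that must match under an mm spec (sketch)."""
    result = clause_count
    spec = spec.strip()
    if "<" in spec:
        # Conditional form: "n<value" applies only when there are more than n clauses.
        for part in spec.split():
            bound, value = part.split("<", 1)
            if clause_count <= int(bound):
                return result
            result = min_should_match(clause_count, value)
        return result
    if spec.endswith("%"):
        percent = int(spec[:-1])
        # Percentages are rounded down; a negative percent is the share that may be missing.
        amount = clause_count * abs(percent) // 100
        result = amount if percent >= 0 else clause_count - amount
    else:
        number = int(spec)
        # A negative number means "all but this many".
        result = number if number >= 0 else clause_count + number
    return max(0, min(result, clause_count))


current_mm = "2<-1 5<-2 6<90%"
print([min_should_match(n, current_mm) for n in (2, 3, 5, 6, 7, 10)])
# [2, 2, 4, 4, 6, 9]
print(min_should_match(3, "100%"))  # 3
```

Under this reading, a three-term query such as UNTAET Regulation 2000/15 only needs two of its terms to match with the current mm, whereas 100% would require all three.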
I can understand why this wouldn't normally be desirable in a discovery environment, but in the content search case (w/ an autocomplete available) perhaps this is what we would want.
i think in searchworks the mm applies over a threshold of 7 or 8 terms (can't remember where it is now) - that is, below the threshold, all terms must be present for the item to be found; above the threshold one or more terms may be missing. i think the same makes sense here. the examples i've seen quoted are all short - 3 or 4 terms - where 100% would be expected.
autocomplete adds a new wrinkle in that it changes user expectation. google returns docs that don't have all the terms in the selected autocomplete query, but they indicate in the results that words are missing.
What do we want to do about matches that cross page boundaries (if anything)? For example "crimes against humanity" exists as a phrase, but "crimes against" appear as the last words of one page, with "humanity" appearing as the first word of the next page?
In our case, if that did happen, "crimes against" would exist in Document A, and "humanity" would exist in Document B. Is there anything we could do about this?
are the documents indexed in the same way in the Spotlight search? each page individually? I'm wondering if we're setting up a mismatch in behaviour between Spotlight and the viewer.
The documents are indexed at the page/canvas level in content search because it is "search within this document" and the level of discovery is a canvas (e.g. I want to enter a search query and be returned pages that have that term on them).
Exhibits/Spotlight is search across as opposed to search within, so the indexing is done at the document level.
I'm not sure how we meet both these requirements simultaneously:
These seem to be opposite requirements to me.
So Spotlight is searching the document for the existence of three terms: A B C. They are ANDed, so must all be present in the document. Doc may be returned if A is on page 4, B on page 27, C on pages 3 and 15.
Move to the viewer, where user will expect to find the three terms in the document. They do the same search, they get no hits, because all three terms are not present on any one page.
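The mismatch can be sketched in a few lines of Python. This is only a toy model of the two indexing granularities, using the page numbers from the example above; the function names and the one-word page texts are placeholders.

```python
def terms_present(text, terms):
    """True when every term occurs as a whitespace-delimited word in the text."""
    words = set(text.lower().split())
    return all(term.lower() in words for term in terms)


def document_matches(pages, terms):
    """Document-level AND (search across): terms may be spread over any pages."""
    return terms_present(" ".join(pages.values()), terms)


def matching_pages(pages, terms):
    """Page-level AND (search within): every term must be on the same page."""
    return [number for number, text in pages.items() if terms_present(text, terms)]


# A on page 4, B on page 27, C on pages 3 and 15.
pages = {3: "C", 4: "A", 15: "C", 27: "B"}
query = ["A", "B", "C"]

print(document_matches(pages, query))  # True
print(matching_pages(pages, query))    # []
```

The document is found by the document-level search, but the page-level search returns no pages, which is the situation described above.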
That seems like a problem. @ggeisler, your thoughts?
I think both requirements above could be wrong for different reasons.
"ANDed terms don't have to be in sequence or even near each other on the page, but they must all be present on the page."
In the context of the viewer only, it technically makes sense, since the page is the unit of discovery and the unit of discovery should, in theory, include all ANDed terms. But it doesn't make sense in the overall flow where the user will find a document for their query, then potentially find no instances of that query within the document.
ANDed terms that are split across pages should return both pages.
Then we are essentially treating AND as a phrase. With an AND search you can reasonably expect other words to fall between the terms. How much proximity do we require to highlight terms across 2 pages?
Yeah, these can't both be true.
I agree it's not an ideal situation and is likely to cause confusion for the user who enters the same multiple terms in both Spotlight search and the viewer search. Given we're using a different unit of discovery in the two contexts, I don't have any ideas for a good solution.
I'm not 100% sure if this is possible, but I wonder if we (in discovery environments) index each canvas into individual multi-valued fields then configure solr to only consider returning results/highlights for hits w/i that instance of the field, and not necessarily in fields across the multiple values. positionIncrementGap aims to deal w/ this, although I think that may be for phrase queries only.
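To illustrate what a position gap between values does, here is a simplified Python model. It assumes each canvas is one value of a multi-valued field and that token positions simply skip `gap` extra slots between values; the real analysis chain is more involved, and the function names and sample texts are hypothetical.

```python
def assign_positions(values, gap):
    """Give each token a position, leaving `gap` extra positions between values."""
    positions = []
    next_pos = 0
    for value in values:
        for token in value.lower().split():
            positions.append((token, next_pos))
            next_pos += 1
        next_pos += gap
    return positions


def phrase_matches(positions, phrase):
    """True when the phrase terms sit at consecutive positions."""
    terms = phrase.lower().split()
    index = {}
    for token, pos in positions:
        index.setdefault(token, set()).add(pos)
    return any(
        all(start + i in index.get(term, ()) for i, term in enumerate(terms))
        for start in index.get(terms[0], ())
    )


# Hypothetical canvases: the phrase is split across the page boundary.
canvases = ["convicted of crimes against", "humanity under the regulation"]

print(phrase_matches(assign_positions(canvases, gap=0), "crimes against humanity"))    # True
print(phrase_matches(assign_positions(canvases, gap=100), "crimes against humanity"))  # False
```

In this model the gap only changes position-sensitive matching. A plain AND check never looks at positions, so it would still see terms from different canvases as belonging to the same document, which fits the concern that the setting may only help for phrase queries.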
Are we at the point that we need to do some more analysis of the desired behavior? I feel like we're beginning to define it, but not sure it has been explicitly documented anywhere (or maybe that exists somewhere, and we can use that to generate some acceptance criteria).
From 2/20 sprint planning: the work that remains is fixing the bug that @aeschylus and @jkeck identified, and exploring the implementation of the unified highlighter.
When I have a multiterm query where terms are separated by spaces that is not a phrase query wrapped in quotes, page-level matches include cases where all the terms are matched (i.e., the terms should be ANDed together).
Discussed w/ @jvine and @ggeisler on 1/31.