sul-dlss / content_search

IIIF Content Search API implementation for OCR in DOR

Confirm behavior of multiterm queries that are not phrase queries #63

Closed anarchivist closed 6 years ago

anarchivist commented 6 years ago

When I have a multiterm query whose terms are separated by spaces and that is not a phrase query wrapped in quotes, page-level matches should include only cases where all the terms are matched (i.e., the terms should be ANDed together).

Discussed w/ @jvine and @ggeisler on 1/31.

cbeer commented 6 years ago

In that case, how is a multiterm query any different than a phrase query?

jvine commented 6 years ago

In a phrase query, the terms are found in sequence and each sequence counts as a single hit. If individual terms from that phrase appear alone elsewhere on the page, they are not highlighted and don't count as hits.

ANDed terms don't have to be in sequence or even near each other on the page, but they must all be present on the page.
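
To make the distinction concrete, here is a minimal sketch of the two behaviours on a single page of text. It is plain Python rather than anything from content_search or Solr; the tokenizer and the function names `phrase_hits` and `and_match` are made up for illustration.

```python
def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    return [t.strip(".,;:!?\"'").lower() for t in text.split()]


def phrase_hits(page_tokens, query_tokens):
    """Start indexes where the query terms occur in sequence (one hit each)."""
    n = len(query_tokens)
    return [i for i in range(len(page_tokens) - n + 1)
            if page_tokens[i:i + n] == query_tokens]


def and_match(page_tokens, query_tokens):
    """True when every query term occurs somewhere on the page, in any order."""
    return set(query_tokens) <= set(page_tokens)


# Hypothetical page text: all three terms are present, but never in sequence.
page = tokenize("Crimes of war, and acts against all humanity, were tried.")
query = tokenize("crimes against humanity")

print(phrase_hits(page, query))  # no phrase hit on this page
print(and_match(page, query))    # but the page satisfies the ANDed query
```

On this sample page the phrase query finds nothing while the ANDed query matches, which is the difference described above.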

anarchivist commented 6 years ago

Ref #56 and #42

camillevilla commented 6 years ago

I just posted an example of this in #55 today.

IIIF search response

without quotes

{
"@type": "search:Hit",
"annotations": [
"https://purl.stanford.edu/cc842mn9348/iiif/canvas/cc842mn9348_40/text/at/1660.33,1094.11,140.42,33.89",
"https://purl.stanford.edu/cc842mn9348/iiif/canvas/cc842mn9348_40/text/at/1824.15,1094.11,163.82,33.89"
],
"before": "11,46.81,33.89",
"after": "humanity,"
},

with quotes

{
"@type": "search:Hit",
"annotations": [
"https://purl.stanford.edu/cc842mn9348/iiif/canvas/cc842mn9348_40/text/at/1660.33,1094.11,140.42,33.89",
"https://purl.stanford.edu/cc842mn9348/iiif/canvas/cc842mn9348_40/text/at/1824.15,1094.11,163.82,33.89",
"https://purl.stanford.edu/cc842mn9348/iiif/canvas/cc842mn9348_40/text/at/2011.38,1094.11,210.62,33.89"
],
"before": "",
"after": ""
},

Solr response

Without quotes

"ocrtext_en": [
"11,46.81,33.89 <em>crimes☞1660.33,1094.11,140.42,33.89</em> <em>against☞1824.15,1094.11,163.82,33.89</em> humanity,☞2011",
"89,44.90,32.48 <em>crimes☞944.31,1334.89,134.70,32.48</em> <em>against☞1123.92,1334.89,157.15,32.48</em> humanity,☞1303",
"78,44.90,32.48 <em>crimes☞1977.04,1529.78,134.70,32.48</em>\n<em>against☞674.90,1594.74,157.15,32.48</em> humanity,☞854.51",
"67,44.90,32.48 <em>crimes☞1505.58,1724.67,134.70,32.48</em> <em>against☞1662.73,1724.67,157.15,32.48</em> humanity,☞1842",
"56,44.90,32.48 <em>crime☞1595.38,1919.56,112.25,32.48</em> <em>against☞1730.08,1919.56,157.15,32.48</em> humanity,☞1909",
"00 the☞1058.94,573.00,65.32,33.00 <em>crimes☞1146.04,573.00,130.65,33.00</em> to☞1298.46,573.00,43.55,33.00 which☞1363"
],

with quotes

"ocrtext_en": [
"<em>crimes☞1660.33,1094.11,140.42,33.89 against☞1824.15,1094.11,163.82,33.89 humanity,☞2011.38,1094.11,210.62,33.89</em>",
"<em>crimes☞944.31,1334.89,134.70,32.48 against☞1123.92,1334.89,157.15,32.48 humanity,☞1303.52,1334.89,202.06,32.48</em>",
"<em>crimes☞1977.04,1529.78,134.70,32.48\nagainst☞674.90,1594.74,157.15,32.48 humanity,☞854.51,1594.74,202.06,32.48</em>",
"<em>crimes☞1505.58,1724.67,134.70,32.48 against☞1662.73,1724.67,157.15,32.48 humanity,☞1842.34,1724.67,202.06,32.48</em>",
"<em>crime☞1595.38,1919.56,112.25,32.48 against☞1730.08,1919.56,157.15,32.48 humanity,☞1909.69,1919.56,202.06,32.48</em>"
],
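
A rough sketch of how the highlighted tokens in these snippets line up with the annotation URLs in the IIIF hits. The `word☞x,y,w,h` token format and the `/text/at/` URL pattern are inferred from the excerpts in this comment, not taken from the content_search code, and the function names are hypothetical.

```python
import re

# Assumed token format, inferred from the snippets above: word☞x,y,w,h
NUM = r"\d+(?:\.\d+)?"
TOKEN = re.compile(rf"([^\s☞<>]+)☞({NUM}),({NUM}),({NUM}),({NUM})")
EM = re.compile(r"<em>(.*?)</em>", re.DOTALL)


def highlighted_boxes(snippet):
    """Return (word, 'x,y,w,h') for each complete token inside <em>...</em>."""
    boxes = []
    for span in EM.findall(snippet):
        for word, *coords in TOKEN.findall(span):
            # Keep coordinates as strings so they are reproduced exactly.
            boxes.append((word, ",".join(coords)))
    return boxes


def annotation_urls(canvas_uri, snippet):
    """Build one annotation URL per highlighted token (assumed URL pattern)."""
    return [f"{canvas_uri}/text/at/{box}" for _, box in highlighted_boxes(snippet)]


canvas = "https://purl.stanford.edu/cc842mn9348/iiif/canvas/cc842mn9348_40"
unquoted = ("11,46.81,33.89 <em>crimes☞1660.33,1094.11,140.42,33.89</em> "
            "<em>against☞1824.15,1094.11,163.82,33.89</em> humanity,☞2011")
quoted = ("<em>crimes☞1660.33,1094.11,140.42,33.89 "
          "against☞1824.15,1094.11,163.82,33.89 "
          "humanity,☞2011.38,1094.11,210.62,33.89</em>")

print(len(annotation_urls(canvas, unquoted)))  # 2
print(len(annotation_urls(canvas, quoted)))    # 3
```

Read this way, the two-versus-three difference is already present in the Solr response: in the unquoted snippet "humanity," sits outside the `<em>` tags and its coordinates are cut off at the snippet boundary.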

highlighting

Question: Why do we only get two highlights when searching without quotes?

without quotes

[Screenshot: sample search without quotes]

with quotes

[Screenshot: sample search with quotes]

anarchivist commented 6 years ago

It seems like the behavior is slightly different with the UnifiedHighlighter referenced on #55 ...

anarchivist commented 6 years ago

@jkeck asked me for clarification around expected outcomes or acceptance criteria for this ticket. First, I think this is (for now) analysis to confirm behavior. We should look into the following:

jkeck commented 6 years ago

I believe in order to accomplish "all unquoted terms must exist in a single canvas in order for a result to appear" we may need to set the mm to 100% (or rather unset our current mm of 2<-1 5<-2 6<90%)

https://lucene.apache.org/solr/guide/6_6/the-dismax-query-parser.html#TheDisMaxQueryParser-Themm_MinimumShouldMatch_Parameter

The default value of mm is 100% (meaning that all clauses must match).

I can understand why this wouldn't normally be desirable in a discovery environment, but in the content search case (w/ an autocomplete available) perhaps this is what we would want.
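
For reference, a small sketch of how an mm spec such as our current one resolves to a required number of clauses. This is my reading of the semantics described in the Solr guide linked above (conditional expressions apply only above their bound, percentages round down, negative values say how many clauses may be missing), written as standalone Python rather than Solr's own code; `min_should_match` is a made-up name.

```python
def min_should_match(clause_count, spec):
    """Number of optional clauses that must match under an mm spec."""

    def simple(expr):
        if expr.endswith("%"):
            percent = int(expr[:-1])
            # Percentages round down; a negative value is the share that may be missing.
            share = clause_count * abs(percent) // 100
            required = clause_count - share if percent < 0 else share
        else:
            value = int(expr)
            required = clause_count + value if value < 0 else value
        return max(0, min(clause_count, required))

    spec = spec.strip()
    if "<" not in spec:
        return simple(spec)

    required = clause_count  # at or below the first bound, every clause is required
    for condition in spec.split():
        bound, expr = condition.split("<", 1)
        if clause_count <= int(bound):
            break
        required = simple(expr)
    return required


current = "2<-1 5<-2 6<90%"
for n in (2, 3, 6, 10):
    print(n, min_should_match(n, current), min_should_match(n, "100%"))
```

Under this reading, a three-term query like crimes against humanity needs only two of its terms to match with the current setting, whereas 100% requires all three.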

jvine commented 6 years ago

i think in searchworks the mm applies over a threshold of 7 or 8 terms (can't remember where it is now) - that is, below the threshold, all terms must be present for the item to be found; above the threshold one or more terms may be missing. i think the same makes sense here. the examples i've seen quoted are all short - 3 or 4 terms - where 100% would be expected.

autocomplete adds a new wrinkle in that it changes user expectation. google returns docs that don't have all the terms in the selected autocomplete query, but they indicate in the results that words are missing.

aeschylus commented 6 years ago

What do we want to do about matches that cross page boundaries (if anything)? For example "crimes against humanity" exists as a phrase, but "crimes against" appear as the last words of one page, with "humanity" appearing as the first word of the next page?

jkeck commented 6 years ago

In our case, if that did happen, "crimes against" would exist in Document A, and "humanity" would exist in Document B. Is there anything we could do about this?

jvine commented 6 years ago

are the documents indexed in the same way in the Spotlight search? each page individually? I'm wondering if we're setting up a mismatch in behaviour between Spotlight and the viewer.

jkeck commented 6 years ago

The documents are indexed at the page/canvas level in content search because it is "search within this document" and the level of discovery is a canvas (e.g. I want to enter a search query and be returned pages that have that term on them).

Exhibits/Spotlight is search across as opposed to search within, so the indexing is done at the document level.

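I'm not sure how we meet both these requirements simultaneously:

"ANDed terms don't have to be in sequence or even near each other on the page, but they must all be present on the page."

"ANDed terms that are split across pages should return both pages."

These seem to be opposite requirements to me.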

jvine commented 6 years ago

So Spotlight is searching the document for the existence of three terms: A B C. They are ANDed, so must all be present in the document. Doc may be returned if A is on page 4, B on page 27, C on pages 3 and 15.

Move to the viewer, where user will expect to find the three terms in the document. They do the same search, they get no hits, because all three terms are not present on any one page.
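
The mismatch can be sketched with a toy document, assuming simple whitespace tokenization and AND semantics at both levels. The page texts and terms are made up to mirror the A/B/C example; nothing here reflects the actual Spotlight or content_search index configuration.

```python
def matches_all(terms, text):
    """AND semantics: every term must occur somewhere in the text."""
    words = set(text.lower().split())
    return all(term.lower() in words for term in terms)


# Hypothetical document: page number -> OCR text for that page
pages = {
    3: "gamma appears here",
    4: "alpha appears here",
    15: "gamma appears again",
    27: "beta appears here",
}
terms = ["alpha", "beta", "gamma"]

# Document-level index (search across): all pages form one unit of discovery.
document_hit = matches_all(terms, " ".join(pages.values()))

# Page-level index (search within): each page is its own unit of discovery.
page_hits = [number for number, text in pages.items() if matches_all(terms, text)]

print(document_hit)  # the document is found
print(page_hits)     # but no single page satisfies the same query
```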

That seems like a problem. @ggeisler, your thoughts?

I think both requirements above could be wrong for different reasons.

"ANDed terms don't have to be in sequence or even near each other on the page, but they must all be present on the page."

In the context of the viewer only, it technically makes sense, since the page is the unit of discovery and the unit of discovery should, in theory, include all ANDed terms. But it doesn't make sense in the overall flow where the user will find a document for their query, then potentially find no instances of that query within the document.

"ANDed terms that are split across pages should return both pages."

Then we are essentially treating AND as a phrase. With an AND search you can reasonably expect other words to fall between the terms. How much proximity do we require to highlight terms across 2 pages?

Yeah, these can't both be true.

ggeisler commented 6 years ago

I agree it's not an ideal situation and is likely to cause confusion for the user who enters the same multiple terms in both Spotlight search and the viewer search. Given we're using a different unit of discovery in the two contexts, I don't have any ideas for a good solution.

jkeck commented 6 years ago

I'm not 100% sure if this is possible, but I wonder if we (in discovery environments) index each canvas into individual multi-valued fields then configure solr to only consider returning results/highlights for hits w/i that instance of the field, and not necessarily in fields across the multiple values. positionIncrementGap aims to deal w/ this, although I think that may be for phrase queries only.
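
As a rough illustration of what a position increment gap does for phrase matching, here is a toy model in plain Python. It does not use Solr or Lucene; the gap of 100, the page texts, and the function names are illustrative assumptions.

```python
def index_positions(values, gap):
    """Assign a position to every token, skipping `gap` positions between values."""
    positions = {}
    pos = 0
    for value_index, value in enumerate(values):
        if value_index > 0:
            pos += gap
        for token in value.lower().split():
            positions[pos] = token
            pos += 1
    return positions


def phrase_match(positions, phrase):
    """True when the phrase terms sit at consecutive positions."""
    terms = phrase.lower().split()
    return any(
        all(positions.get(start + offset) == term for offset, term in enumerate(terms))
        for start in positions
    )


def and_match(positions, query):
    """True when every term occurs at some position; positions are ignored."""
    return set(query.lower().split()) <= set(positions.values())


# Hypothetical multi-valued field: one value per canvas.
canvases = ["the court heard crimes against", "humanity and other charges"]

print(phrase_match(index_positions(canvases, gap=0), "crimes against humanity"))
print(phrase_match(index_positions(canvases, gap=100), "crimes against humanity"))
print(and_match(index_positions(canvases, gap=100), "crimes against humanity"))
```

In this model the gap stops the phrase from matching across two canvases, but the ANDed query still matches across them because it never looks at positions, which fits the suspicion that the gap only helps for phrase queries.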

Are we at the point that we need to do some more analysis of the desired behavior? I feel like we're beginning to define it, but not sure it has been explicitly documented anywhere (or maybe that exists somewhere, and we can use that to generate some acceptance criteria).

anarchivist commented 6 years ago

From 2/20 sprint planning: the work that remains is fixing the bug that @aeschylus and @jkeck identified, and exploring the implementation of the unified highlighter.