knaw-huc / textannoviz

GNU General Public License v3.0

Discrepancy in full-text search results #134

Open kintopp opened 1 month ago

kintopp commented 1 month ago

I've come across an unexpected discrepancy between local (i.e. based on a downloaded text version of a Globalise inventory) and online search results in the Transcriptions Viewer.

As an example, I’ve attached 9945.txt. If you do a whole word search in it using your favourite text editor, you’ll find 37 instances of "reuk" (i.e. as a whole word, not a compound). But if you carry out the same search online in the transcriptions viewer, you only get 13 hits:

Now take a look at line 4177 of the attached text file. Here, you’ll find this line, which is on page 77 of that inventory:

droog en met een benaude Reuk, dierhalven niet

Page 77 never shows up at all in the results (linked above) of the Transcriptions viewer. But if you navigate to inventory 9945, page 0077 in the viewer, you can see it's right there, a few lines down from the top: https://transcriptions.globalise.huygens.knaw.nl/detail/urn:globalise:NL-HaNA_1.04.02_9945_0077

So what is going on? At first, I thought, aha, the Transcriptions viewer has suddenly (due to a misconfiguration?) become case sensitive, since page 77 says, “een benaude Reuk”. But that’s not the issue. If I limit my own full-text search on the downloaded file to “Reuk” I get 7 results for inv. 9945. That’s still not the same as the Transcription viewer’s 13 hits. Moreover, if you look again at what it is finding, you can see quite clearly that it’s also highlighting upper-case instances of the word (see the results for page 78, for example).

I tried another test. I asked the Transcriptions Viewer to find “droog en met een benaude Reuk, dierhalven niet” (without the quotes), limited to inv. 9945. Now it did find that line, and even put it where it belonged, right at the top of the search results. I then tried the same thing again (with the quotes, as an exact phrase) and this also worked correctly. Finally, the discrepancy also shows up without the inventory filter. I don't have the exact figures with me now, but if I search for "reuk" locally, across all text inventory files, I get significantly more results than if I carry out the same search in the Globalise Transcriptions Viewer.
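For reference, the local count described above can be reproduced with a short script. This is only a minimal sketch: it assumes "whole word" means a regex word boundary (`\b`), which may differ slightly from what a given text editor does, and the function name `count_whole_word` is made up for illustration.

```python
import re

def count_whole_word(text: str, word: str, ignore_case: bool = True) -> int:
    """Count whole-word occurrences of `word` in `text`.

    "Whole word" is approximated with regex word boundaries, so "Reuk,"
    counts as a match but a compound such as "reukwerk" does not.
    """
    flags = re.IGNORECASE if ignore_case else 0
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", flags)
    return len(pattern.findall(text))

# The line from page 77 quoted above.
sample = "droog en met een benaude Reuk, dierhalven niet"
print(count_whole_word(sample, "reuk"))                     # case-insensitive
print(count_whole_word(sample, "reuk", ignore_case=False))  # case-sensitive
```

To run it on a downloaded inventory, read the file into a string first (for example `open("9945.txt", encoding="utf-8").read()`; the encoding is an assumption) and pass that as `text`.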

kintopp commented 1 month ago

Here's another example. If I search just for bandar (with or without quotes) in the Transcriptions Viewer, I get 415 results. If I search the same HTRv2 files locally on my Mac for bandar, whole-word, case-insensitive, I get 543 results. The three inventories with the highest number of instances are:

1103 (21)
1112 (21)
1111 (19)

If I search for bandar in the Transcriptions Viewer, each time limiting my search to these three inventory numbers, I get:

1103 (17)
1112 (15)
1111 (17)

So let's look at the other end of the scale. Based on my local copy of inv. 9893, I should find one instance of bandar there. In the Transcriptions Viewer, I do not. But if I look at the found instance in my text file, I see it looks like this:

badde pattoe aan rdijejewickreme Bandar„

In the HTR this can be found here. The reading order is bad on this page, but if I adjust the position of the scan relative to the transcription, Bandar can be clearly seen:

[Screenshot 2024-08-04 at 13.43.07]

So one issue, at least, is that ES is treating the „ at the end of Bandar„ as being part of the word, and thus if you're looking for Bandar it's not finding it. This also applies to other trailing (and perhaps leading?) punctuation characters such as ¬ that are 'attached' to words. Take a look at inv. 1108 (the numbers after the 1108 are the line numbers of the text file from Dataverse of that inventory):

1108:6063: Cargatoen, ende den Coopman Gerrit Corsz naer bandar¬
1108:14414: leijden uijt Bandar Gamcon weder om naer
1108:14687: en Stadt Ogle bandar leggende opde reviere
1108:43173: ende met soo een Notabelen parthije zyde In Bandar Gamron
1108:43183: In Bandar Conge: ganderen, Door welcken abundanten toeboer
1108:43310: wat marct uwe coopmanschappen, Jn Bandar Gamron, Caelen
1108:43370: ende tydelyck In Bandar Gamron, affcomen, om soo veel vande
1108:43456: alder buijtersten vlt:o Novembr, In Bandar Gamron, om
1108:43688: In Bandar Gamron, ofte op apparentie van avantagienser
1108:49571: Bengala, Cogle Bandar), gelegen aende reviere ganges getu„
1108:49582: andere oock ogle Bandar met onse t achten te bevaren, ende
1108:50306: onse aengebrachte cargezoenen meest in Bandar gamron tegens
1108:55447: desselfs bandar, ende op de sijne stroomen
1108:57449: Coopluijden van menichvuldige plaetsen in Bandar Gemoon comen om haere

There are 14 results here, but the Transcriptions Viewer lists 10. Three of the four it did not appear to find are:

bandar¬
Bandar),
bandar,

But all the remaining instances of bandar look completely normal. So what is going on? There is a small, additional factor at play here. If you look again at the Transcriptions Viewer results for bandar in inv. 1108 you'll see that it says it found 10 results but actually shows 11 results in that inventory (14 minus the three examples above with trailing non-letter characters). That, in turn, may be because on one page (1108:0952) the word shows up twice, but is only counted once in the results.
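The effect of attached punctuation on the inv. 1108 lines can be checked with a few lines of Python. This is a rough sketch, not how ES works internally: it assumes the index splits text on whitespace only and lowercases tokens (the viewer does appear to be case-insensitive), and the helper name `whitespace_tokens` is made up for illustration.

```python
# The 14 lines from inv. 1108 listed above, without the "1108:<line>:" prefix.
LINES = [
    "Cargatoen, ende den Coopman Gerrit Corsz naer bandar¬",
    "leijden uijt Bandar Gamcon weder om naer",
    "en Stadt Ogle bandar leggende opde reviere",
    "ende met soo een Notabelen parthije zyde In Bandar Gamron",
    "In Bandar Conge: ganderen, Door welcken abundanten toeboer",
    "wat marct uwe coopmanschappen, Jn Bandar Gamron, Caelen",
    "ende tydelyck In Bandar Gamron, affcomen, om soo veel vande",
    "alder buijtersten vlt:o Novembr, In Bandar Gamron, om",
    "In Bandar Gamron, ofte op apparentie van avantagienser",
    "Bengala, Cogle Bandar), gelegen aende reviere ganges getu„",
    "andere oock ogle Bandar met onse t achten te bevaren, ende",
    "onse aengebrachte cargezoenen meest in Bandar gamron tegens",
    "desselfs bandar, ende op de sijne stroomen",
    "Coopluijden van menichvuldige plaetsen in Bandar Gemoon comen om haere",
]

def whitespace_tokens(line: str) -> list[str]:
    """Lowercase, then split on whitespace only; punctuation stays attached."""
    return line.lower().split()

# Lines that contain the exact token "bandar".
found = [line for line in LINES if "bandar" in whitespace_tokens(line)]

# Tokens that start with "bandar" but carry extra characters.
missed = [
    tok
    for line in LINES
    for tok in whitespace_tokens(line)
    if tok.startswith("bandar") and tok != "bandar"
]

print(len(found))
print(missed)
```

Under these assumptions, 11 of the 14 lines contain an exact `bandar` token, and the three that do not are exactly the ones with `¬`, `),` or `,` attached, which is consistent with the 11 results shown in the viewer.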

In any case, it seems clear that we need to offer our users a more detailed and fuller explanation of what can be searched for and what cannot, and consequently what counts as a 'result'. And then suggest workarounds, insofar as these are available.

svandaalen commented 1 month ago

Pardon my late reply. It's hectic at the moment with two nearing deadlines :).

As stated in my DM last week, the issue is twofold:

  1. The first issue is the ES tokeniser we are using. We use the whitespace tokeniser, which leaves punctuation in the token, as you can see in the example in the ES documentation. The standard tokeniser strips punctuation from the token (again, see the example in its documentation). Switching to the standard tokeniser will (most likely) fix the problem where words with attached punctuation are not found. We will experiment with this later this year. For now, this can be sort of 'fixed' by searching for something like reuk*, but this will not work for everything.

  2. The second issue is the way ES gives back hits. ES works with documents, meaning that when a user searches for something with ES, ES will return all the documents in which that query has a hit. For Globalise, the ES document equals a page from an inventory. If a word occurs more than once on that page, it's still one hit for ES because the document is the hit, not each individual matching word in that document. If I change your reuk query from above to reuk* to circumvent the punctuation issue (https://transcriptions.globalise.huygens.knaw.nl/?indexName=docs-2024-03-18&fragmentSize=100&from=0&size=10&sortBy=_score&sortOrder=desc&query=eyJkYXRlRnJvbSI6IjE1MDAtMDEtMDEiLCJkYXRlVG8iOiIxODAwLTAxLTAxIiwicmFuZ2VGcm9tIjoiMCIsInJhbmdlVG8iOiIzMDAwMCIsImZ1bGxUZXh0IjoicmV1ayoiLCJ0ZXJtcyI6eyJpbnZOciI6WyI5OTQ1Il19fQ%3D%3D), you see that reuk* occurs on 16 pages. As you can also see, reuk* often occurs more than once on a page. If you count all the hits of reuk* in TAV, you will count 37 individual hits, so TAV does find all instances of reuk* in this inventory. I am currently unsure whether ES can return all individual hits. We might make it clearer for the user by tackling #72. This issue will be worked on later this year.
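Both points can be illustrated with a small simulation in plain Python. This is a sketch under stated assumptions and does not call ES: `whitespace_tokens` mimics splitting on whitespace plus lowercasing, `word_tokens` uses the regex `\w+` as a rough stand-in for the standard tokeniser (the real one follows Unicode text segmentation rules), and the page ids and two of the three page texts are made up.

```python
import re

def whitespace_tokens(text):
    """Split on whitespace only; punctuation stays attached to the token."""
    return text.lower().split()

def word_tokens(text):
    """Rough stand-in for the standard tokeniser: runs of word characters."""
    return re.findall(r"\w+", text.lower())

# One "document" per page, as in the Globalise index.
PAGES = {
    "page-a": "droog en met een benaude Reuk, dierhalven niet",  # quoted from page 77
    "page-b": "reuk en reuk",                                    # hypothetical text
    "page-c": "geen Reuk. maar reukwerk",                        # hypothetical text
}

def is_exact(token):
    return token == "reuk"

def is_prefix(token):
    # Mimics a trailing-wildcard query such as reuk*
    return token.startswith("reuk")

def count_hits(pages, tokenize, matches):
    """Return (matching documents, individual matching tokens)."""
    per_page = {
        page_id: sum(1 for tok in tokenize(text) if matches(tok))
        for page_id, text in pages.items()
    }
    documents = sum(1 for n in per_page.values() if n > 0)
    occurrences = sum(per_page.values())
    return documents, occurrences

print(count_hits(PAGES, whitespace_tokens, is_exact))   # whitespace, query "reuk"
print(count_hits(PAGES, whitespace_tokens, is_prefix))  # whitespace, query "reuk*"
print(count_hits(PAGES, word_tokens, is_exact))         # word tokens, query "reuk"
```

In this toy data the exact query on whitespace tokens matches only one page, the wildcard query matches all three but also picks up the compound "reukwerk", and the word-token variant matches all three pages without the compound. In every case the number of matching documents is lower than the number of individual occurrences, which is the second point above.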

If this issue has a high priority, please take it up with Hennie so he can try to fit it into our tight schedule :). I am available again on Tuesday.

kintopp commented 1 month ago

Thanks, Sebastiaan. No need to reply on the weekend! And super ironically, I just see now, consulting the Transcriptions Viewer's Help, that we'd already identified this back when the viewer launched in October last year, but that I'd forgotten about it. We'll discuss this some more inside Globalise and get back to you.

marijnkoolen commented 1 month ago

I'd like to bump the priority of this, as I'm fairly sure that few users expect punctuation to make a difference. It has a significant impact for REPUBLIC as well.

I don't know what the reason is for choosing the whitespace tokenizer, but hyphenated words (where I can see the benefit of leaving in 'punctuation') are rare in the resolutions, and punctuation is fairly common. Also, sometimes words are accidentally concatenated with punctuation in between (e.g. "doen.Is") by improper merging of lines, so I think that for REPUBLIC it is safe and almost always beneficial to switch to the ES default tokenizer.
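As a rough idea of what such a switch could look like, here is a sketch of an index definition that uses the standard tokenizer with a lowercase filter. The analyzer name `punct_insensitive` and the field name `text` are placeholders, not taken from the actual Globalise or REPUBLIC indices, and existing documents would need to be reindexed for a changed analyzer to take effect.

```python
import json

# Hypothetical index body: "punct_insensitive" and "text" are placeholder
# names, not the names used in the real index.
INDEX_BODY = {
    "settings": {
        "analysis": {
            "analyzer": {
                "punct_insensitive": {
                    "type": "custom",
                    "tokenizer": "standard",   # instead of "whitespace"
                    "filter": ["lowercase"],   # keeps search case-insensitive
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "text": {"type": "text", "analyzer": "punct_insensitive"}
        }
    },
}

# Print the JSON that would be sent when creating the index.
print(json.dumps(INDEX_BODY, indent=2))
```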

I'll let Hennie know.