scientist-softserv / adventist_knapsack

Apache License 2.0

Keyword search in UV not working with Tesseract OCR #528

Open KatharineV opened 7 months ago

KatharineV commented 7 months ago

Users have reported an issue that I can also see. Although the UV has split documents, the Tesseract OCR is not searchable in the viewer or in the catalog search.

Here is a work that was uploaded several days ago, so processing is finished, but you can see that the UV search and the catalog search are not bringing up any full text results. This work has no attached TXT file of OCR results. It relies entirely on Tesseract, and that version of OCR is not working.

Screenshots show the catalog search (images 1 & 2) and then the in-viewer search (3 & 4)

1. Image 2. Image 3. Image 4. Image

KatharineV commented 6 months ago

Update: I can confirm that Tesseract isn't running as expected for recent uploads from the OAI issue ingest. This work has a keyword that appears in a catalog search (thanks to the attached OCR'd TXT file), but persistent highlighting fails when you open the work. Searching in the viewer also returns zero results for a word we know is present.

kirkkwang commented 6 months ago

@KatharineV thanks for confirming. So what I think is actually happening is that, because we installed the PDF.js logic, there's a process where the READER PDF's text (which was already OCR'd in the first place) gets indexed, and that's why it shows up in the catalog. I'm not sure if the TXT file's text actually gets indexed or not. This makes me think that maybe, as a backup plan, we can turn on PDF.js instead of the UV. However, again, the text search only works if the PDF already had text in it (like most of the READER files).

KatharineV commented 6 months ago

@kirkkwang That's interesting! I thought the TXT files were indexed, because those are the highest quality (sometimes hand-edited and improved) full-text files we have for some works. They're supposed to be useful for searching...not sure why we'd upload them if they aren't indexed, so I'd love to hear for sure if they are.

Also, we have many works that only have the Archival PDF or have PDFs without either "archival" or "reader" in the file name. It wouldn't be a good idea for us to build logic that relies on the Reader versions. Is there a way that we can ensure the works without Reader PDFs get their files (whatever the format and whatever the file name) OCR'd by Tesseract and indexed by the site so keywords can show up in the catalog search and the viewer? I think this is the behavior we'd planned in the past and it was working before. I could be wrong. Thanks for looking into it.

KatharineV commented 6 months ago

@kirkkwang Good news! I found tickets about indexing the TXT files. Maybe they're relevant? https://github.com/scientist-softserv/adventist_knapsack/issues/503 and https://github.com/scientist-softserv/adventist_knapsack/issues/480.

https://github.com/scientist-softserv/adventist_knapsack/issues/588 also seems related to expected search behavior. I hope you don't mind me dropping these links here--it's more a reminder for me than for you! I'm trying to track back and remember what search is supposed to do so I can compare it to what I'm seeing now...

kirkkwang commented 6 months ago

Oh great, that's good news. I didn't remember that, but thanks for the context. I did also add indexing of any PDF text, if it exists. We're currently working on this issue, so hopefully we'll come up with something soon.

kirkkwang commented 6 months ago

@KatharineV just giving you an update, we have a solution in the pipeline, just have to update gems and deploy

KatharineV commented 6 months ago

Tested the fix on ADL prod with a work uploaded on 12/11. The OCR search in the viewer is working beautifully!
