Open KatharineV opened 7 months ago
Update: I can confirm that Tesseract isn't running as expected for recent uploads from the OAI issue ingest. This work has a keyword that appears in a catalog search (thanks to the attached OCR'd TXT file), but persistent highlighting fails when you open the work. Searching in the viewer also returns zero results for a word we know is present.
@KatharineV thanks for confirming. What I think is actually happening is that, because we installed the PDF.js logic, there's a process where the READER PDF's text (which was already OCR'd in the first place) gets indexed, and that's why it shows up in the catalog. I'm not sure whether the TXT file's text actually gets indexed or not. This makes me think that, as a backup plan, we could turn on PDF.js instead of the UV. However, again, the text search only works if the PDF already had text in it (like most of the READER files do).
@kirkkwang That's interesting! I thought the TXT files were indexed, because those are the highest quality (sometimes hand-edited and improved) full-text files we have for some works. They're supposed to be useful for searching...not sure why we'd upload them if they aren't indexed, so I'd love to hear for sure if they are.
Also, we have many works that only have the Archival PDF or have PDFs without either "archival" or "reader" in the file name. It wouldn't be a good idea for us to build logic that relies on the Reader versions. Is there a way that we can ensure the works without Reader PDFs get their files (whatever the format and whatever the file name) OCR'd by Tesseract and indexed by the site so keywords can show up in the catalog search and the viewer? I think this is the behavior we'd planned in the past and it was working before. I could be wrong. Thanks for looking into it.
@kirkkwang Good news! I found tickets about indexing the TXT files. Maybe they're relevant? https://github.com/scientist-softserv/adventist_knapsack/issues/503 and https://github.com/scientist-softserv/adventist_knapsack/issues/480.
https://github.com/scientist-softserv/adventist_knapsack/issues/588 also seems related to expected search behavior. I hope you don't mind me dropping these links here--it's more a reminder for me than for you! I'm trying to track back and remember what search is supposed to do so I can compare it to what I'm seeing now...
Oh great, that's good news. I didn't remember that, but thanks for the context. I did also add indexing of any PDF text, if it exists. We're currently working on this issue, so hopefully we'll come up with something soon.
@KatharineV just giving you an update: we have a solution in the pipeline, we just have to update gems and deploy.
Tested the fix on ADL prod with a work uploaded on 12/11. The OCR search in the viewer is working beautifully!
Users have reported an issue that I can also see. Although the UV has split documents, the Tesseract OCR is not searchable in the viewer or in the catalog search.
Here is a work that was uploaded several days ago, so processing is finished, but you can see that the UV search and the catalog search are not bringing up any full text results. This work has no attached TXT file of OCR results. It relies entirely on Tesseract, and that version of OCR is not working.
Screenshots show the catalog search (images 1 & 2) and then the in-viewer search (3 & 4)
1.
2.
3.
4.
![Image](https://github.com/scientist-softserv/adventist-dl/assets/118196520/c4d735dd-fae1-4dd9-9412-d21e1568da61)