LloydThinks closed this issue 10 months ago
(I'm in the process of moving all the text extraction stuff to https://github.com/scambier/obsidian-text-extractor)
So, the PDFs are processed by https://github.com/jrmuizel/pdf-extract, and there's indeed little information in case of an error, because the feedback given by the lib is mostly useless. There are 2 possibilities if it returns empty text:
The solution I'd like to implement is to fall back to OCR for PDFs where text extraction fails.
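For illustration, the fallback idea could be sketched roughly like this. This is only a sketch under assumptions: `extractText` and `runOcr` are hypothetical callbacks standing in for the PDF text extractor and an OCR engine (they are not the real API of pdf-extract or of the plugin), and everything is synchronous for brevity, whereas a real implementation would most likely be async.

```typescript
// Hypothetical extractor signature; not the real pdf-extract or plugin API.
type Extractor = (pdf: Uint8Array) => string;

interface ExtractionResult {
  text: string;
  // Which step produced the text, so callers can log or report it.
  source: "text-layer" | "ocr" | "none";
}

// Run an extractor and treat any thrown error the same as "no text".
function safeRun(extractor: Extractor, pdf: Uint8Array): string {
  try {
    return extractor(pdf);
  } catch {
    return "";
  }
}

// Try the regular text extraction first; if it yields nothing
// (empty or whitespace only), fall back to OCR.
function extractWithOcrFallback(
  pdf: Uint8Array,
  extractText: Extractor,
  runOcr: Extractor,
): ExtractionResult {
  const text = safeRun(extractText, pdf);
  if (text.trim().length > 0) {
    return { text, source: "text-layer" };
  }

  const ocrText = safeRun(runOcr, pdf);
  if (ocrText.trim().length > 0) {
    return { text: ocrText, source: "ocr" };
  }

  return { text: "", source: "none" };
}
```

Returning the `source` alongside the text would also make it possible to tell the user which files needed OCR and which produced nothing at all.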
Ah okay. A limitation for now I will learn to live with. Better to index 85% of the files instead of 0% of the files.
Thank you for the great plugin!
@scambier Just to confirm: would this mean that if a PDF were made up mostly of images, it would be able to parse the text from those images? I'm not sure whether this ties into this request or should be a separate feature request; if so, let me know and I can remove this comment and raise a separate one.
Edit: So apparently this can get confusing. With PDFs that contain images but have had OCR applied by a separate tool or process such as Adobe to turn the images into text, this tool, and in turn Omnisearch, finds the text fine. However, let's say you have that plus an image with the words "XLS" in it, and you can't open a PDF viewer and highlight that text (like you could with the other text). In this case you couldn't extract the text from these images.
Image files (.png, .jpg) go through OCR, and PDFs go through the text extraction (no OCR involved at all). Image data in PDFs is completely ignored. I've had conflicting reports on "pre-OCRed PDFs", I guess it depends on the tool and the OS.
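To make the routing described above concrete, here is a minimal sketch. The function name `routeByExtension`, the route labels, and the case-insensitive extension handling are assumptions made for illustration; they are not taken from the plugin's source.

```typescript
// Hypothetical route labels, chosen for this illustration only.
type Route = "ocr" | "pdf-text-extraction" | "unsupported";

// Decide how a file would be processed, based only on its extension.
// Images go through OCR; PDFs go through text extraction with no OCR,
// so any image data embedded in a PDF is never looked at.
function routeByExtension(path: string): Route {
  const dot = path.lastIndexOf(".");
  // Assumption: extensions are compared case-insensitively.
  const ext = dot >= 0 ? path.slice(dot + 1).toLowerCase() : "";

  if (ext === "png" || ext === "jpg") return "ocr";
  if (ext === "pdf") return "pdf-text-extraction";
  return "unsupported";
}
```

Under this model, a scanned PDF with no text layer takes the `pdf-text-extraction` route and comes back empty, which matches the behavior reported in this thread.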
I mentioned this in another issue, but I feel like I should bring it up here too since it might help someone. I have found that for numerous PDFs that Text Extractor fails to extract, it is possible to get it to consistently extract text using the following workaround:
In my testing, this consistently fixes the issue. It is currently a fairly manual process, but I imagine it could be scripted and/or introduced into Text Extractor or a companion plugin if we can reproduce Xournal++'s PDF export behavior (and confirm that this fixes the issue more broadly, not just for the problem PDFs that I tested).
Issues related to PDF extraction are centralized here: https://github.com/scambier/obsidian-text-extractor/discussions/21
Problem description:
Omnisearch appears to fail on many PDFs. I understand that some PDFs, particularly from decades ago, will not be able to be indexed. However, I am finding that ~15% of my PDFs are not being indexed, which concerns me.
I am well aware that this is likely not a bug with Omnisearch itself, but rather a limitation of the PDF indexing library, or something similar. However, I would like to know if there is anything I can do to understand what is going on. The Developer Console logs provide no additional information.
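Since the console gives no detail, one way to at least see which files are affected would be to summarize the extraction results yourself. The sketch below assumes you already have a mapping from file path to extracted text; how to obtain that mapping is not shown, and `summarizeExtraction` is a hypothetical helper, not part of Omnisearch.

```typescript
interface ExtractionReport {
  total: number;
  // Paths whose extracted text was empty or whitespace only.
  empty: string[];
  emptyPercent: number;
}

// Summarize which files produced no text.
// `results` maps a file path to the text extracted from it
// (an assumed input shape; the plugin does not expose this directly).
function summarizeExtraction(results: Record<string, string>): ExtractionReport {
  const paths = Object.keys(results).sort();
  const empty = paths.filter((path) => results[path].trim().length === 0);
  const total = paths.length;
  const emptyPercent = total === 0 ? 0 : (empty.length / total) * 100;
  return { total, empty, emptyPercent };
}
```

A list of the empty paths makes it easier to check whether the failing PDFs share something in common, such as being scans or coming from the same tool.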
Your environment:
689 files total
before indexing.

Thank you, Lloyd