scambier / obsidian-text-extractor

A (companion) plugin to facilitate the extraction of text from images (OCR) and PDFs.
GNU General Public License v3.0

[BUG] Omnisearch Fails to Extract from 15% of PDFs #7

Closed. LloydThinks closed this issue 10 months ago

LloydThinks commented 1 year ago

Problem description:

Omnisearch appears to fail on many PDFs. I understand that some PDFs, particularly from decades ago, will not be able to be indexed. However, I am finding that ~15% of my PDFs are not being indexed, which concerns me.

I am well aware that this is likely not a bug with Omnisearch itself, but rather a limitation of the PDF indexing library, or something similar. However, I would like to know if there is anything I can do to understand what is going on. The Developer Console logs provide no additional information.

Your environment:

Thank you, Lloyd

scambier commented 1 year ago

(I'm in the process of moving all the text extraction stuff to https://github.com/scambier/obsidian-text-extractor)

So, the PDFs are processed by https://github.com/jrmuizel/pdf-extract, and there's indeed little information in case of an error, because the feedback given by the lib is mostly useless. There are 2 possibilities if it returns an empty text:

The solution I'd like to implement is to fall back to using OCR on failed PDFs.
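
A minimal TypeScript sketch of that fallback idea, for illustration only; it is not code from the plugin. The `extractPdfText` and `runOcr` callbacks are hypothetical stand-ins for the real extraction and OCR calls, and they are synchronous here purely for brevity (real ones would most likely be asynchronous). It assumes a "failed" PDF is one where extraction throws or returns only whitespace.

```typescript
// Hypothetical signature: takes a file path, returns the extracted text.
type Extractor = (path: string) => string;

interface ExtractionResult {
  text: string;
  // Which path produced the text, so callers can log or cache accordingly.
  source: "pdf-text" | "ocr";
}

function extractWithOcrFallback(
  path: string,
  extractPdfText: Extractor, // assumed wrapper around the PDF text extraction
  runOcr: Extractor,         // assumed wrapper around an OCR engine
): ExtractionResult {
  let text = "";
  try {
    text = extractPdfText(path);
  } catch {
    // Treat an extraction error the same as an empty result.
    text = "";
  }

  // Non-blank text means regular extraction worked; keep it.
  if (text.trim().length > 0) {
    return { text, source: "pdf-text" };
  }

  // Otherwise fall back to OCR on the same file.
  return { text: runOcr(path), source: "ocr" };
}
```

Recording `source` alongside the text would make it easy to see afterwards how many files needed the OCR path.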

LloydThinks commented 1 year ago

Ah okay. A limitation for now I will learn to live with. Better to index 85% of the files instead of 0% of the files.

Thank you for the great plugin!

gwillcox-r7 commented 1 year ago

@scambier Just to confirm: would this mean that if a PDF were made up mostly of images, the plugin would be able to parse the text from those images? I'm not sure whether this ties into this request or should be a separate feature request; if the latter, let me know and I can remove this comment and raise a separate one.

Edit: Apparently this can get confusing. If a PDF contains images but OCR has been applied to it with a separate tool or process, such as Adobe, to turn the images into text, this tool (and in turn Omnisearch) finds that text fine. However, let's say the PDF also contains an image with the text "XLS" in it, and you can't open a PDF viewer and highlight that text (as you could with the other text). In that case, the text can't be extracted from the image.

scambier commented 1 year ago

Image files (.png, .jpg) go through OCR, and PDFs go through the text extraction (no OCR involved at all). Image data in PDFs is completely ignored. I've had conflicting reports on "pre-OCRed PDFs", I guess it depends on the tool and the OS.
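
The routing described above could be sketched roughly as follows in TypeScript. This is an illustration, not the plugin's actual code: the `routeFor` name and the `Route` type are made up, and the case-insensitive extension match is an assumption.

```typescript
// Which pipeline a file would go through, per the description above.
type Route = "ocr" | "pdf-text" | "unsupported";

function routeFor(filename: string): Route {
  const dot = filename.lastIndexOf(".");
  // Assumption: extensions are compared case-insensitively.
  const ext = dot >= 0 ? filename.slice(dot + 1).toLowerCase() : "";

  // Image files go through OCR.
  if (ext === "png" || ext === "jpg") return "ocr";

  // PDFs go through text extraction only; embedded images are ignored.
  if (ext === "pdf") return "pdf-text";

  return "unsupported";
}
```

Under this split, a scanned PDF with no text layer would be routed to `pdf-text` and come back empty, which matches the behaviour reported in this thread.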

L1Z3 commented 1 year ago

I mentioned this in another issue, but I feel like I should bring it up here too since it might help someone. I have found that for numerous PDFs that Text Extractor fails to extract, it is possible to get it to consistently extract text using the following workaround:

  1. Load the PDF into Xournal++
  2. Immediately export the PDF (File -> Export as PDF).
  3. Load the modified PDF into Obsidian
  4. "Right click -> Clear cache for this file" on the exported PDF
  5. "Right click -> Extract text to clipboard" on the exported PDF

In my testing, this consistently fixes the issue. It is currently a fairly manual process, but I imagine it could be scripted and/or introduced into Text Extractor or a companion plugin if we can reproduce Xournal++'s PDF export behavior (and confirm that this fixes the issue more broadly, not just for the problem PDFs that I tested).

scambier commented 10 months ago

Issues related to PDF extraction are centralized here: https://github.com/scambier/obsidian-text-extractor/discussions/21