scambier / obsidian-text-extractor

A (companion) plugin to facilitate the extraction of text from images (OCR) and PDFs.
GNU General Public License v3.0
346 stars 19 forks

[Feature request] Search existing PDFs with their own embedded OCR data #49

Closed figadore closed 11 months ago

figadore commented 11 months ago

Is your feature request related to a problem? Please describe. I already have tons of PDFs that seem to have OCR data built in: I can search within the PDF, and matching text is highlighted and can be copied to the clipboard. I am not able to search using this existing data in Obsidian. With Text Extractor, many PDFs do not get OCRed at all or get insufficient OCR results.

Describe the solution you'd like Use existing embedded OCR data when available

Describe alternatives you've considered None so far

Additional context Some existing data is from Doxie, some is from a Canon printer/scanner

Caveats Maybe I'm misunderstanding the fundamentals of how OCR works in my existing documents, or some other aspect of Omnisearch.

scambier commented 11 months ago

See https://github.com/scambier/obsidian-text-extractor/discussions/21

figadore commented 10 months ago

See #21

My understanding of #21 and #7 is that they are mostly about OCR of images, extracting the main(?) text from a PDF, or falling back on performing OCR on a PDF again. On the other hand, it sounds like pdf-extract may be the component that is expected to extract the pre-OCRed text and help make it searchable?

scambier commented 10 months ago

Text Extractor does not perform OCR on PDFs; it just uses the pdf-extract library to (try to) extract existing text data that might be present. The problem is that this library doesn't work very well and often fails, even on PDFs that contain clean or OCRed text.
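
For anyone wanting to check whether a given PDF actually carries an embedded text layer before blaming the extraction step, here is a rough diagnostic sketch in Python. It is not what pdf-extract or Text Extractor does internally; it is only a heuristic that scans the PDF's content streams for the standard text-showing operators (`Tj` / `TJ`). The function names `iter_stream_payloads` and `has_text_layer` are made up for this example, and the check assumes streams are either uncompressed or Flate-compressed, so it can miss text in other encodings or misfire on binary data.

```python
import re
import zlib

# Bytes between the "stream" and "endstream" keywords of a PDF object.
# The lookbehind avoids matching the tail of a preceding "endstream".
STREAM_RE = re.compile(rb"(?<!end)stream\r?\n(.*?)\r?\nendstream", re.DOTALL)

# Text-showing operators: "(literal) Tj", "<hex> Tj", or "[array] TJ".
TEXT_OP_RE = re.compile(rb"[)>]\s*Tj|\]\s*TJ")


def iter_stream_payloads(pdf_bytes: bytes):
    """Yield each stream's content, inflated when it is Flate-compressed.

    Hypothetical helper: assumes at most a single FlateDecode filter.
    """
    for match in STREAM_RE.finditer(pdf_bytes):
        raw = match.group(1)
        try:
            # decompressobj tolerates a clipped trailing checksum byte.
            yield zlib.decompressobj().decompress(raw)
        except zlib.error:
            # Not zlib data (e.g. an image or an uncompressed stream).
            yield raw


def has_text_layer(pdf_bytes: bytes) -> bool:
    """Heuristic: True if any stream contains a text-showing operator."""
    return any(TEXT_OP_RE.search(p) for p in iter_stream_payloads(pdf_bytes))
```

Usage would be along the lines of `has_text_layer(open("scan.pdf", "rb").read())`, where `scan.pdf` is a placeholder path. If this reports a text layer but Obsidian search still finds nothing, that suggests the failure is in extraction rather than in the PDF itself.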