Make PDFs searchable

jpmckinney commented 8 years ago

e.g. using https://github.com/overview/pdfocr

A quick search on GitHub shows that there are a lot of solutions to this.

jpmckinney commented 8 years ago

You don't need much in the way of Scala to use pdfocr, but I'll grant it's more than zero because we don't provide a .jar anywhere.

There are a lot of libraries that purport to make PDFs searchable. If you follow the dependency trail, they all end up depending on Tesseract. And they're all buggy in their own ways. Most are garbage.

I built pdfocr for its interface, not its implementation. It just so happened that a Java implementation suited our needs [...].

Anyway, while making searchable PDFs may sound useful, it's not particularly interesting in 2015. Few people actually want searchable PDFs; they just want to read documents. Most hosting services slice PDFs into images; some [...] add text as a separate layer of HTML <div>s. That's a much better way of serving PDFs in 2015, because it's faster.

At the moment, I know of no better free solution than pdfocr for making a PDF searchable. I could be missing a decent project or two; I was rather Java-focused when I searched for one. There are other free scripts out there that use Tesseract+hOCR, but they're not one-off commands. Then there are tons of non-free resources that can do the trick.

-- Adam
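
For readers unfamiliar with the Tesseract route mentioned above, here is a minimal sketch of what those scripts tend to wrap. It is not how pdfocr works internally; it only assumes a Tesseract 3.03+ executable on the PATH, whose `pdf` config writes a PDF with an invisible text layer. The function names, the file names, and the assumption that the PDF page has already been rasterized to an image are all hypothetical.

```python
import shutil
import subprocess
from pathlib import Path


def tesseract_pdf_command(image_path, output_base, lang="eng"):
    """Build the Tesseract command that writes <output_base>.pdf with a text layer."""
    # Tesseract reads images, not PDFs, so the page must be rasterized beforehand.
    return ["tesseract", str(image_path), str(output_base), "-l", lang, "pdf"]


def ocr_page_to_pdf(image_path, output_base, lang="eng"):
    """Run Tesseract on one page image and return the path of the PDF it writes."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    subprocess.run(tesseract_pdf_command(image_path, output_base, lang), check=True)
    return Path(f"{output_base}.pdf")
```

Calling `ocr_page_to_pdf("page-1.png", "page-1")` handles a single page; a multi-page document would still need the per-page PDFs merged afterwards, which is part of why these scripts are not one-off commands.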

jpmckinney commented 8 years ago

Making PDFs searchable serves the use case of a person opening a PDF and searching within it. The alternatives are to have them search within the PDF on an online platform that renders it with text, or to have them search within the OCR'ed text.

jpmckinney commented 8 years ago

Personally, at the moment I only want to search across all documents, and I can read individual documents myself. I can reopen this if I develop use cases for searching within individual documents.

jpmckinney / information_request_summaries_and_responses

Make PDFs searchable #27