steelThread / mimeograph

CoffeeScript lib for PDF OCR and text extraction
http://steelthread.github.com/mimeograph/

Make PDFs internally "Searchable" #9

Closed by morologous 13 years ago

morologous commented 13 years ago

OCRing and extracting text may not be enough to satisfy requirements.

Consider using pdfocr (which depends on ocropus) to make pdfs internally searchable.

https://launchpad.net/~gezakovacs/+archive/pdfocr

morologous commented 13 years ago

BINGO

http://www.exactcode.de/site/open_source/exactimage/hocr2pdf/

morologous commented 13 years ago

So, it looks like we have two options. We can use pdfocr, a Ruby script that performs the whole deal for us, or we can use the component parts of pdfocr (cuneiform, hocr2pdf, etc.) to process the page images we have already broken up.

With the second option, we could use cuneiform to produce hocr output for each page image, combine that with the image to create a single-page PDF, and then stitch those pages back together to form a complete searchable PDF.

steelThread commented 13 years ago

Ok, I think I have a good idea about how to go about adding the text behind. Tesseract v3.00 supports hocr, so we can use it to generate both the plain OCR text and the hocr markup. The only new CLI utils we will need are exact-image's hocr2pdf and pdftk.
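As a rough illustration of that pipeline (not the mimeograph implementation, which is CoffeeScript and was still undecided at this point), here is a minimal Python sketch that builds the three command lines per page. The function names, the working directory layout, and the file names are hypothetical. It assumes tesseract 3.0x, which writes `<base>.html` when given the `hocr` config (later releases write `<base>.hocr`), and that hocr2pdf reads the hocr markup on stdin.

```python
import subprocess
from pathlib import Path


def page_commands(image, workdir):
    """Build the commands that turn one page image into a one-page searchable PDF.

    Returns (ocr_cmd, overlay_cmd, hocr_path, pdf_path). Nothing is executed here.
    """
    image = Path(image)
    base = Path(workdir) / image.stem
    # Assumption: tesseract 3.0x names its hocr output <base>.html.
    hocr_path = Path(str(base) + ".html")
    pdf_path = Path(str(base) + ".pdf")
    # "hocr" is the tesseract config that switches on hocr output.
    ocr_cmd = ["tesseract", str(image), str(base), "hocr"]
    # hocr2pdf takes the image with -i and the target PDF with -o;
    # the hocr markup itself is fed in on stdin.
    overlay_cmd = ["hocr2pdf", "-i", str(image), "-o", str(pdf_path)]
    return ocr_cmd, overlay_cmd, hocr_path, pdf_path


def stitch_command(page_pdfs, output):
    """Build the pdftk command that concatenates the one-page PDFs in order."""
    return ["pdftk", *[str(p) for p in page_pdfs], "cat", "output", str(output)]


def build_searchable_pdf(images, workdir, output):
    """Run OCR, text-behind overlay, and stitching for a list of page images."""
    Path(workdir).mkdir(parents=True, exist_ok=True)
    page_pdfs = []
    for image in images:
        ocr_cmd, overlay_cmd, hocr_path, pdf_path = page_commands(image, workdir)
        subprocess.run(ocr_cmd, check=True)
        with open(hocr_path, "rb") as hocr:
            subprocess.run(overlay_cmd, stdin=hocr, check=True)
        page_pdfs.append(pdf_path)
    subprocess.run(stitch_command(page_pdfs, output), check=True)
```

Calling `build_searchable_pdf(["page1.tif", "page2.tif"], "work", "searchable.pdf")` would need tesseract, hocr2pdf, and pdftk on the PATH. Keeping the command builders separate from the runner makes it easy to inspect the command lines without running any of the tools.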

Still mulling over the design. There are a couple of approaches and I want to really think through them before I start hacking.

steelThread commented 13 years ago

Ok, spent the last 2 days trying to get various things running on my Mac. What I've discovered is that the hocr support in tesseract 3.00 isn't working too well as the text behind solution that feeds hocr2pdf. It has problems with some fonts, and I'm getting a consistent seg fault for one of the pages in my sample PDF. Apparently there is some additional font support in 3.01. Going to try that next.

I also installed and tried cuneiform. Unfortunately the version MacPorts installs gives a consistent seg fault, which appears to be a known issue. I may try to build the svn trunk and see what happens.

All in all things haven't been going too well on the text behind front :(

steelThread commented 13 years ago

Tesseract 3.01 is much improved over 3.00 as far as hocr is concerned. I'm getting really good text behind results and no seg faults. Going to continue down this path for now.