Integrating libreoffice/pdftk for pdf, docx, etc support

tleyden / open-ocr

Run your own OCR-as-a-Service using Tesseract and Docker

Apache License 2.0


Integrating libreoffice/pdftk for pdf, docx, etc support #39

Open jlyon opened 9 years ago

jlyon commented 9 years ago

Related to #17.

I wrote a bash script a while ago (https://github.com/jlyon/ocr-anything) that analyzes the document's mimetype headers and then:

For images, run tesseract
For image pdfs, split each page out into an image and run tesseract
For pdfs with embedded text, extract the text with pdftk
For Office documents (docx, Excel, etc.), extract the text with LibreOffice (run headless)
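
To make the four cases above concrete, here is a minimal sketch of the dispatch step in Python. It is not taken from the ocr-anything script: the handler names and the `pdf_has_text` flag are hypothetical, and deciding whether a PDF actually has embedded text is left to the caller.

```python
# Hypothetical route names; they only label the four cases listed above.
OFFICE_MIME_TYPES = {
    "application/msword",
    "application/vnd.ms-excel",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
}


def choose_handler(mime_type, pdf_has_text=False):
    """Return the name of the extraction route for a given mimetype."""
    if mime_type.startswith("image/"):
        return "tesseract"
    if mime_type == "application/pdf":
        # Telling a scanned PDF from a text PDF needs a look inside the file;
        # this sketch assumes the caller has already worked that out.
        return "pdftk-text" if pdf_has_text else "split-pages-then-tesseract"
    if mime_type in OFFICE_MIME_TYPES:
        return "libreoffice-headless"
    return "unsupported"
```

Anything that falls outside these cases is reported as `"unsupported"` rather than guessed at.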

I would like to try to integrate the efforts with open-ocr. What do you think is the best way to go about this? This is my thinking:

Mimetype analyzer preprocessor: analyzes the mimetype, sets additional preprocessors and the engine
PDF splitter preprocessor: queues up multiple images as their own engine task (for multipage scanned pdfs)
PDF engine: extracts embedded pdf text
LibreOffice engine: for docx, etc files
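
A rough sketch of how these pieces might fit together, written in Python only for illustration (open-ocr itself is not Python). The `Task` type, the engine names `pdf-text` and `libreoffice`, and the `page_count` and `pdf_has_text` parameters are all assumptions for this example, not existing open-ocr API. The main point is the fan-out: a scanned multipage PDF becomes one engine task per page.

```python
from dataclasses import dataclass
from typing import List, Optional

OFFICE_PREFIXES = (
    "application/msword",
    "application/vnd.ms-excel",
    "application/vnd.openxmlformats-officedocument",
)


@dataclass
class Task:
    engine: str                 # hypothetical engine name
    source: str                 # path or URL of the document
    page: Optional[int] = None  # set only for pages split out of a PDF


def plan_tasks(source, mime_type, page_count=1, pdf_has_text=False) -> List[Task]:
    """Mimetype analyzer: pick the engine and, if needed, split into page tasks."""
    if mime_type.startswith("image/"):
        return [Task("tesseract", source)]
    if mime_type == "application/pdf":
        if pdf_has_text:
            return [Task("pdf-text", source)]
        # PDF splitter: each scanned page is queued as its own engine task.
        return [Task("tesseract", source, page=n) for n in range(1, page_count + 1)]
    if mime_type.startswith(OFFICE_PREFIXES):
        return [Task("libreoffice", source)]
    raise ValueError(f"unsupported mimetype: {mime_type}")
```

For example, `plan_tasks("scan.pdf", "application/pdf", page_count=3)` yields three `tesseract` tasks for pages 1 to 3, while a PDF flagged as having embedded text yields a single `pdf-text` task.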

Does this seem right?

tleyden commented 9 years ago

Sorry about the delay; this is getting into an area I haven't played with much yet.

Overall though, this seems like a good approach.

I'd be excited to see this happen to get the preprocessor stuff a little more fleshed out and documented.

jlyon commented 9 years ago

Great. I'm heading out on vacation for a few weeks, but I'll start taking a look when I return.

AaronToledoPIDS commented 9 years ago

mcantrell commented 9 years ago

+1 as well. This would be really nice to have.

ddohler commented 8 years ago

I'd be interested in seeing this too; I wrote something similar in Django a while back (https://github.com/ddohler/webocr). One minor subtlety is that some PDFs contain embedded images and text on the same page. As I recall, my solution was to rasterize the entire page and then OCR the resulting image, but there are definitely other ways to attack the problem. Next time I have a spare moment I'm planning to try replacing my home-grown Tesseract setup with open-ocr, so I'll hopefully be in a position where I can make occasional contributions.
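
For reference, the rasterize-then-OCR approach for mixed pages could look roughly like the sketch below. This is not the webocr code: it assumes poppler's `pdftoppm` and the `tesseract` CLI are installed, and the function names, output prefix, and 300 DPI default are illustrative choices.

```python
import subprocess


def rasterize_command(pdf_path, page, out_prefix, dpi=300):
    """Build a pdftoppm call that renders one page to <out_prefix>.png."""
    return [
        "pdftoppm",
        "-f", str(page), "-l", str(page),  # first and last page: just this one
        "-r", str(dpi),                    # resolution in DPI (assumed default)
        "-png", "-singlefile",             # one PNG, no page-number suffix
        pdf_path, out_prefix,
    ]


def ocr_command(image_path):
    """Build a tesseract call that writes the recognized text to stdout."""
    return ["tesseract", image_path, "stdout"]


def ocr_pdf_page(pdf_path, page, out_prefix="page"):
    """Rasterize the whole page (embedded text and images alike), then OCR it."""
    subprocess.run(rasterize_command(pdf_path, page, out_prefix), check=True)
    result = subprocess.run(
        ocr_command(out_prefix + ".png"),
        check=True, capture_output=True, text=True,
    )
    return result.stdout
```

The trade-off is that any embedded text is discarded and re-recognized from the rendered image, which is simpler but may be less accurate than extracting that text directly.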

tleyden commented 8 years ago

@ddohler looking forward to seeing that!