tleyden / open-ocr

Run your own OCR-as-a-Service using Tesseract and Docker
Apache License 2.0

Integrating libreoffice/pdftk for pdf, docx, etc support #39

Open jlyon opened 9 years ago

jlyon commented 9 years ago

Related to #17.

I wrote a bash script a while ago (https://github.com/jlyon/ocr-anything) that would analyze the mimetype headers on the document and:

I would like to try to integrate the efforts with open-ocr. What do you think is the best way to go about this? This is my thinking:

Does this seem right?
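For illustration, the mimetype check described above could be sketched in Python as follows. This is a minimal sketch, not code from ocr-anything or open-ocr: it inspects the leading bytes of a document against a few well-known file signatures, and the route names returned by `choose_route` are hypothetical labels made up for this example.

```python
from typing import Optional

# Well-known leading-byte signatures ("magic numbers") for a few formats.
SIGNATURES = [
    (b"%PDF-", "application/pdf"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"II*\x00", "image/tiff"),   # little-endian TIFF
    (b"MM\x00*", "image/tiff"),   # big-endian TIFF
    (b"PK\x03\x04", "application/zip"),  # docx and odt files are ZIP containers
]


def sniff_mime(head: bytes) -> Optional[str]:
    """Return a MIME type guessed from the first bytes of a file, or None."""
    for magic, mime in SIGNATURES:
        if head.startswith(magic):
            return mime
    return None


def choose_route(mime: Optional[str]) -> str:
    """Map a MIME type to a processing route.

    The route names are assumptions for this sketch only.
    """
    if mime is None:
        return "unsupported"
    if mime.startswith("image/"):
        return "ocr"               # images can go straight to the OCR engine
    if mime == "application/pdf":
        return "pdf-preprocess"    # PDFs need a preprocessing step first
    return "office-convert"        # other containers need conversion first


def route_file(path: str) -> str:
    """Read the first bytes of a file and pick a route for it."""
    with open(path, "rb") as fh:
        return choose_route(sniff_mime(fh.read(16)))
```

Checking the file contents rather than the extension means a mislabeled upload still gets routed by what it actually is.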

tleyden commented 9 years ago

Sorry about the delay, this is getting into an area I haven't played with much yet.

Overall though, this seems like a good approach.

I'd be excited to see this happen to get the preprocessor stuff a little more fleshed out and documented.

jlyon commented 9 years ago

Great. I'm heading out on vacation for a few weeks, but I'll start taking a look when I return.

AaronToledoPIDS commented 9 years ago

+1

mcantrell commented 9 years ago

+1 as well. This would be really nice to have.

ddohler commented 8 years ago

I'd be interested in seeing this too; I wrote something similar in Django a while back (https://github.com/ddohler/webocr). One minor subtlety is that some PDFs contain embedded images and text on the same page. As I recall, my solution was to rasterize the entire page and then OCR the resulting image, but there are definitely other ways to attack the problem. Next time I have a spare moment I'm planning to try replacing my home-grown Tesseract setup with open-ocr, so I'll hopefully be in a position where I can make occasional contributions.
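A rough Python sketch of the rasterize-then-OCR idea described above, for PDFs that mix embedded images and text on the same page. It is not taken from webocr or open-ocr. It assumes the `pdftoppm` (poppler-utils) and `tesseract` command-line tools are installed; `pdftoppm` is a choice made for this example and is not one of the tools named in this thread.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List


def rasterize_cmd(pdf: str, out_prefix: str, dpi: int = 300) -> List[str]:
    """Build the pdftoppm command that renders each page to <out_prefix>-<n>.png."""
    return ["pdftoppm", "-r", str(dpi), "-png", pdf, out_prefix]


def ocr_cmd(image: str) -> List[str]:
    """Build the tesseract command; 'stdout' as output base prints the text."""
    return ["tesseract", image, "stdout"]


def page_number(png: Path) -> int:
    """Extract the page number from a file named like 'page-03.png'."""
    return int(png.stem.rsplit("-", 1)[1])


def ocr_pdf(pdf: str, workdir: str) -> str:
    """Rasterize every page of a PDF, OCR each image, and join the text."""
    for tool in ("pdftoppm", "tesseract"):
        if shutil.which(tool) is None:
            raise RuntimeError(f"{tool} not found on PATH")
    prefix = str(Path(workdir) / "page")
    subprocess.run(rasterize_cmd(pdf, prefix), check=True)
    # Sort numerically so page order does not depend on zero padding.
    pages = sorted(Path(workdir).glob("page-*.png"), key=page_number)
    texts = []
    for png in pages:
        result = subprocess.run(
            ocr_cmd(str(png)), check=True, capture_output=True, text=True
        )
        texts.append(result.stdout)
    return "\n".join(texts)
```

Rasterizing the whole page discards any embedded text layer, so text that was already machine-readable gets re-recognized by OCR. That is the trade-off of treating every page uniformly.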

tleyden commented 8 years ago

@ddohler looking forward to seeing that!