Convert pdf preprocessor

xf0e commented 5 years ago

The tesseract engine can now handle PDF files. This is achieved by a new ConvertPdf preprocessor.

Usage:

The preprocessor binary should be started with "-preprocessor convert-pdf"; afterwards it can be tested with:
- curl -X POST -H "Content-Type: application/json" -d '{"img_url":"http://localhost:8000/test.pdf","engine":"tesseract", "preprocessors":["convert-pdf"]}' http://localhost:8080/ocr
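For anyone who would rather script the test than use curl, here is a minimal Python sketch of the same request. The JSON field names and both URLs are taken from the curl command above; the function names `build_ocr_request` and `post_ocr_request` and the 120-second timeout are my own assumptions for illustration, not part of open-ocr.

```python
import json
import urllib.request


def build_ocr_request(pdf_url):
    """Return the same JSON body that the curl command above sends."""
    return {
        "img_url": pdf_url,
        "engine": "tesseract",
        "preprocessors": ["convert-pdf"],
    }


def post_ocr_request(endpoint, payload, timeout=120):
    """POST the payload as JSON and return the response body as text.

    The timeout value is an arbitrary assumption, not an open-ocr setting.
    """
    request = urllib.request.Request(
        endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Example call (needs a running open-ocr HTTP server and a reachable PDF):
# text = post_ocr_request(
#     "http://localhost:8080/ocr",
#     build_ocr_request("http://localhost:8000/test.pdf"),
# )
```

Keeping the payload builder separate from the HTTP call makes it easy to check the request body without a running server.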

Internally we call gs (Ghostscript) to create a multi-page TIFF from the input. ImageMagick won't work for this purpose because it creates a single-page image file, which tesseract can't handle, e.g.:

Tesseract Open Source OCR Engine v4.0.0 with Leptonica
Page 1
Image too large: (2480, 77176)
Error during processing.
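To make the Ghostscript step concrete, here is a rough Python sketch of a PDF to multi-page TIFF conversion. The exact flags used by the preprocessor are not shown in this thread, so the device (`tiff24nc`), the 300 dpi resolution, and the function names below are assumptions. What the sketch relies on is that Ghostscript's TIFF devices write all pages into one file when the output name contains no `%d` page placeholder.

```python
import subprocess


def ghostscript_tiff_command(pdf_path, tiff_path, dpi=300, device="tiff24nc"):
    """Build a gs argument list that renders every PDF page into one TIFF.

    The device and dpi defaults are illustrative assumptions, not the
    values used by the ConvertPdf preprocessor.
    """
    return [
        "gs",
        "-dQUIET",                    # suppress startup messages
        "-dNOPAUSE",                  # do not prompt between pages
        "-dBATCH",                    # exit after the last input file
        f"-sDEVICE={device}",         # TIFF output device
        f"-r{dpi}",                   # rendering resolution in dpi
        f"-sOutputFile={tiff_path}",  # no %d, so pages share one file
        pdf_path,
    ]


def convert_pdf_to_tiff(pdf_path, tiff_path):
    """Run Ghostscript; raises CalledProcessError if gs exits non-zero."""
    subprocess.run(ghostscript_tiff_command(pdf_path, tiff_path), check=True)
```

Passing the arguments as a list rather than a shell string avoids quoting problems with unusual file names.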

Regards!

tleyden commented 5 years ago

Thanks for the contribution! I verified that it builds locally, and triggered new docker images on dockerhub. (still processing)

OSevangelist commented 5 years ago

Hi guys, great work! I tried this feature, but even for very small PDFs (e.g. 2 pages) I got

Unable to perform OCR decode. Error: Timeout waiting for RPC response

Any ideas why this happens? I use tesseract3 inside the containers.

tleyden commented 5 years ago

Any logs on the containers? I'm guessing it failed with some sort of error that didn't get propagated back.

darmanovic commented 5 years ago

@tleyden

I have the same issue. Logs:

OCR_HTTP: serveHttp called
OCR_CLIENT: dialing "amqp://admin:Phaish9ohbaidei6oole@rabbitmq/"
OCR_CLIENT: callbackQueue name: amq.gen-Y6bVZfgmdLjnzjZrj5_gsQ
OCR_CLIENT: looping over deliveries..
OCR_CLIENT: ocrRequest before: ImgUrl: , EngineType: ENGINE_TESSERACT, Preprocessors: [convert-pdf]
OCR_CLIENT: publishing with routing key "convert-pdf"
OCR_CLIENT: ocrRequest after: ImgUrl: , EngineType: ENGINE_TESSERACT, Preprocessors: []
ERROR: Timeout waiting for RPC response -- open-ocr.HandleOcrRequest() at ocr_http_handler.go:80
ERROR: Unable to perform OCR decode. Error: Timeout waiting for RPC response -- open-ocr.(*OcrHttpHandler).ServeHTTP() at ocr_http_handler.go:40

tleyden commented 5 years ago

Can you get logs on the worker container? Or maybe there isn't one running, which would explain the timeout.

What does docker ps return?

darmanovic commented 5 years ago

Worker container log is:

22:14:11.302272 OCR_WORKER: Creating new OCR Worker
22:14:11.302392 OCR_WORKER: Run() called...
22:14:11.302409 OCR_WORKER: dialing "amqp://admin:Phaish9ohbaidei6oole@rabbitmq/"
22:14:11.320177 OCR_WORKER: got Connection, getting Channel
22:14:11.322389 OCR_WORKER: binding to: decode-ocr
22:14:11.323148 OCR_WORKER: Queue bound to Exchange, starting Consume (consumer tag "foo")

I have 4 containers running; docker ps outputs (some columns removed for clarity):

b0055fbecbde    tleyden5iwx/open-ocr-2              docker-compose_openocr_1
b8be2302936c    tleyden5iwx/open-ocr-preprocessor   docker-compose_strokewidthtransform_1
ae51cccc7094    tleyden5iwx/open-ocr-2              docker-compose_openocrworker_1
9904e5507ac7    rabbitmq:3.6.5-management           docker-compose_rabbitmq_1

Line

command: "/opt/open-ocr/open-ocr-preprocessor -amqp_uri amqp://admin:Phaish9ohbaidei6oole@rabbitmq/ -preprocessor stroke-width-transform"

of docker-compose.yml should be changed to:

command: "/opt/open-ocr/open-ocr-preprocessor -amqp_uri amqp://admin:Phaish9ohbaidei6oole@rabbitmq/ **-preprocessor convert-pdf"

if I am right?
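The reasoning here can be written down as a small check. This is only a sketch under an assumption drawn from the logs above: the request is published with routing key "convert-pdf", and a preprocessor worker is assumed to consume only the key named by its `-preprocessor` flag. I have not confirmed that against the open-ocr source. The function names are hypothetical and the password in the example is a placeholder.

```python
import shlex


def preprocessor_from_command(command):
    """Return the value following -preprocessor in a command line, or None."""
    args = shlex.split(command)
    for index, arg in enumerate(args[:-1]):
        if arg == "-preprocessor":
            return args[index + 1]
    return None


def unserved_preprocessors(requested, worker_commands):
    """List requested preprocessors that no worker command serves.

    Assumes one routing key per preprocessor worker, named by its flag.
    """
    served = {preprocessor_from_command(cmd) for cmd in worker_commands}
    return [name for name in requested if name not in served]


# Example with a placeholder password instead of the real one:
# unserved_preprocessors(
#     ["convert-pdf"],
#     ["/opt/open-ocr/open-ocr-preprocessor -amqp_uri amqp://admin:secret@rabbitmq/"
#      " -preprocessor stroke-width-transform"],
# )
```

If the list comes back non-empty, nothing would pick up the message for that preprocessor, which would be consistent with the RPC timeout in the logs.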

xf0e commented 5 years ago

Hello darmanovic, sorry, I edited the first post. The preprocessor args should be "-preprocessor convert-pdf" and should not contain "**". The stars are just typos.

darmanovic commented 5 years ago

I suspected that the stars were typos, but when I remove them, the container won't run at all.

LINE:

    command: "/opt/open-ocr/open-ocr-preprocessor -amqp_uri amqp://admin:Phaish9ohbaidei6oole@rabbitmq/ -preprocessor convert-pdf"

LOG:

15:52:17.985590 PREPROCESSOR_WORKER: Creating new Preprocessor Worker
15:52:17.986118 PANIC: Could not create rpc worker: No preprocessor found for: "convert-pdf" -- main.main() at main.go:47
panic: Could not create rpc worker: No preprocessor found for: "convert-pdf"
2019-02-28T15:52:17.990229700Z 
goroutine 1 [running]:
runtime.panic(0x627e80, 0xc210042940)
/usr/lib/go/src/pkg/runtime/panic.c:266 +0xb6
github.com/couchbaselabs/logg.LogPanic(0x7374d0, 0x1f, 0x7efe16e9ae78, 0x1, 0x1)
/opt/go/src/github.com/couchbaselabs/logg/logg.go:136 +0xec
main.main()
/opt/go/src/github.com/tleyden/open-ocr/cli-preprocessor/main.go:47 +0x200

bplukasz commented 5 years ago

Same error as @darmanovic. Has anyone solved it?

nevvermind commented 5 years ago

Hi, all. Please have a look at https://github.com/tleyden/open-ocr/issues/117 for a follow-up on this error.

tleyden / open-ocr

Convert pdf preprocessor #108