tleyden / open-ocr

Run your own OCR-as-a-Service using Tesseract and Docker
Apache License 2.0

feature: support hocr configfile for tesseract #18

Closed evantill closed 9 years ago

evantill commented 9 years ago

How can I pass a configfile parameter to the tesseract engine?

tesseract imagename|stdin outputbase|stdout [options…] [configfile…]

see tesseract doc

I use the hocr config file

hOCR is an open standard of data representation for formatted text obtained from OCR
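For readers who have not seen hOCR before: it is HTML in which each recognized element carries a class such as ocrx_word and a title attribute holding properties like its bounding box. A minimal Python sketch of pulling words and boxes out of such markup follows. The sample string is hand-written for illustration, not real tesseract output, and the names HocrWordParser and parse_bbox are hypothetical.

```python
from html.parser import HTMLParser

# Hand-written sample in the hOCR style (illustrative, not actual tesseract output).
SAMPLE_HOCR = (
    "<span class='ocr_line' title='bbox 36 92 190 116'>"
    "<span class='ocrx_word' title='bbox 36 92 96 116; x_wconf 90'>Bonjour</span> "
    "<span class='ocrx_word' title='bbox 104 92 190 116; x_wconf 88'>monde</span>"
    "</span>"
)


def parse_bbox(title):
    """Return (x0, y0, x1, y1) from an hOCR title attribute, or None if absent."""
    for prop in title.split(";"):
        fields = prop.split()
        if len(fields) == 5 and fields[0] == "bbox":
            return tuple(int(v) for v in fields[1:])
    return None


class HocrWordParser(HTMLParser):
    """Collect (text, bbox) pairs from span elements whose class is ocrx_word."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._in_word = False
        self._bbox = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            self._in_word = True
            self._bbox = parse_bbox(attrs.get("title") or "")
            self._text = []

    def handle_data(self, data):
        if self._in_word:
            self._text.append(data)

    def handle_endtag(self, tag):
        # Assumes word spans do not contain nested span elements.
        if tag == "span" and self._in_word:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._in_word = False


parser = HocrWordParser()
parser.feed(SAMPLE_HOCR)
```

After feeding the sample, parser.words holds one (text, bbox) tuple per word, which is the positional information plain-text output loses.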

tleyden commented 9 years ago

I don't think it's supported out of the box, but this should be fairly easy to add and is a great addition. Can you post an example configfile on a github gist?

Also, can you try to find out if it's possible to tell it to use hOCR via -c configvar=value rather than a configfile?

tleyden commented 9 years ago

Btw the current api docs are here in case you didn't see the link: http://docs.openocr.apiary.io/

evantill commented 9 years ago

Thanks for the -c configvar=value workaround.

The hocr configuration file (provided with tesseract) contains:

tessedit_create_hocr 1
tessedit_pageseg_mode 1

but using config_vars does not work as expected (the output format is still text, not XML).

curl -X POST -H "Content-Type: application/json" -d '{"img_url":"http://i.imgur.com/xYAaDjV.png","engine":"tesseract","engine_args":{"psm":"3","lang":"fra"}, "config_vars": {"tessedit_create_hocr":"1","tessedit_pageseg_mode":"1"}}' http://192.168.59.103:$HTTP_PORT/ocr
{
    "config_vars": {
        "tessedit_create_hocr": "1",
        "tessedit_pageseg_mode": "1"
    },
    "engine": "tesseract",
    "engine_args": {
        "lang": "fra",
        "psm": "3"
    },
    "img_url": "http://i.imgur.com/xYAaDjV.png"
}
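The config_vars object in this request is meant to carry the same settings as the two-line hocr config file quoted above. A small Python sketch of that mapping, assuming the file format is one "name value" pair per line and that lines starting with # are comments (the helper name parse_tesseract_config is hypothetical):

```python
def parse_tesseract_config(text):
    """Turn 'name value' lines into a dict of strings.

    Assumption: blank lines and lines starting with '#' carry no settings.
    """
    config_vars = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)  # split on the first run of whitespace
        name = parts[0]
        value = parts[1].strip() if len(parts) > 1 else ""
        config_vars[name] = value
    return config_vars


HOCR_CONFIG = "tessedit_create_hocr 1\ntessedit_pageseg_mode 1\n"
config_vars = parse_tesseract_config(HOCR_CONFIG)
```

Values stay strings because the JSON in this thread passes them as strings too.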

tleyden commented 9 years ago

Can you try it on the command line with tesseract and let me know whether using "-c" config vars works there? It's possible that tessedit_create_hocr only works in the "configfile" argument.

If you can make it work, please post the exact command line you are using to invoke tesseract.

evantill commented 9 years ago

Both commands work (tried inside the worker Docker image).

tesseract  xYAaDjV.png  stdout -l fra -psm 3 hocr

tesseract  xYAaDjV.png  stdout -l fra -psm 3 -c tessedit_create_hocr=1 -c tessedit_pageseg_mode=1

Expected result: https://gist.github.com/evantill/bb60b2f24033f3036c75

tleyden commented 9 years ago

I wonder if it's related to the order of the args.

Does this work?

tesseract xYAaDjV.png stdout -c tessedit_create_hocr=1 -c tessedit_pageseg_mode=1 -l fra -psm 3

evantill commented 9 years ago

yes
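Since the order with the -c pairs first is confirmed to work, here is a Python sketch of assembling that argument list from a dict of config vars. It is an illustration only, not the worker's actual Go code, and build_tesseract_args is a hypothetical name. It uses the single-dash -psm form shown in this thread.

```python
def build_tesseract_args(image, output_base, config_vars=None, lang=None, psm=None):
    """Return the argument list that follows the 'tesseract' executable name."""
    args = [image, output_base]
    # One "-c name=value" pair per config variable, in insertion order.
    for name, value in (config_vars or {}).items():
        args += ["-c", "{}={}".format(name, value)]
    if lang is not None:
        args += ["-l", lang]
    if psm is not None:
        args += ["-psm", str(psm)]
    return args


args = build_tesseract_args(
    "xYAaDjV.png",
    "stdout",
    config_vars={"tessedit_create_hocr": "1", "tessedit_pageseg_mode": "1"},
    lang="fra",
    psm=3,
)
```

Keeping the arguments as a list rather than one string avoids shell quoting problems when the list is later handed to a process launcher.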

tleyden commented 9 years ago

Ok I will take a look soon. Thanks for the info.

On Nov 12, 2014, at 12:17 PM, Eric Vantillard notifications@github.com wrote:

yes


tleyden commented 9 years ago

Btw which version of tesseract were you running?


evantill commented 9 years ago

root@2523b7d0b3ed:/tmp# tesseract --version
tesseract 3.03
 leptonica-1.70
  libgif 4.1.6(?) : libjpeg 8d : libpng 1.2.50 : libtiff 4.0.3 : zlib 1.2.8 : webp 0.4.0

evantill commented 9 years ago

the problem is inside the json (see #19)
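One reading of this, consistent with the worker log further down where the config vars do reach the tesseract command line, is that config_vars has to sit inside engine_args rather than at the top level of the request. That placement is an assumption in the Python sketch below; issue #19 is the authoritative reference. The helper name build_ocr_request is hypothetical.

```python
import json


def build_ocr_request(img_url, config_vars, lang=None, psm=None):
    """Build a request body with config_vars nested under engine_args.

    Assumption: the nesting is inferred from this thread, not confirmed here.
    """
    engine_args = {"config_vars": dict(config_vars)}
    if lang is not None:
        engine_args["lang"] = lang
    if psm is not None:
        engine_args["psm"] = str(psm)
    return {"img_url": img_url, "engine": "tesseract", "engine_args": engine_args}


body = build_ocr_request(
    "http://i.imgur.com/xYAaDjV.png",
    {"tessedit_create_hocr": "1", "tessedit_pageseg_mode": "1"},
    lang="fra",
    psm=3,
)
payload = json.dumps(body)  # string suitable for the -d argument of curl
```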

Using the correct JSON (could you double-check it?), I get an error:

Error processing image url: . Error: open /tmp/5ba6a32a-3263-47ae-5496-7d4a7f62ed0d.txt: no such file or directory

Any idea where the .txt is coming from?

Logs:

21:51:22.367278 OCR_WORKER: got 836643 byte delivery: [3]. Routing key: decode-ocr  Reply to: amq.gen-SBOcgtRuYoq5YdmZWl4j4Q
21:51:22.408596 OCR_TESSERACT: got configVarsMap: map[tessedit_create_hocr:1 tessedit_pageseg_mode:1] type: map[string]interface {}
21:51:22.408624 OCR_TESSERACT: cmdArgs: [/tmp/5ba6a32a-3263-47ae-5496-7d4a7f62ed0d /tmp/5ba6a32a-3263-47ae-5496-7d4a7f62ed0d -c tessedit_create_hocr=1 -c tessedit_pageseg_mode=1 -psm 3 -l fra]
21:51:25.163100 OCR_TESSERACT: Error getting data from out file: open /tmp/5ba6a32a-3263-47ae-5496-7d4a7f62ed0d.txt: no such file or directory
21:51:25.163489 ERROR: Error processing image url: .  Error: open /tmp/5ba6a32a-3263-47ae-5496-7d4a7f62ed0d.txt: no such file or directory -- open-ocr.(*OcrRpcWorker).resultForDelivery() at ocr_rpc_worker.go:182
21:51:25.163564 ERROR: Error generating ocr result.  Error: open /tmp/5ba6a32a-3263-47ae-5496-7d4a7f62ed0d.txt: no such file or directory -- open-ocr.(*OcrRpcWorker).handle() at ocr_rpc_worker.go:144
21:51:25.163662 OCR_WORKER: Sending rpc response: {Error processing image url: .  Error: open /tmp/5ba6a32a-3263-47ae-5496-7d4a7f62ed0d.txt: no such file or directory}
21:51:25.163669 OCR_WORKER: sendRpcResponse to: amq.gen-SBOcgtRuYoq5YdmZWl4j4Q
21:51:25.164202 OCR_WORKER: sendRpcResponse succeeded

tleyden commented 9 years ago

What's happening is that the worker is expecting tesseract to write its output to that file, but the file isn't there, and so it writes that error.

https://github.com/tleyden/open-ocr/blob/master/tesseract_engine.go#L191

I'm not sure why tesseract isn't writing its output in this case. Digging into it.

tleyden commented 9 years ago

OK I found the problem! It expects tesseract to output to a .txt file:

https://github.com/tleyden/open-ocr/blob/master/tesseract_engine.go#L168-L169

but when calling it like this:

tesseract /tmp/1562483a-30ff-46b0-4c01-feda10a1977c /tmp/1562483a-30ff-46b0-4c01-feda10a1977c -c tessedit_create_hocr=1 -c tessedit_pageseg_mode=1 -psm 3

the output will be /tmp/1562483a-30ff-46b0-4c01-feda10a1977c.hocr, not .txt.

tleyden commented 9 years ago

Working on a fix that will check for both file extensions. Also do you happen to know if there are any other possible file extensions besides .txt and .hocr?
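The idea of checking both extensions can be sketched in a few lines of Python. This is only an illustration of the approach, not the actual fix in tesseract_engine.go; find_tesseract_output is a hypothetical name, and the extension list contains only the two mentioned in this thread.

```python
import os


def find_tesseract_output(output_base, extensions=(".txt", ".hocr")):
    """Return the first existing file among output_base + each extension."""
    for ext in extensions:
        candidate = output_base + ext
        if os.path.exists(candidate):
            return candidate
    raise FileNotFoundError(
        "no output file found for {} with extensions {}".format(output_base, extensions)
    )
```

If other output extensions turn up, they can be passed through the extensions parameter without changing the function.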

tleyden commented 9 years ago

OK it's been fixed and a new docker image pushed.

I haven't fully verified it yet .. still in progress.

Can you try it out?

All you should need to do is make sure you have the latest image by running:

sudo docker pull tleyden5iwx/open-ocr 

on your host OS.

evantill commented 9 years ago

What about using the stdout syntax and redirecting it to your temporary file? Like this:

tesseract /tmp/1562483a-30ff-46b0-4c01-feda10a1977c stdout -c tessedit_create_hocr=1 -c tessedit_pageseg_mode=1 -psm 3 > /tmp/1562483a-30ff-46b0-4c01-feda10a1977c

tleyden commented 9 years ago

@evantill great idea! I added #20
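A rough Python sketch of the stdout approach: the caller captures whatever the process prints, so it no longer matters which file extension tesseract would have chosen. This is an illustration, not the change tracked in #20, and run_and_capture is a hypothetical name.

```python
import subprocess


def run_and_capture(cmd):
    """Run cmd (a list of arguments) and return its standard output as bytes.

    Raises subprocess.CalledProcessError if the process exits non-zero.
    """
    result = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,
    )
    return result.stdout
```

With tesseract installed, a call might look like run_and_capture(["tesseract", image_path, "stdout", "-c", "tessedit_create_hocr=1", "-psm", "3"]), where image_path is a placeholder for the temporary input file.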

tleyden commented 9 years ago

I was able to verify that it's working:

https://gist.github.com/tleyden/4bcfaff97ecf210a0de5