Closed evantill closed 9 years ago
I don't think it's supported out of the box, but this should be fairly easy to add and is a great addition. Can you post an example configfile on a github gist?
Also, can you try to find out if it's possible to tell it to use hOCR via -c configvar=value rather than a configfile?
Btw the current api docs are here in case you didn't see the link: http://docs.openocr.apiary.io/
Thanks for the -c configvar=value workaround.
The hocr configuration file (provided with tesseract) contains:
tessedit_create_hocr 1
tessedit_pageseg_mode 1
But using config_vars does not work as expected (the output format is still text, not XML).
curl -X POST -H "Content-Type: application/json" -d '{"img_url":"http://i.imgur.com/xYAaDjV.png","engine":"tesseract","engine_args":{"psm":"3","lang":"fra"}, "config_vars": {"tessedit_create_hocr":"1","tessedit_pageseg_mode":"1"}}' http://192.168.59.103:$HTTP_PORT/ocr
{
"config_vars": {
"tessedit_create_hocr": "1",
"tessedit_pageseg_mode": "1"
},
"engine": "tesseract",
"engine_args": {
"lang": "fra",
"psm": "3"
},
"img_url": "http://i.imgur.com/xYAaDjV.png"
}
Can you try it on the command line with tesseract and let me know whether using "-c" configvars works there? It's possible that tessedit_create_hocr only works in the "configfile" argument.
If you can make it work, please post the exact command line you are using to invoke tesseract.
Both commands work (tried inside the worker docker image).
tesseract xYAaDjV.png stdout -l fra -psm 3 hocr
tesseract xYAaDjV.png stdout -l fra -psm 3 -c tessedit_create_hocr=1 -c tessedit_pageseg_mode=1
Expected result: https://gist.github.com/evantill/bb60b2f24033f3036c75
I wonder if it's related to the order of the args.
Does this work?
tesseract xYAaDjV.png stdout -c tessedit_create_hocr=1 -c tessedit_pageseg_mode=1 -l fra -psm 3
yes
Ok I will take a look soon. Thanks for the info.
On Nov 12, 2014, at 12:17 PM, Eric Vantillard notifications@github.com wrote:
yes
Btw which version of tesseract were you running?
root@2523b7d0b3ed:/tmp# tesseract --version
tesseract 3.03
leptonica-1.70
libgif 4.1.6(?) : libjpeg 8d : libpng 1.2.50 : libtiff 4.0.3 : zlib 1.2.8 : webp 0.4.0
The problem is inside the JSON (see #19).
Using the correct JSON (could you double-check it?), I get an error:
Error processing image url: . Error: open /tmp/5ba6a32a-3263-47ae-5496-7d4a7f62ed0d.txt: no such file or directory
Any idea where the .txt is coming from?
Logs:
21:51:22.367278 OCR_WORKER: got 836643 byte delivery: [3]. Routing key: decode-ocr Reply to: amq.gen-SBOcgtRuYoq5YdmZWl4j4Q
21:51:22.408596 OCR_TESSERACT: got configVarsMap: map[tessedit_create_hocr:1 tessedit_pageseg_mode:1] type: map[string]interface {}
21:51:22.408624 OCR_TESSERACT: cmdArgs: [/tmp/5ba6a32a-3263-47ae-5496-7d4a7f62ed0d /tmp/5ba6a32a-3263-47ae-5496-7d4a7f62ed0d -c tessedit_create_hocr=1 -c tessedit_pageseg_mode=1 -psm 3 -l fra]
21:51:25.163100 OCR_TESSERACT: Error getting data from out file: open /tmp/5ba6a32a-3263-47ae-5496-7d4a7f62ed0d.txt: no such file or directory
21:51:25.163489 ERROR: Error processing image url: . Error: open /tmp/5ba6a32a-3263-47ae-5496-7d4a7f62ed0d.txt: no such file or directory -- open-ocr.(*OcrRpcWorker).resultForDelivery() at ocr_rpc_worker.go:182
21:51:25.163564 ERROR: Error generating ocr result. Error: open /tmp/5ba6a32a-3263-47ae-5496-7d4a7f62ed0d.txt: no such file or directory -- open-ocr.(*OcrRpcWorker).handle() at ocr_rpc_worker.go:144
21:51:25.163662 OCR_WORKER: Sending rpc response: {Error processing image url: . Error: open /tmp/5ba6a32a-3263-47ae-5496-7d4a7f62ed0d.txt: no such file or directory}
21:51:25.163669 OCR_WORKER: sendRpcResponse to: amq.gen-SBOcgtRuYoq5YdmZWl4j4Q
21:51:25.164202 OCR_WORKER: sendRpcResponse succeeded
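The cmdArgs entry in the log shows the argument layout the worker hands to tesseract: input path, output base, one -c key=value pair per config var, then -psm and -l. Below is a minimal Go sketch of that assembly. The function name buildTesseractArgs and its signature are hypothetical and only illustrate the layout; this is not the project's actual code.

```go
package main

import (
	"fmt"
	"sort"
)

// buildTesseractArgs is a hypothetical helper (not taken from open-ocr).
// It mirrors the order seen in the cmdArgs log line:
// input path, output base, "-c key=value" pairs, then -psm and -l.
func buildTesseractArgs(inPath, outBase string, configVars map[string]string, psm, lang string) []string {
	args := []string{inPath, outBase}

	// Go map iteration order is random, so sort the keys to keep
	// the generated command line deterministic.
	keys := make([]string, 0, len(configVars))
	for k := range configVars {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		args = append(args, "-c", k+"="+configVars[k])
	}

	// Only emit the optional flags when a value was supplied.
	if psm != "" {
		args = append(args, "-psm", psm)
	}
	if lang != "" {
		args = append(args, "-l", lang)
	}
	return args
}

func main() {
	args := buildTesseractArgs("/tmp/in", "/tmp/in",
		map[string]string{"tessedit_create_hocr": "1", "tessedit_pageseg_mode": "1"},
		"3", "fra")
	fmt.Println(args)
}
```

Note that the arguments themselves look valid here; the failure in the log comes later, when the worker tries to open the .txt output file.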
What's happening is that the worker is expecting tesseract to write its output to that file, but the file isn't there, and so it writes that error.
https://github.com/tleyden/open-ocr/blob/master/tesseract_engine.go#L191
I'm not sure why tesseract isn't writing its output in this case. Digging into it.
OK I found the problem! It expects tesseract to output to a .txt file:
https://github.com/tleyden/open-ocr/blob/master/tesseract_engine.go#L168-L169
but when calling it like this:
tesseract /tmp/1562483a-30ff-46b0-4c01-feda10a1977c /tmp/1562483a-30ff-46b0-4c01-feda10a1977c -c tessedit_create_hocr=1 -c tessedit_pageseg_mode=1 -psm 3
the output will be /tmp/1562483a-30ff-46b0-4c01-feda10a1977c.hocr, not .txt
Working on a fix that will check for both file extensions. Also do you happen to know if there are any other possible file extensions besides .txt and .hocr?
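A check for both file extensions could look roughly like the Go sketch below. It is an illustration under stated assumptions, not the fix that was actually committed: the names firstExisting and fileExists are hypothetical, and the extension list covers only the two extensions mentioned in this thread.

```go
package main

import (
	"fmt"
	"os"
)

// firstExisting returns the first path base+ext for which exists reports true.
// The exists callback is injected so the lookup can be exercised without
// touching the filesystem. Hypothetical helper, not open-ocr code.
func firstExisting(base string, exts []string, exists func(string) bool) (string, bool) {
	for _, ext := range exts {
		candidate := base + ext
		if exists(candidate) {
			return candidate, true
		}
	}
	return "", false
}

// fileExists reports whether os.Stat succeeds for path.
func fileExists(path string) bool {
	_, err := os.Stat(path)
	return err == nil
}

func main() {
	base := "/tmp/1562483a-30ff-46b0-4c01-feda10a1977c"
	// Try the plain-text output first, then the hOCR output.
	if path, ok := firstExisting(base, []string{".txt", ".hocr"}, fileExists); ok {
		fmt.Println("tesseract output found at", path)
	} else {
		fmt.Println("no output file found for", base)
	}
}
```

Keeping the extensions in a slice means a further extension can be added in one place if tesseract turns out to produce others.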
OK it's been fixed and a new docker image pushed.
I haven't fully verified it yet; still in progress.
Can you try it out?
All you should need to do is make sure you have the latest image by running:
sudo docker pull tleyden5iwx/open-ocr
on your host OS.
What about using the stdout syntax and redirecting it to your temporary file? Like this:
tesseract /tmp/1562483a-30ff-46b0-4c01-feda10a1977c stdout -c tessedit_create_hocr=1 -c tessedit_pageseg_mode=1 -psm 3 > /tmp/1562483a-30ff-46b0-4c01-feda10a1977c
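From Go, the same idea can be sketched without a shell redirect by capturing the process's standard output directly, which sidesteps the question of which file extension tesseract picks. This is only a sketch of the suggestion above, not what #20 implements: stdoutArgs and runToStdout are hypothetical names, and it assumes a tesseract binary is on the PATH.

```go
package main

import (
	"fmt"
	"os/exec"
)

// stdoutArgs places the literal output base "stdout" right after the input
// path, followed by any extra flags. Hypothetical helper, not open-ocr code.
func stdoutArgs(inPath string, extra []string) []string {
	args := []string{inPath, "stdout"}
	return append(args, extra...)
}

// runToStdout runs the given binary and returns whatever it wrote to
// standard output. Output() also returns an error on a non-zero exit.
func runToStdout(bin, inPath string, extra []string) ([]byte, error) {
	return exec.Command(bin, stdoutArgs(inPath, extra)...).Output()
}

func main() {
	out, err := runToStdout("tesseract", "/tmp/1562483a-30ff-46b0-4c01-feda10a1977c",
		[]string{"-c", "tessedit_create_hocr=1", "-c", "tessedit_pageseg_mode=1", "-psm", "3"})
	if err != nil {
		fmt.Println("tesseract failed:", err)
		return
	}
	fmt.Printf("captured %d bytes of OCR output\n", len(out))
}
```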
@evantill great idea! I added #20
I was able to verify that it's working:
How do I pass the configfile parameter to the tesseract engine? See the tesseract doc.
I use the hocr config file.
hOCR is an open standard of data representation for formatted text obtained from OCR.