tleyden / open-ocr

Run your own OCR-as-a-Service using Tesseract and Docker
Apache License 2.0

psm argument doesn't work with Tesseract 4.0 #110

Open serhii-eleks opened 5 years ago

serhii-eleks commented 5 years ago

Hello,

I'm trying to launch your environment with the tleyden5iwx/open-ocr-2 image, which should contain Tesseract 4.0. It looks like decoding an image/PDF with the psm argument doesn't work.

Request Body: { "img_url": "http://bit.ly/ocrimage", "engine": "tesseract", "engine_args": { "config_vars": { "tessedit_char_whitelist": "0123456789" }, "psm": "3" } }

Response: Error processing image url: . Error: exit status 1

In Tesseract 3.* the psm argument takes a single dash ("-psm"); in Tesseract 4.0 it takes two ("--psm"). I think this is the main issue.

By the way, could you add an argument that lets me control the output format, not only raw text? I would like to receive the text in *.hocr format too, and in any other supported format. I would really appreciate this feature!

Thanks!

serhii-eleks commented 5 years ago

I found a config argument, "tessedit_create_hocr": "1", that makes the service return data in hOCR format.

tleyden commented 5 years ago

Do the docs need to be improved?

serhii-eleks commented 5 years ago

Yes, it would be great to improve the docs on the available output formats. But the main issue is that the system cannot execute tesseract with the "--psm" parameter.

serhii-eleks commented 5 years ago

Hello @tleyden

Do you have any updates?

Thanks.

mirko0x5f commented 5 years ago

Hello! We're having the exact same problem. We would like to launch tesseract with the psm:3 parameter, but it fails with Tesseract 4.0.

mirko0x5f commented 5 years ago

The problem seems to be in this line: https://github.com/tleyden/open-ocr/blob/1cd43c1659c42dd65487559e9f055436c25b0e06/tesseract_engine.go#L87. We managed to fix it, for Tesseract 4.0 only, by changing it to result = append(result, "--psm"). To keep the change backward compatible, the code would probably need to switch between the two cases.
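One way the backward-compatible switch could look is sketched below. This is not the project's code: psmFlagFor and detectPsmFlag are hypothetical helpers, and the sketch assumes that the output of `tesseract --version` contains a line such as "tesseract 4.0.0-beta.1". It falls back to the old "-psm" spelling whenever the version cannot be determined.

```go
package main

import (
	"fmt"
	"os/exec"
	"regexp"
	"strconv"
)

// versionRe captures the major version from text such as
// "tesseract 4.0.0-beta.1" or "tesseract v5.0.0".
var versionRe = regexp.MustCompile(`(?i)tesseract\s+v?(\d+)\.`)

// psmFlagFor returns the page-segmentation-mode flag to use for the given
// `tesseract --version` output: "--psm" for major version 4 and later,
// "-psm" otherwise (including when the version cannot be parsed).
func psmFlagFor(versionOutput string) string {
	m := versionRe.FindStringSubmatch(versionOutput)
	if m == nil {
		return "-psm"
	}
	major, err := strconv.Atoi(m[1])
	if err != nil || major < 4 {
		return "-psm"
	}
	return "--psm"
}

// detectPsmFlag runs `tesseract --version` and picks the flag from its output.
// CombinedOutput is used so the version is found whether it is printed to
// stdout or stderr.
func detectPsmFlag() string {
	out, err := exec.Command("tesseract", "--version").CombinedOutput()
	if err != nil {
		return "-psm"
	}
	return psmFlagFor(string(out))
}

func main() {
	// Sample version strings; detectPsmFlag() would be used on a real system.
	fmt.Println(psmFlagFor("tesseract 3.04.01"))      // -psm
	fmt.Println(psmFlagFor("tesseract 4.0.0-beta.1")) // --psm
}
```

In tesseract_engine.go the hard-coded flag could then become something like result = append(result, detectPsmFlag()), ideally with the detected value cached so the version check runs only once per worker.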

overwerk commented 5 years ago

we managed to fix it only for tesseract 4.0 by changing it in result = append(result, "--psm")

Could you please say where exactly I need to replace it? I went into the Docker containers of the worker and httpd, located the mentioned file, changed it, and restarted both containers. The error is still the same.

thliew commented 4 years ago

There is a workaround for this issue.

Get into the OCR worker container

List your running Docker containers with the command below:

    docker ps

The NAMES column should contain dockercompose_openocr_1, dockercompose_openocrworker_1, dockercompose_strokewidthtransform_1 and dockercompose_rabbitmq_1, which correspond to the HTTP handler, OCR worker, pre-processor and RabbitMQ respectively.

Use the command below to get into the OCR worker container:

    docker exec -it <container_id> /bin/bash

Edit the source code

    cd /opt/go/src/github.com/tleyden/open-ocr/
    vim tesseract_engine.go

Around line 87, change result = append(result, "-psm") to result = append(result, "--psm").

Recompile the executable

    cd /opt/go/src/github.com/tleyden/open-ocr/cli-worker && go build -v -o open-ocr-worker && cp open-ocr-worker /usr/bin

If you encounter the message below, try restarting the container and running cp again:

    cp: cannot create regular file '/usr/bin/open-ocr-worker': Text file busy

Restart the Docker container

Exit the container and restart it with the command below:

    docker restart <container_id>