Open tleyden opened 10 years ago
I need a similar feature too: I want to be able to replace the tesseract dictionary/wordlist with a completely custom dictionary/wordlist. At this point I have no idea how to do it, even using just straight tesseract (let alone the go-tesseract wrapper). I filed an issue against go-tesseract (https://github.com/GeertJohan/go.tesseract/issues/3) but have yet to dig into it.
If you can provide any insight on how to do this with tesseract on the command line or via the API, it would be helpful.
Specifying a character whitelist on the command line is described here: https://code.google.com/p/tesseract-ocr/wiki/FAQ#How_do_I_recognize_only_digits. In my case, recognizing only digits with the Tesseract 3 command line looks like this:
tesseract.exe idNumber.png output digits && cat output.txt
'digits' is a standard config file in <tesseract_home>/tessdata/configs
Doing the same for a custom whitelist is described here: http://stackoverflow.com/questions/2363490/limit-characters-tesseract-is-looking-for
I think this FAQ addresses the "using a different wordlist" question - https://code.google.com/p/tesseract-ocr/wiki/FAQ#How_do_I_add_just_one_character_or_one_font_to_my_favourite_lang
@barrypitman thanks for these references .. checking this out.
The digits parameter is actually just shorthand for passing this config parameter to tesseract:
tessedit_char_whitelist 0123456789
If you are on tesseract 3.03, you can pass config parameters directly on the command line using "-c" instead of using a config file, e.g.
tesseract image.png stdout -c tessedit_char_whitelist=0123456789 > output.txt
In general, I think it would be useful to be able to pass arbitrary config parameters (tesseract supports a lot of different ones), as well as the "psm" mode (page layout analysis) and maybe language to a tesseract REST API.
I just checked the dockerfile and it looks like it's using the latest from Debian Jessie, which is 3.02-2. Even Ubuntu Trusty is using the same version. So I guess building from source is the only option for using 3.03?
If I use ubuntu:14.04 and apt-get install tesseract-ocr, I get tesseract 3.03.
tesseract 3.02 was released in Oct 2012, whereas tesseract 3.03 was released in January I think. 3.03 includes support for reading files from stdin and output to stdout, as well as passing more options on the command-line, making it easier to work with.
In general, I think it would be useful to be able to pass arbitrary config parameters (tesseract supports a lot of different ones), as well as the "psm" mode (page layout analysis) and maybe language to a tesseract REST API.
Yeah, totally agree. Aside from the tesseract version issue, I guess the next step is to figure out how to propagate these params to go-tesseract or consider using a process spawning approach rather than go-tesseract.
If I use ubuntu:14.04 and apt-get install tesseract-ocr, I get tesseract 3.03.
Actually, I take back what I said about Debian Jessie then .. because for that package (tesseract-ocr), it's also using 3.03.
I think I get it now. There are two packages, one with the engine (3.02-2) and one with the CLI tool (3.03).
I think I have a solution.
Since OpenOCR supports pluggable "engines" (eg, in the future I want to support totally different engines such as Ocropus), I can make a new engine called "tesseract-exec", which will call tesseract more directly via exec until go-tesseract has a clear way of passing these kinds of config options.
And there would be another param in the JSON POST request called "engine-args", which would get passed through to the "tesseract-exec" engine, which could then pass to the tesseract process invoked via exec.
@barrypitman I made some progress on the https://github.com/tleyden/open-ocr/tree/feature/tesseract_exec_engine branch. Gonna have to pick this up later though .. sigh.
In this code: https://github.com/tleyden/open-ocr/blob/feature/tesseract_exec_engine/tesseract_engine_exec.go#L89
I experimented with
cmd := exec.Command("tesseract", inputFilename, tmpOutFileBaseName, "-c", "tessedit_char_whitelist=0123456789")
and it worked.
{ 011 2111 01133126 10031 31 111165 1 01 116 1 1 61 11635 1211 11 116 811111 13126 13
1 1 3 3161 116 31 211113 1131116 1211 1 1 5 5 3711 211 1311 16 112111135 1121 16 0 1 3
001111 053 1 0 311 1121111111161 6113111012813 5111 21 116 11111 161 8001 6 111 116 63211111 16
1 610 1 1121 16 118611 1 1155 31 1 0115 1131 01 1 1 01 31 211113 112111133
}
@barrypitman nearly done with this, should be pushing a new version and updating docs in the next hour or so.
This is now finished and pushed to github, new docker image available on dockerhub.
Here's an example:
$ curl -X POST -H "Content-Type: application/json" -d '{"img_url":"http://bit.ly/ocrimage","engine":"tesseract", "engine_args":{"config_vars":{"tessedit_char_whitelist":"0123456789"}, "psm":"3"}}' http://$DOCKER_HOST:$HTTP_PORT/ocr
Response:
011 2111 01133126 10031 31 111165 1 01 116 1 1 61 11635 1211 11 116 811111 13126 13
1 1 3 3161 116 31 211113 1131116 1211 1 1 5 5 3711 211 1311 16 112111135 1121 16 0 1 3
001111 053 1 0 311 1121111111161 6113111012813 5111 21 116 11111 161 8001 6 111 116 63211111 16
1 610 1 1121 16 118611 1 1155 31 1 0115 1131 01 1 1 01 31 211113 112111133
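The same request can be built from Go instead of curl. This is only a sketch: the JSON field names are copied from the curl example above, while the struct names, the `digitsRequest` helper and the `OCR_ENDPOINT` environment variable are made up for illustration.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"os"
)

// Field names mirror the JSON in the curl example; the Go type names are hypothetical.
type engineArgs struct {
	ConfigVars map[string]string `json:"config_vars,omitempty"`
	PSM        string            `json:"psm,omitempty"`
}

type ocrRequest struct {
	ImgURL     string     `json:"img_url"`
	Engine     string     `json:"engine"`
	EngineArgs engineArgs `json:"engine_args"`
}

// digitsRequest (hypothetical helper) encodes a digits-only request for imgURL.
func digitsRequest(imgURL string) ([]byte, error) {
	return json.Marshal(ocrRequest{
		ImgURL: imgURL,
		Engine: "tesseract",
		EngineArgs: engineArgs{
			ConfigVars: map[string]string{"tessedit_char_whitelist": "0123456789"},
			PSM:        "3",
		},
	})
}

func main() {
	body, err := digitsRequest("http://bit.ly/ocrimage")
	if err != nil {
		fmt.Fprintln(os.Stderr, "encode failed:", err)
		return
	}
	// OCR_ENDPOINT is an assumed variable holding the full /ocr URL of the server.
	endpoint := os.Getenv("OCR_ENDPOINT")
	if endpoint == "" {
		fmt.Println(string(body)) // no server configured: just show the payload
		return
	}
	resp, err := http.Post(endpoint, "application/json", bytes.NewReader(body))
	if err != nil {
		fmt.Fprintln(os.Stderr, "request failed:", err)
		return
	}
	defer resp.Body.Close()
	text, err := io.ReadAll(resp.Body)
	if err != nil {
		fmt.Fprintln(os.Stderr, "read failed:", err)
		return
	}
	fmt.Print(string(text))
}
```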
Now there are two engines available:
API docs coming soon.
/cc @barrypitman
Unfortunately, config_vars do not seem to work with the tleyden5iwx/open-ocr-2 Docker image. Is there any way to pass a char whitelist when using tleyden5iwx/open-ocr-2? Maybe inside the docker-compose file?
Can you open a new issue?
Feature request from @barrypitman on twitter: