Ability to provide a character whitelist

tleyden commented 10 years ago

Feature request from @barrypitman on twitter:

@OpenOCR cool project! How would I go about providing a character whitelist? And it'd be great to be able to upload a file, not just URL
— Barry Pitman (@barrypitman) June 25, 2014

tleyden commented 10 years ago

I need a similar feature too: I want to be able to replace the tesseract dictionary/wordlist with a completely custom dictionary/wordlist. At this point I have no idea how to do it, even using just straight tesseract (let alone the go-tesseract wrapper). I filed an issue against go-tesseract (https://github.com/GeertJohan/go.tesseract/issues/3) but have yet to dig into it.

tleyden commented 10 years ago

If you can provide any insight on how to do this with tesseract on the command line or api, it would be helpful.

barrypitman commented 10 years ago

Specifying a character whitelist on the command line is described here: https://code.google.com/p/tesseract-ocr/wiki/FAQ#How_do_I_recognize_only_digits. In my case, recognizing only digits with the Tesseract 3 command line looks like this:

tesseract.exe idNumber.png output digits && cat output.txt

'digits' is a standard config file in <tesseract_home>/tessdata/configs. Doing the same for a custom whitelist is described here: http://stackoverflow.com/questions/2363490/limit-characters-tesseract-is-looking-for

I think this FAQ addresses the "using a different wordlist" question - https://code.google.com/p/tesseract-ocr/wiki/FAQ#How_do_I_add_just_one_character_or_one_font_to_my_favourite_lang

tleyden commented 10 years ago

@barrypitman thanks for these references .. checking this out.

barrypitman commented 10 years ago

The digits parameter is actually just shorthand for passing this config parameter to tesseract: tessedit_char_whitelist 0123456789

If you are on tesseract 3.03, you can pass config parameters directly on the command line using "-c" instead of using a config file, e.g.:

tesseract image.png stdout -c tessedit_char_whitelist=0123456789 > output.txt

In general, I think it would be useful to be able to pass arbitrary config parameters (tesseract supports a lot of different ones), as well as the "psm" mode (page layout analysis) and maybe language to a tesseract REST API.

tleyden commented 10 years ago

I just checked the Dockerfile and it looks like it's using the latest from Debian Jessie, which is 3.02-2. Even Ubuntu Trusty is using the same version. So I guess building from source is the only option for using 3.03?

barrypitman commented 10 years ago

If I use ubuntu:14.04 and apt-get install tesseract-ocr, I get tesseract 3.03.

tesseract 3.02 was released in Oct 2012, whereas tesseract 3.03 was released in January I think. 3.03 includes support for reading files from stdin and output to stdout, as well as passing more options on the command-line, making it easier to work with.

tleyden commented 10 years ago

In general, I think it would be useful to be able to pass arbitrary config parameters (tesseract supports a lot of different ones), as well as the "psm" mode (page layout analysis) and maybe language to a tesseract REST API.

Yeah, totally agree. Aside from the tesseract version issue, I guess the next step is to figure out how to propagate these params to go-tesseract or consider using a process spawning approach rather than go-tesseract.

If I use ubuntu:14.04 and apt-get install tesseract-ocr, I get tesseract 3.03.

Actually, I take back what I said about Debian Jessie then .. because for that package (tesseract-ocr), it's also using 3.03

I think I get it now. There are two packages: one with the engine (3.02-2) and one with the CLI tool (3.03).

tleyden commented 10 years ago

I think I have a solution.

Since OpenOCR supports pluggable "engines" (eg, in the future I want to support totally different engines such as Ocropus), I can make a new engine called "tesseract-exec", which will call tesseract more directly via exec until go-tesseract has a clear way of passing these kinds of config options.

And there would be another param in the JSON POST request called "engine-args", which would get passed through to the "tesseract-exec" engine, which could then pass to the tesseract process invoked via exec.

tleyden commented 10 years ago

@barrypitman made some progress on the https://github.com/tleyden/open-ocr/tree/feature/tesseract_exec_engine branch. Gonna have to pick this up later though .. sigh.

tleyden commented 10 years ago

In this code: https://github.com/tleyden/open-ocr/blob/feature/tesseract_exec_engine/tesseract_engine_exec.go#L89

I experimented with

cmd := exec.Command("tesseract", inputFilename, tmpOutFileBaseName, "-c", "tessedit_char_whitelist=0123456789")

and it worked.

{ 011  2111 01133126 10031  31  111165 1 01   116 1 1 61 11635  1211 11  116  811111 13126 13
1 1 3 3161   116  31 211113 1131116  1211  1  1 5  5 3711   211 1311 16 112111135 1121 16  0 1 3
001111 053 1 0  311 1121111111161   6113111012813 5111 21  116 11111 161 8001 6  111  116 63211111 16
1 610  1 1121 16 118611  1 1155   31  1 0115  1131   01 1  1 01   31 211113 112111133

}

tleyden commented 10 years ago

@barrypitman nearly done with this, should be pushing a new version and updating docs in the next hour or so.

tleyden commented 10 years ago

This is now finished and pushed to GitHub, and a new Docker image is available on Docker Hub.

Here's an example:

$ curl -X POST -H "Content-Type: application/json" -d '{"img_url":"http://bit.ly/ocrimage","engine":"tesseract",  "engine_args":{"config_vars":{"tessedit_char_whitelist":"0123456789"}, "psm":"3"}}' http://$DOCKER_HOST:$HTTP_PORT/ocr

Response:

 011  2111 01133126 10031  31  111165 1 01   116 1 1 61 11635  1211 11  116  811111 13126 13
1 1 3 3161   116  31 211113 1131116  1211  1  1 5  5 3711   211 1311 16 112111135 1121 16  0 1 3
001111 053 1 0  311 1121111111161   6113111012813 5111 21  116 11111 161 8001 6  111  116 63211111 16
1 610  1 1121 16 118611  1 1155   31  1 0115  1131   01 1  1 01   31 211113 112111133

Now there are two engines available:

tesseract - the new version (default), which calls tesseract via exec and is able to pass -c and -psm args
go_tesseract - the old version, which calls tesseract via go-tesseract, and cannot accept -c or -psm args.

API docs coming soon.

/cc @barrypitman

tleyden commented 10 years ago

REST API docs:

rendered raw RAML

danielpater commented 6 years ago

Unfortunately, config_vars do not seem to work with the tleyden5iwx/open-ocr-2 Docker image. Is there any way to pass a char whitelist when using tleyden5iwx/open-ocr-2? Maybe inside the docker-compose file?

tleyden commented 6 years ago

Can you open a new issue?

danielpater commented 6 years ago

https://github.com/tleyden/open-ocr/issues/99

tleyden / open-ocr

Ability to provide a character whitelist #8