iseahound / Vis2

Simple OCR using Tesseract

Reading some numbers outputs symbols #4

Open Bindslev opened 5 years ago

Bindslev commented 5 years ago

Hello. The values I am trying to capture with your tool contain only digits, never letters or symbols. However, the current version sometimes reads some of the digits as symbols instead. I suppose that if it were looking only for numbers, and not for symbols, it would have a higher success rate in my case.

Is there a way to tell it to look only for numbers (or maybe numbers and letters) and nothing else?

Super great tool! Thank you.

JessicaYeh commented 5 years ago

I ran into the same problem. Under the hood this tool uses Tesseract for OCR, so I first tried modifying the command it runs here https://github.com/iseahound/Vis2/blob/72698e859ace7ad3118589706ffed4cdfd81e78a/lib/Vis2.ahk#L2114-L2116 to include the option -c tessedit_char_whitelist=0123456789, so that only those characters can appear in the output. This had no effect.

I searched for the problem, and it seems to be a common one that has not been fixed yet. The suggested workaround is to add --oem 0 to use the legacy Tesseract engine; see https://github.com/tesseract-ocr/tesseract/issues/751. Maybe I wasn't putting that in the correct place in the command, but everything I tried just crashed it.

What did work: at the bottom of the comments in that issue, someone linked to a repo that contains trained data files for digits only, with variants that also include dots, commas, etc. Download the file you are interested in and drop it into bin/tesseract/tessdata_best. Then, when you call the OCR function, pass the language parameter, making sure the language matches the name of the file you downloaded. I've been using it for a couple of days now and it seems to work fine.
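For anyone comparing the two approaches, here is a rough sketch of what the resulting Tesseract command lines could look like. It is written in Python only to show the argument order, and it is not how Vis2 builds its command. The helper name build_tesseract_command, the image name numbers.png, and the language name digits are all made up for illustration; use the actual name of the .traineddata file you downloaded. One possible reason for the crashes with --oem 0 is that the tessdata_best models contain only LSTM data, so the legacy engine has nothing to load from them, but I have not confirmed that for this setup.

```python
from typing import List, Optional


def build_tesseract_command(
    image_path: str,
    language: str = "eng",
    whitelist: Optional[str] = None,
    legacy_engine: bool = False,
    tessdata_dir: Optional[str] = None,
) -> List[str]:
    """Hypothetical helper: assemble an argument list for the tesseract CLI."""
    # "stdout" as the output base makes tesseract print the recognized text.
    cmd = ["tesseract", image_path, "stdout"]
    if tessdata_dir:
        # Folder that holds the .traineddata files.
        cmd += ["--tessdata-dir", tessdata_dir]
    # The language must match the .traineddata file name without its extension.
    cmd += ["-l", language]
    if legacy_engine:
        # Engine mode 0 selects the legacy (non-LSTM) recognizer.
        cmd += ["--oem", "0"]
    if whitelist:
        # Restrict the output to the listed characters.
        cmd += ["-c", f"tessedit_char_whitelist={whitelist}"]
    return cmd


# Approach 1: character whitelist combined with the legacy engine.
whitelist_cmd = build_tesseract_command(
    "numbers.png", whitelist="0123456789", legacy_engine=True
)

# Approach 2: a digits-only trained data file ("digits" is an assumed name).
digits_cmd = build_tesseract_command(
    "numbers.png", language="digits", tessdata_dir="bin/tesseract/tessdata_best"
)

print(" ".join(whitelist_cmd))
print(" ".join(digits_cmd))
```

The second command is the one that corresponds to what worked for me: no whitelist and no engine switch, only a different language.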