tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

tesseract fails to read simple numbers #4285

Open embeh opened 1 month ago

embeh commented 1 month ago

Current Behavior

I am using pytesseract (which calls /usr/bin/tesseract) to recognize numbers of a gas meter. Unfortunately, this very often fails to read most numbers and is very unreliable.

The actual command to get the number string from the image is pytesseract.image_to_string(img, lang='eng', config='--dpi 70 --psm 8 -c tessedit_char_whitelist=,0123456789')

Here is an example image (after some image processing): 20240714-162351_08_ocr

When running this through tesseract (as described above), I just get "2734"... :-(

Any ideas how to improve this, given that there never will be anything but numbers from 0-9 in the image...?

Expected Behavior

Correctly read the numbers. For the image example, this should be "4428734"

Suggested Fix

No response

tesseract -v

tesseract 4.1.1
leptonica-1.79.0
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 2.0.3) : libpng 1.6.37 : libtiff 4.1.0 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.1
Found AVX2
Found AVX
Found FMA
Found SSE
Found libarchive 3.4.0 zlib/1.2.11 liblzma/5.2.4 bz2lib/1.0.8 liblz4/1.9.2 libzstd/1.4.4

Operating System

No response

Other Operating System

Ubuntu 20

uname -a

Linux myhost 4.4.0-19041-Microsoft #4355-Microsoft Thu Apr 12 17:37:00 PST 2024 x86_64 x86_64 x86_64 GNU/Linux

Compiler

No response

CPU

No response

Virtualization / Containers

Ubuntu in WSL2

Other Information

No response

embeh commented 1 month ago

OK, the page segmentation mode seems to be the issue here.

Replacing --psm 8 with --psm 7 produces much better results (so does --psm 11 but none of the others) - but I have no idea why. PSM 8 is advertised as "single word...", isn't that what we have here?
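For anyone who wants to reproduce the comparison, here is a minimal sketch that runs one image through several page segmentation modes with the same settings as the command in the report. It assumes pytesseract is installed; the helper names `build_config` and `compare_psm` are made up for illustration, and the `ocr` parameter is only there so the loop can be tried without a Tesseract install.

```python
from typing import Callable, Dict, Iterable, Optional

DIGITS = "0123456789"


def build_config(psm: int, dpi: int = 70, whitelist: str = "," + DIGITS) -> str:
    """Build the config string from the report for a given page segmentation mode."""
    return f"--dpi {dpi} --psm {psm} -c tessedit_char_whitelist={whitelist}"


def compare_psm(image, psms: Iterable[int] = (7, 8, 11),
                ocr: Optional[Callable[..., str]] = None) -> Dict[int, str]:
    """Run the same image through several PSMs and collect the stripped output."""
    if ocr is None:
        # Imported lazily so the helper can be exercised with a stand-in OCR function.
        import pytesseract
        ocr = pytesseract.image_to_string
    return {
        psm: ocr(image, lang="eng", config=build_config(psm)).strip()
        for psm in psms
    }
```

Calling `compare_psm` on the preprocessed image (for example one loaded with Pillow's `Image.open`) returns a dict keyed by PSM, which makes it easy to post the per-mode output side by side.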

DominicMukilan commented 1 month ago

Why not close the issue if it's resolved?

embeh commented 1 month ago

Well, I think psm 8 should be able to handle this, too, no?

v3ss0n commented 1 month ago

It is still an issue. The Tesseract LSTM engine has a very hard time recognizing very simple numbers, while PaddlePaddleOCR recognizes them well.

OCRCut

here is the result

7% 7% 23
6 6 8

psm 8 doesn't help

The legacy engine does better on numbers, but it's totally screwed on alphabets.

uttaran-das commented 1 month ago

Hi @embeh , what kind of image processing techniques did you use?

embeh commented 1 month ago

Hi @embeh , what kind of image processing techniques did you use?

A few simple opencv filters to crop, rotate, deskew the images and to erode some small pixel islands. I can dig up the exact commands if this helps?

uttaran-das commented 1 month ago

Hi @embeh , what kind of image processing techniques did you use?

A few simple opencv filters to crop, rotate, deskew the images and to erode some small pixel islands. I can dig up the exact commands if this helps?

Try to increase the contrast between the numbers and the background to make them more distinct. This might help. No need for the commands, I was just interested in the processings you already did.

embeh commented 1 month ago

Try to increase the contrast between the numbers and the background to make them more distinct. This might help. No need for the commands, I was just interested in the processings you already did.

I don't really understand the motivation. If, for the same pixels, psm 7 works fine but psm 8 does not - why would a change in the image processing make a difference?

In addition, the contrast is as big as it can be: the background is pure white, the text is fully black, i.e. it is a binary image. Any grey you might see is only due to how github renders the image.

amitdo commented 2 weeks ago

PSM 8 is advertised as "single word...", isn't that what we have here?

What we have here is a line with several digits separated by spaces. IMO, there is no good reason to consider this line as one word.

If it was 'H e l l o' then you could call it a word, but inside a text line, Tesseract will still consider any big enough horizontal white space as a word separator.

amitdo commented 2 weeks ago

tesseract 4.1.1 is too old and we don't support it.

You said you get a better result with psm 7, but you didn't provide the output with this psm.

embeh commented 2 weeks ago

tesseract 4.1.1 is too old and we don't support it.

OK. Unfortunately that seems to be the latest offered by the default Ubuntu repository (and pytesseract?).

You said you get a better result with psm 7, but you didn't provide the output with this psm.

--psm 7 produces the output "4428734"
--psm 8 produces the output "4L2B734"

Both were run on the identical image file. You should be able to reproduce this by downloading the image above and running it through tesseract.

So the result is not completely wrong, and it seems not to force the result to multiple words or such. It just messes up the "4" and the "8".

What we have here is a line with several digits separated by spaces. IMO, there is no good reason to consider this line as one word.

OK. These numbers come from an analog counter (think old car's mileage counter), so they are rather "monospaced". I certainly could use image processing to squeeze them together some more but what makes me wonder is that psm 7 simply does the job without such hacks.

Don't get me wrong - I found a solution that works for me; now all I am trying is to provide feedback to help make this an even better piece of software...

embeh commented 2 weeks ago

PSM 8 is advertised as "single word...", isn't that what we have here?

What we have here is a line with several digits separated by spaces. IMO, there is no good reason to consider this line as one word.

I just did a test and manually moved the individual digits closer to each other (without changing any of the black pixels) : image

...and you are correct! Now I get this:

--psm 7: "4428734"
--psm 8: "4428734"

So both modes report the same correct number, and the only thing that changed is the spacing. Interesting!
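The manual experiment could be automated roughly as follows. This is only a sketch of the idea, not the procedure used for the test image: it treats the binary image as rows of pixel values (0 for ink, 255 for background), and the function name `squeeze_gaps` and the `max_gap` threshold are assumptions chosen for illustration. What gap width Tesseract actually treats as a word separator is not stated in this thread, so `max_gap` would have to be tuned by trial.

```python
from typing import List

Pixels = List[List[int]]  # rows of pixel values; 0 = black ink, 255 = white background


def squeeze_gaps(rows: Pixels, max_gap: int = 2, background: int = 255) -> Pixels:
    """Shorten every run of blank columns to at most max_gap columns.

    Ink pixels are never modified; only all-background columns are dropped.
    """
    if not rows:
        return []
    width = len(rows[0])
    keep = []       # indices of the columns that survive
    blank_run = 0   # length of the current run of blank columns
    for x in range(width):
        is_blank = all(row[x] == background for row in rows)
        blank_run = blank_run + 1 if is_blank else 0
        if blank_run <= max_gap:
            keep.append(x)
    return [[row[x] for x in keep] for row in rows]
```

Because only all-white columns are removed, the glyph shapes stay exactly as they were, which matches the constraint of the manual test (no black pixels changed).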

amitdo commented 2 weeks ago

For psm 8 with the first image, let's say there is room for improvement...

Tesseract is very popular open source software. We get a lot of questions, bug reports and suggestions, but the team is tiny (4 people currently) and we're all volunteers.