Open embeh opened 4 months ago
OK, the page segmentation mode seems to be the issue here.
Replacing --psm 8 with --psm 7 produces much better results (so does --psm 11, but none of the others) - but I have no idea why.
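For anyone who wants to reproduce the comparison, here is a minimal sketch that runs one image through several page segmentation modes. It assumes pytesseract and Pillow are installed. `build_config` and `compare_psm` are hypothetical helper names, and `meter.png` is a placeholder path. The `--dpi` value and the whitelist are copied from the command in the issue description.

```python
def build_config(psm, dpi=70, whitelist=",0123456789"):
    """Build a Tesseract config string for one page segmentation mode."""
    return f"--dpi {dpi} --psm {psm} -c tessedit_char_whitelist={whitelist}"


def compare_psm(image_path, modes=(7, 8, 11)):
    """Run OCR once per mode and return {psm: recognized text}."""
    # Imported lazily so build_config() works without the OCR stack installed.
    import pytesseract
    from PIL import Image

    img = Image.open(image_path)
    results = {}
    for psm in modes:
        text = pytesseract.image_to_string(img, lang="eng", config=build_config(psm))
        results[psm] = text.strip()
    return results


# Example call (placeholder file name):
# for psm, text in compare_psm("meter.png").items():
#     print(f"--psm {psm}: {text!r}")
```

Keeping the config in one helper means only the mode changes between runs, so any difference in the output can be attributed to the segmentation mode.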
PSM 8 is advertised as "single word...", isn't that what we have here?
Why not close the issue if it's resolved?
Well, I think psm 8 should be able to handle this, too, no?
It is still an issue. The Tesseract LSTM engine has a very hard time recognizing very simple numbers, while PaddlePaddleOCR recognizes them well.
Here is the result:
7% 7% 23
6 6 8
psm 8 doesn't help.
The legacy engine does better on numbers, but it is totally screwed on letters.
Hi @embeh, what kind of image processing techniques did you use?
A few simple opencv filters to crop, rotate, deskew the images and to erode some small pixel islands. I can dig up the exact commands if this helps?
Try to increase the contrast between the numbers and the background to make them more distinct. This might help. No need for the commands, I was just interested in the processing you had already done.
I don't really understand the motivation. If, for the same pixels, psm 7 works fine but psm 8 does not - why would a change in the image processing make a difference?
In addition, the contrast is as high as it can be: the background is pure white, the text is fully black, i.e. it is a binary image. Any grey you might see is only due to how GitHub renders the image.
PSM 8 is advertised as "single word...", isn't that what we have here?
What we have here is a line with several digits separated by spaces. IMO, there is no good reason to consider this line as one word.
If it was 'H e l l o' then you could call it a word, but inside a text line, Tesseract will still consider any big enough horizontal white space as a word separator.
tesseract 4.1.1 is too old and we don't support it.
You said you get a better result with psm 7, but you didn't provide the output with this psm.
tesseract 4.1.1 is too old and we don't support it.
OK. Unfortunately that seems to be the latest offered by the default Ubuntu repository (and pytesseract?).
You said you get a better result with psm 7, but you didn't provide the output with this psm.
--psm 7 produces the output "4428734"
--psm 8 produces the output "4L2B734"
Both were run on the identical image file. You should be able to reproduce this by downloading the image above and running it through tesseract.
So the result is not completely wrong, and it does not seem to force the result into multiple words or the like. It just messes up the "4" and the "8".
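To make the comparison of the two outputs explicit, a small character-by-character diff can be used. This is only an illustrative sketch. `diff_positions` is a hypothetical helper, not part of pytesseract.

```python
def diff_positions(expected, actual):
    """Return (index, expected_char, actual_char) for every mismatch."""
    if len(expected) != len(actual):
        raise ValueError("strings differ in length; a positional diff is not meaningful")
    return [(i, e, a) for i, (e, a) in enumerate(zip(expected, actual)) if e != a]


# The --psm 7 and --psm 8 outputs quoted above:
print(diff_positions("4428734", "4L2B734"))
# [(1, '4', 'L'), (3, '8', 'B')]
```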
What we have here is a line with several digits separated by spaces. IMO, there is no good reason to consider this line as one word.
OK. These numbers come from an analog counter (think old car's mileage counter), so they are rather "monospaced". I certainly could use image processing to squeeze them together some more but what makes me wonder is that psm 7 simply does the job without such hacks.
Don't get me wrong - I found a solution that works for me; now all I am trying to do is provide feedback to help make this an even better piece of software...
PSM 8 is advertised as "single word...", isn't that what we have here?
What we have here is a line with several digits separated by spaces. IMO, there is no good reason to consider this line as one word.
I just did a test and manually moved the individual digits closer to each other (without changing any of the black pixels):
...and you are correct! Now I get this:
--psm 7: "4428734"
--psm 8: "4428734"
So both report the same correct numbers only because of the spacing. Interesting!
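The manual step of moving the digits closer together could be automated along these lines. This is a rough sketch, not what was actually done for the test above: `squeeze_gaps` is a hypothetical helper, the `max_gap` default is an arbitrary assumption, and the image is assumed to be a binary one given as a list of pixel rows (0 = black, 255 = white). It is written in plain Python so it runs without OpenCV or NumPy.

```python
def squeeze_gaps(rows, max_gap=2, white=255):
    """Shrink every run of all-white columns to at most max_gap columns.

    rows is a list of equal-length pixel rows. Columns that contain any
    non-white pixel are copied unchanged, so the glyphs are not altered.
    """
    if not rows:
        return []
    width = len(rows[0])
    keep = []       # indices of the columns that survive
    blank_run = 0   # length of the current run of blank columns
    for x in range(width):
        if all(row[x] == white for row in rows):
            blank_run += 1
            if blank_run <= max_gap:
                keep.append(x)
        else:
            blank_run = 0
            keep.append(x)
    return [[row[x] for x in keep] for row in rows]
```

With a 2-D grayscale NumPy array `img`, the input could be obtained with `img.tolist()`. Note that the left and right margins are also blank runs, so they get trimmed to `max_gap` columns as well.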
For psm 8 with the first image, let's say there is room for improvement...
Tesseract is very popular open source software. We get a lot of questions, bug reports and suggestions, but the team is tiny (4 people currently) and we're all volunteers.
Current Behavior
I am using pytesseract (which calls /usr/bin/tesseract) to recognize the numbers on a gas meter. Unfortunately, this very often fails to read most numbers and is very unreliable. The actual command to get the number string from the image is:
pytesseract.image_to_string(img, lang='eng', config='--dpi 70 --psm 8 -c tessedit_char_whitelist=,0123456789')
Here is an example image (after some image processing):
When running this through tesseract (as described above), I just get "2734"... :-(
Any ideas how to improve this, given that there never will be anything but numbers from 0-9 in the image...?
Expected Behavior
Correctly read the numbers. For the image example, this should be "4428734"
Suggested Fix
No response
tesseract -v
tesseract 4.1.1 leptonica-1.79.0 libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 2.0.3) : libpng 1.6.37 : libtiff 4.1.0 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.1 Found AVX2 Found AVX Found FMA Found SSE Found libarchive 3.4.0 zlib/1.2.11 liblzma/5.2.4 bz2lib/1.0.8 liblz4/1.9.2 libzstd/1.4.4
Operating System
No response
Other Operating System
Ubuntu 20
uname -a
Linux myhost 4.4.0-19041-Microsoft #4355-Microsoft Thu Apr 12 17:37:00 PST 2024 x86_64 x86_64 x86_64 GNU/Linux
Compiler
No response
CPU
No response
Virtualization / Containers
Ubuntu in WSL2
Other Information
No response