tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

single digits not getting recognized #2389

Open Shreeshrii opened 5 years ago

Shreeshrii commented 5 years ago

tesseract -v
tesseract 4.1.0-rc1-255-g332a1
 leptonica-1.76.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.4 : libopenjp2 2.3.0

Please see the issue opened by @jandier with a number of images which are NOT being recognized or being recognized incorrectly. https://github.com/Shreeshrii/tessdata_shreetest/issues/5#issuecomment-483053018

Shreeshrii commented 5 years ago

Using the finetuned digits traineddata gives slightly better results in some cases, but recognition still fails with the default --psm.

This issue with non-recognition of small images has also been reported elsewhere. @stweil @bertsky Any suggestions for improving this?

Shreeshrii commented 5 years ago

Here is the output for 0-9.png and 06.jpg (different style and size of 6).

The digits config file which uses the whitelist feature improves the result. Thanks, @bertsky.
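
For anyone who wants to reproduce this comparison: the exact commands behind the log are not shown in this thread, so the following is only a rough sketch of how the four runs per image (PSM 3 and PSM 8, each with and without the `digits` config) could be scripted. The helper names `build_tesseract_cmd` and `run_matrix` are hypothetical, and the sketch assumes that `tesseract` is on the PATH and that the stock `digits` config file is installed under `tessdata/configs`.

```python
import subprocess
from typing import Dict, List, Optional, Tuple


def build_tesseract_cmd(image: str, psm: int, use_digits_config: bool = False,
                        lang: str = "eng", oem: int = 1,
                        tessdata_dir: Optional[str] = None) -> List[str]:
    """Build a tesseract command line that prints recognized text to stdout.

    Hypothetical helper: the defaults mirror the log headers (OEM 1, LANG eng).
    """
    cmd = ["tesseract", image, "stdout", "-l", lang,
           "--oem", str(oem), "--psm", str(psm)]
    if tessdata_dir is not None:
        cmd += ["--tessdata-dir", tessdata_dir]
    if use_digits_config:
        # Config file names come last, after all options.
        cmd.append("digits")
    return cmd


def run_matrix(image: str,
               tessdata_dir: Optional[str] = None) -> Dict[Tuple[int, bool], str]:
    """Run PSM 3 and PSM 8, each with and without the digits config."""
    results = {}
    for psm in (3, 8):
        for use_digits in (False, True):
            cmd = build_tesseract_cmd(image, psm, use_digits,
                                      tessdata_dir=tessdata_dir)
            proc = subprocess.run(cmd, capture_output=True, text=True)
            # Recognized text goes to stdout; diagnostics go to stderr.
            results[(psm, use_digits)] = proc.stdout.strip()
    return results
```

Calling `run_matrix("num/6.png", tessdata_dir="tessdata_best")` would return a dict keyed by `(psm, use_digits)`. The directory name here is a placeholder taken from the log header, not a verified path.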

*****  num/06.jpg OEM 1 LANG eng TESSDATA tessdata_best
**** PSM 3 ****
Empty page!!
Empty page!!
**** with digits config ****
Empty page!!
Empty page!!
**** PSM 8 ****
5
**** with digits config ****
5

*****  num/0.png OEM 1 LANG eng TESSDATA tessdata_best
**** PSM 3 ****
Empty page!!
Empty page!!
**** with digits config ****
Empty page!!
Empty page!!
**** PSM 8 ****
Co
**** with digits config ****
0

*****  num/1.png OEM 1 LANG eng TESSDATA tessdata_best
**** PSM 3 ****
Empty page!!
Empty page!!
**** with digits config ****
Empty page!!
Empty page!!
**** PSM 8 ****
IE
**** with digits config ****

*****  num/2.png OEM 1 LANG eng TESSDATA tessdata_best
**** PSM 3 ****
Empty page!!
Empty page!!
**** with digits config ****
Empty page!!
Empty page!!
**** PSM 8 ****
2
**** with digits config ****
2

*****  num/3.png OEM 1 LANG eng TESSDATA tessdata_best
**** PSM 3 ****
Empty page!!
Empty page!!
**** with digits config ****
Empty page!!
Empty page!!
**** PSM 8 ****
3
**** with digits config ****
3

*****  num/4.png OEM 1 LANG eng TESSDATA tessdata_best
**** PSM 3 ****
Empty page!!
Empty page!!
**** with digits config ****
Empty page!!
Empty page!!
**** PSM 8 ****
Ce
**** with digits config ****

*****  num/5.png OEM 1 LANG eng TESSDATA tessdata_best
**** PSM 3 ****
Empty page!!
Empty page!!
**** with digits config ****
Empty page!!
Empty page!!
**** PSM 8 ****
Cs
**** with digits config ****
5

*****  num/6.png OEM 1 LANG eng TESSDATA tessdata_best
**** PSM 3 ****
Empty page!!
Empty page!!
**** with digits config ****
Empty page!!
Empty page!!
**** PSM 8 ****
Ce
**** with digits config ****
6

*****  num/7.png OEM 1 LANG eng TESSDATA tessdata_best
**** PSM 3 ****
Empty page!!
Empty page!!
**** with digits config ****
Empty page!!
Empty page!!
**** PSM 8 ****
7
**** with digits config ****
7

*****  num/8.png OEM 1 LANG eng TESSDATA tessdata_best
**** PSM 3 ****
Empty page!!
Empty page!!
**** with digits config ****
Empty page!!
Empty page!!
**** PSM 8 ****
Cs
**** with digits config ****
8

*****  num/9.png OEM 1 LANG eng TESSDATA tessdata_best
**** PSM 3 ****
Empty page!!
Empty page!!
**** with digits config ****
Empty page!!
Empty page!!
**** PSM 8 ****
Cs
**** with digits config ****

rexlow commented 5 years ago

I noticed the same behavior when a single digit is placed far away from other blocks of characters. Interestingly, Google Cloud Vision sometimes suffers from the same problem.

Shreeshrii commented 5 years ago

The "Empty page" issue is also reported in https://github.com/tesseract-ocr/tesseract/issues/1362