tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

layout: `Empty Page` output for default psm #3670

Open Shreeshrii opened 2 years ago

Shreeshrii commented 2 years ago

For certain images the default psm gives Empty Page as output while --psm 6 and others give the correct result.

I suggest that when the default psm results in Empty Page, Tesseract automatically retry recognition with --psm 6 and print a DEBUG message.
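Until something like this exists in Tesseract itself, the behaviour can be approximated from the outside. What follows is only a minimal shell sketch under stated assumptions, not Tesseract's own logic: the function name `ocr_with_fallback` is made up for illustration, and "empty" is taken to mean that nothing but whitespace was written to stdout.

```shell
# Hypothetical wrapper (not part of Tesseract): retry with --psm 6
# when the default page segmentation mode yields no text.
ocr_with_fallback() {
  img=$1
  # Tesseract writes diagnostics such as "Empty page!!" to stderr; discard them.
  text=$(tesseract "$img" - 2>/dev/null)
  # Assumption: whitespace-only output (including a form feed) means "empty page".
  if [ -z "$(printf '%s' "$text" | tr -d '[:space:]')" ]; then
    echo "DEBUG: no text with default psm, retrying with --psm 6" >&2
    text=$(tesseract "$img" - --psm 6 2>/dev/null)
  fi
  printf '%s\n' "$text"
}
```

It would be called as `ocr_with_fallback eng.Charis_SIL_Italic.exp0_27.png`. The DEBUG line goes to stderr so that it does not mix with the recognized text.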

$ tesseract -v
tesseract 5.0.0-1-g4abb
 leptonica-1.78.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
 Found NEON
 Found OpenMP 201511
 Found libarchive 3.2.2 zlib/1.2.11 liblzma/5.2.2 bz2lib/1.0.6 liblz4/1.7.1
 Found libcurl/7.58.0 NSS/3.35 zlib/1.2.11 libidn2/2.0.4 libpsl/0.19.1 (+libidn2/2.0.4) nghttp2/1.30.0 librtmp/2.3

Example image: eng Charis_SIL_Italic exp0_27

$ tesseract eng.Charis_SIL_Italic.exp0_27.png - --tessdata-dir ~/tessdata
Empty page!!
Empty page!!
$ tesseract eng.Charis_SIL_Italic.exp0_27.png - --tessdata-dir ~/tessdata_best
Empty page!!
Empty page!!
$ tesseract eng.Charis_SIL_Italic.exp0_27.png - --tessdata-dir ~/tessdata_fast
Empty page!!
Empty page!!
$ tesseract eng.Charis_SIL_Italic.exp0_27.png - --psm 0
Too few characters. Skipping this page
Too few characters. Skipping this page
Error during processing.
$ tesseract eng.Charis_SIL_Italic.exp0_27.png - --psm 1
Too few characters. Skipping this page
OSD: Weak margin (0.00) for 4 blob text block, but using orientation anyway: 0
Empty page!!
Too few characters. Skipping this page
OSD: Weak margin (0.00) for 4 blob text block, but using orientation anyway: 0
Empty page!!
$ tesseract eng.Charis_SIL_Italic.exp0_27.png - --psm 2
Empty page!!
$ tesseract eng.Charis_SIL_Italic.exp0_27.png - --psm 3
Empty page!!
Empty page!!
$ tesseract eng.Charis_SIL_Italic.exp0_27.png - --psm 4
Empty page!!
Empty page!!
$ tesseract eng.Charis_SIL_Italic.exp0_27.png - --psm 5
oy
0
0
O
$ tesseract eng.Charis_SIL_Italic.exp0_27.png - --psm 6
6881
$ tesseract eng.Charis_SIL_Italic.exp0_27.png - --psm 7
6881
$ tesseract eng.Charis_SIL_Italic.exp0_27.png - --psm 8
6881
$ tesseract eng.Charis_SIL_Italic.exp0_27.png - --psm 9
6881
$ tesseract eng.Charis_SIL_Italic.exp0_27.png - --psm 10
6881
$ tesseract eng.Charis_SIL_Italic.exp0_27.png - --psm 11
6881
$ tesseract eng.Charis_SIL_Italic.exp0_27.png - --psm 12
Too few characters. Skipping this page
OSD: Weak margin (0.00) for 4 blob text block, but using orientation anyway: 0
6881
$ tesseract eng.Charis_SIL_Italic.exp0_27.png - --psm 13
6881
$

stweil commented 2 years ago

At least for newspapers (where we also see "empty" pages), using --psm 6 is not the correct solution. In those cases, using a different binarization usually helps.

Shreeshrii commented 2 years ago

You are right. --psm 6 will work only if the input is a single line image.

I am finding the issue in about 1% of images generated by tesseract unpack from lstmf files which were generated by text2image. Shouldn't all these files have the same binarization?

Shreeshrii commented 2 years ago

Here is a zip file with some images which have this problem. A few ok images are also included.

EmptyPage.zip.zip

In most cases, these are images containing a single word or number in a large font size. I hope this helps in isolating the cause.

amitdo commented 2 years ago

Empty page!! Empty page!!

Why is this message printed twice?

amitdo commented 2 years ago

Does this also happen with --oem 0?

Shreeshrii commented 2 years ago

Yes, it is also happening with --oem 0.

(base) ubuntu@tesseract-ocr-1:~/TEST$ tesseract eng.Charis_SIL_Italic.exp0_27.png - --oem 0 --tessdata-dir ../tessdata
Empty page!!
Empty page!!

The problem seems to be related to dpi.

(base) ubuntu@tesseract-ocr-1:~/TEST$ tesseract eng.Charis_SIL_Italic.exp0_27.png - --dpi 600
Empty page!!
Empty page!!
(base) ubuntu@tesseract-ocr-1:~/TEST$ tesseract eng.Charis_SIL_Italic.exp0_27.png - --dpi 300
Empty page!!
Empty page!!
(base) ubuntu@tesseract-ocr-1:~/TEST$ tesseract eng.Charis_SIL_Italic.exp0_27.png - --dpi 200
6881
(base) ubuntu@tesseract-ocr-1:~/TEST$ tesseract eng.Charis_SIL_Italic.exp0_27.png - --dpi 250
Empty page!!
Empty page!!
(base) ubuntu@tesseract-ocr-1:~/TEST$ tesseract eng.Charis_SIL_Italic.exp0_27.png - --dpi 150
6881

The image is recognized if I set the dpi to 200 or 150.
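To repeat this check over several values in one go, a small loop helps. This is a rough sketch rather than an established tool: the function name `dpi_sweep` and the tab-separated report format are my own, and it assumes single-line results such as the one for this image.

```shell
# Hypothetical helper: print the OCR result for each candidate DPI value.
dpi_sweep() {
  img=$1; shift
  for dpi in "$@"; do
    # Drop stderr diagnostics and the form feed that ends each page.
    text=$(tesseract "$img" - --dpi "$dpi" 2>/dev/null | tr -d '\f')
    if [ -n "$text" ]; then
      printf '%s\t%s\n' "$dpi" "$text"
    else
      printf '%s\t(empty)\n' "$dpi"
    fi
  done
}
```

For example: `dpi_sweep eng.Charis_SIL_Italic.exp0_27.png 150 200 250 300 600`.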

I tried to display the earlier messages regarding the dpi being used, but they seem to have been suppressed now, even with --loglevel ALL.

(base) ubuntu@tesseract-ocr-1:~/TEST$ tesseract eng.Charis_SIL_Italic.exp0_27.png -  --loglevel ALL
Empty page!!
Empty page!!

amitdo commented 2 years ago

In general, if you know in advance that the input is one line, then you should use --psm 7.

amitdo commented 2 years ago

The dpi in this image (301) is in the valid range, so Tesseract will respect it and will not try to estimate it. That's why there is no warning.

amitdo commented 2 years ago

Your suggestion to make Tesseract do a second try can be improved by taking into account image height and number of blobs. For example, if the image height is below 60 pixels and it has fewer than 100 blobs, Tesseract can try psm 6, and if that also fails it can then try psm 7.
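The decision rule could be sketched roughly as below. This only illustrates the idea and is not existing Tesseract code: `fallback_psms` is a made-up name, the thresholds 60 and 100 are the example values from this comment, and the caller is assumed to already know the image height and blob count (the command line does not report the latter).

```shell
# Hypothetical helper: print the fallback psm values to try, in order,
# for an image with the given height (pixels) and blob count.
fallback_psms() {
  height=$1
  blobs=$2
  # Small image with few blobs: try psm 6 first, then psm 7.
  if [ "$height" -lt 60 ] && [ "$blobs" -lt 100 ]; then
    echo "6 7"
  fi
  # Otherwise print nothing: no second try is suggested.
}
```

A caller would loop over the printed values and stop at the first psm that produces text.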

amitdo commented 2 years ago

Using the API, you can give Tesseract an alternative config file and if recognition fails, Tesseract will do a second try using this config file.

amitdo commented 2 years ago

https://github.com/tesseract-ocr/tesseract/blob/b649222de3fc9270e3e6c5b03b180bf09f4b4f9c/src/api/baseapi.cpp#L1268-L1283

Shreeshrii commented 2 years ago

In general, if you know in advance that the input is one line, then you should use --psm 7.

I am looking for alternative ways to evaluate recognition by different models, since lstmeval does not give accurate results. So I am taking the single-line images used for training and eval by the tesstrain makefile and then evaluating the OCR results with ocrevalUAtion and the ISRI tools. I could use --psm 7 for it. Would that be considered ok as a basis for evaluation?
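Roughly, the batch step I have in mind looks like this. It is a sketch under assumptions: the function name `ocr_line_images` is made up, the line images are assumed to be PNG files in one directory, and the default language `eng` is only a placeholder for the model being evaluated.

```shell
# Hypothetical helper: OCR every line image in a directory with --psm 7.
ocr_line_images() {
  dir=$1
  lang=${2:-eng}
  for img in "$dir"/*.png; do
    [ -e "$img" ] || continue   # the directory had no PNG files
    # Tesseract appends .txt to the output base, so foo.png gives foo.txt.
    tesseract "$img" "${img%.png}" --psm 7 -l "$lang" 2>/dev/null
  done
}
```

The resulting .txt files can then be compared against the ground truth with the evaluation tools.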

amitdo commented 2 years ago

I don't know, you can try and see...

stweil commented 2 years ago

Empty page output for complex newspaper pages is handled in issue #3021.