tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0
60.75k stars 9.35k forks

Tesseract Empty Page #3021

Open M3ssman opened 4 years ago

M3ssman commented 4 years ago

Environment

Current Behavior:

When processing rather large uncompressed TIF files (ca. 80 MB) from the project "Digitalisierung historischer deutscher Zeitschriften", about 5 pages (or even fewer) out of 1000 images yield ALTO files that lack valid OCR data.

When run with tesseract 0046.tif 0046 -l frk alto, it only reports Empty page!! and exits in < 20 seconds. 0046-alto.zip 0046-tif.zip

Generated ALTO-File and TIF-Image included.

Expected Behavior:

Produce ALTO-XML with contents.

Suggested Fix:

No idea.

M3ssman commented 4 years ago

The Link from send.firefox.com is active for 1 day. Afterwards it will disappear.

zdenop commented 4 years ago

Did you try to follow the documentation?

M3ssman commented 4 years ago

@zdenop We're scanning from microfilms using QuantumScan software and do the preprocessing with QuantumProcess, which does a good job with contrast and deskewing. Therefore I wonder why 999 images pass, but a few don't, like this one. Could you already take a closer look at the image? What additional preprocessing would you suggest, if any?

stweil commented 4 years ago

@M3ssman, the link is already inactive. Please attach a sample image to the issue report here.

zdenop commented 4 years ago

I just made a quick test: when I removed the black border and the header, Tesseract produced a result (tested just with English, as I don't have frk installed at the moment). So pre-processing of such images is a must, or you need to implement custom layout detection. For the sake of small size, here is a thumbnail of my test image: image

M3ssman commented 4 years ago

@zdenop Yes, many thanks, this works! But anyway, I wonder why all the other images went fine. They were all produced the same way: never cropped, just plain TIF files with some tuning for contrast and rotation. Tesseract shouldn't be bothered by borders, as it (almost) never is.

M3ssman commented 4 years ago

@stweil Guess what! The original image works, too, if I scale it down 1:4!

Sorry, I didn't realize the link also expires after a single download.

-I'm uploading again right now.-

Sorry for the delay, but I ran into trouble with uploads from home. Now, from the office, it is way faster: 0046.zip

M3ssman commented 4 years ago

@stweil Strange indeed. I cropped only the header of the page and left the left, right and bottom margins as they are, and this version works fine. 0046-headless.zip

M3ssman commented 4 years ago

@zdenop Since the original image itself looks somewhat dark, I used ImageMagick 6.9 to enhance the contrast:

convert 0046.tif -brightness-contrast 25x50 -compress none -colorspace Gray 0046-convert.tif

With this version, without any special handling of the borders, Tesseract no longer reports an empty page but produces quite reasonable OCR output.

zdenop commented 4 years ago

I expect that the file size of 0046-convert.tif is smaller, right?

M3ssman commented 4 years ago

@zdenop This is what exiftool outputs:

original: Megapixels 79.1, Image size 7477x10584, 79151624 Byte filesize
convert: Megapixels 79.1, Image size 7477x10584, 79151434 Byte filesize

So the file sizes differ only slightly.
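The exiftool figures can be cross-checked with a little arithmetic. The sketch below assumes one byte per pixel (8-bit grayscale, uncompressed), which is an assumption and not something exiftool reported here. Under that assumption almost the entire file is raw pixel data, which would explain why the contrast-enhanced copy is nearly the same size.

```python
# Figures copied from the exiftool output quoted in this comment.
width, height = 7477, 10584
original_bytes = 79151624
converted_bytes = 79151434

pixels = width * height
megapixels = pixels / 1e6

# Assumption: 1 byte per pixel (8-bit grayscale, no compression).
# Whatever is left over would be headers and metadata.
overhead_original = original_bytes - pixels
overhead_converted = converted_bytes - pixels

print(f"{pixels} pixels = {megapixels:.1f} MP")
print(f"non-pixel bytes: original {overhead_original}, converted {overhead_converted}")
```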

M3ssman commented 4 years ago

@zdenop Did you also run Tesseract with the original file (without cropping or other types of preprocessing)? What was the outcome? 0046-convert.zip

zdenop commented 4 years ago

Original finished quickly with "empty page" message.

stweil commented 4 years ago

The original page triggers bugs which can be shown by adding -c textord_debug_bugs=1. Tesseract creates boxes (bounding_box_) with a right margin which exceeds the image dimensions (error message Made partition with bad right coord). Those boxes are therefore disregarded. With the following hack the boxes are processed, and text is recognized:

diff --git a/src/textord/colpartition.cpp b/src/textord/colpartition.cpp
index 74f1b1d9..465a1f57 100644
--- a/src/textord/colpartition.cpp
+++ b/src/textord/colpartition.cpp
@@ -353,7 +353,7 @@ bool ColPartition::IsLegal() {
       tprintf("Margins invalid\n");
       Print();
     }
-    return false;  // Margins invalid.
+//    return false;  // Margins invalid.
   }
   if (left_key_ > BoxLeftKey() || right_key_ < BoxRightKey()) {
     if (textord_debug_bugs) {

I think that the right solution would have to find out why Tesseract creates bad bounding boxes and fix that. Maybe it would already help to enforce boxes with valid coordinates.

M3ssman commented 4 years ago

@stweil Many Thanks! By now I've detected already 200+ scans that are considered empty by Tesseract. Therefore I'll try your suggestion in our ULB-Fork and report back hopefully next week!

amitdo commented 4 years ago

Please attach the image to this issue.

https://help.github.com/en/github/managing-your-work-on-github/file-attachments-on-issues-and-pull-requests

stweil commented 4 years ago

The image is rather large, too large to be attached. It's available here: https://ub-backup.bib.uni-mannheim.de/~stweil/tesseract/issues/3021/0046.png.

stweil commented 4 years ago

The bounding boxes with illegal coordinates come from rotation:

(gdb)
#1  0x00000000006311f7 in TBOX::rotate (this=0x60e0006ba830, vec=...) at ../../../src/ccstruct/rect.h:206
206       top_right.rotate (vec);
(gdb) p vec
$22 = (const FCOORD &) @0x7fffffff8ea0: {xcoord = 0.999990165, ycoord = -0.00443068426}
(gdb) p top_right 
$23 = {xcoord = 7523, ycoord = 10551}
(gdb) p bot_left
$24 = {xcoord = 43, ycoord = 9671}

In this case vec indicates that there is nearly no rotation at all, but because of the very large value of ycoord the function ICOORD::rotate calculates a new xcoord which is clearly outside of the image. It looks like ICOORD::rotate might be wrong and need a better implementation.

The current code rotates top right and bottom left around the fixed point (0,0). Maybe this should be changed to use the top left corner as the fixed point. For small coordinates that does not make a large difference, but here it is essential.
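The effect can be reproduced with the numbers from the gdb session. This is a minimal sketch, not Tesseract code: it assumes the usual 2D rotation by a unit vector with rounding to the nearest integer, and the helper names `rotate` and `rotate_about` are made up for illustration. The real ICOORD::rotate may differ in detail.

```python
import math

def rotate(x, y, vx, vy):
    """Rotate (x, y) about (0, 0) by the unit vector (vx, vy), rounding to integers."""
    return (math.floor(x * vx - y * vy + 0.5),
            math.floor(y * vx + x * vy + 0.5))

def rotate_about(x, y, px, py, vx, vy):
    """Rotate (x, y) about the pivot (px, py) instead of the origin."""
    rx, ry = rotate(x - px, y - py, vx, vy)
    return rx + px, ry + py

# Values taken from the gdb output above.
vec = (0.999990165, -0.00443068426)
top_right = (7523, 10551)
bot_left = (43, 9671)
top_left = (bot_left[0], top_right[1])

about_origin = rotate(*top_right, *vec)
about_top_left = rotate_about(*top_right, *top_left, *vec)

print("about (0,0):    ", about_origin)
print("about top left: ", about_top_left)
```

With the origin as pivot, the large ycoord multiplied by the small vec.ycoord shifts xcoord to the right by several dozen pixels. With the top left corner as pivot, xcoord of the top right corner stays where it was.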

amitdo commented 4 years ago

Another command that eliminated the issue:

gm convert 3021.png -bordercolor Black -border 10x10 3021-borderb10.png

stweil commented 4 years ago

It's also sufficient to convert the image to JPEG. The basic issue remains of course and can also result in less obvious problems, for example missing text from smaller parts of a page only. I'd expect that typically in the lower left and right parts of large pages. -c textord_debug_bugs=1 should be the default until that problem is fixed.

stweil commented 4 years ago

I have now tried a modified TBOX::rotate. It not only fixes the empty page problem but also seems to increase the amount of text that is detected at all, so it would be worth trying on other pages, too. The bad news is that the time for processing a page increases from 56 seconds to 219 seconds. Here is the code:

diff --git a/src/ccstruct/rect.h b/src/ccstruct/rect.h
index 58a867e9..e487c8c1 100644
--- a/src/ccstruct/rect.h
+++ b/src/ccstruct/rect.h
@@ -202,9 +202,13 @@ class DLLSYM TBOX  {  // bounding box
     // and top-right corners. Use rotate_large if you want to guarantee
     // that all content is contained within the rotated box.
     void rotate(const FCOORD& vec) {  // by vector
-      bot_left.rotate (vec);
+      ICOORD top_left(bot_left.x(), top_right.y());
+      bot_left -= top_left;
+      bot_left.rotate(vec);
+      bot_left += top_left;
+      top_right -= top_left;
       top_right.rotate (vec);
-      *this = TBOX (bot_left, top_right);
+      top_right += top_left;
     }
     // rotate_large constructs the containing bounding box of all 4
     // corners after rotating them. It therefore guarantees that all

stweil commented 4 years ago

@M3ssman, we also get "Empty page" errors in our newspaper, see example.

https://github.com/stweil/tesseract/tree/fix contains a patch which seems to fix the problem. Maybe it also gets more texts from other large images, but I am still not sure. For images with large width and height, old and new code can get different results. It would help if you (and others) could try the new code and compare the results with the unpatched Tesseract. If the new code never makes things worse, we could apply it.

M3ssman commented 4 years ago

@stweil Sorry for the delay! I just took a quick shot at a single page, and it did produce text lines, which is good per se, but forget about the quality. Tesseract is definitely not happy with this image.

I'll try to do some more testing, as it affects a remarkable number of images, and report back real soon™.

The fix plus textord_debug_bugs=1 produces quite a lot of output. I captured it in a file; maybe you can get some insights from it. tesseract-5.0.0-image-1681877805_J_0112_0068.log

amitdo commented 4 years ago

Another thing that will make it work is binarization.

M3ssman commented 4 years ago

For one of the problematic images I got:

/data/ocr-staging/ocr/1667524704_J_0190/0655.tif => 1667524704_J_0190_0655 => /data/ocr-staging/ocr/empty-pages/1667524704_J_0190_0655
Tesseract Open Source OCR Engine v5.0.0-alpha-754-g0838 with Leptonica
Page 1
Detected 7102 diacritics
index >= 0 && index < line_count:Error:Assert failed:in file src/textord/makerow.cpp, line 802
./tesseract-empty-pages.sh: Zeile 34: 29848 Abgebrochen             (Speicherabzug geschrieben) ${TESS_BIN} "$tiff_path" "${outpath}" --dpi 470 -l frk alto

I will skip this for now and move on.

With many other problem images, the patched Tesseract yields:

Error in boxClipToRectangle: box outside rectangle
Error in pixScanForForeground: invalid box

amitdo commented 4 years ago

Error in boxClipToRectangle: box outside rectangle
Error in pixScanForForeground: invalid box

#427 #468 #1601

amitdo commented 4 years ago

These error messages are produced by Leptonica.

They are triggered by a call to pixClipBoxToForeground()

https://github.com/DanBloomberg/leptonica/blob/bbe289cf3f0fe368d5b9eac64df2ccd6e9b05c56/src/pix5.c#L1956

https://github.com/tesseract-ocr/tesseract/search?q=pixClipBoxToForeground

M3ssman commented 4 years ago

I've run some larger tests with the patch @stweil provided, with the following results:

From 133 images

I ran the 6 problematic pages once more (v4.1.1-rc2-25-g9707 from alex-p with --dpi 470 -l frk alto). This time I got no assertion errors, but 2 pages with text lines and still 4 empty pages. After enhancing brightness (+50) and contrast (between +15 and +25), these 4 pages were also processed without errors.

I'm uncertain how to deal with this. I don't think it's a good idea to silence warnings just to let bad material pass. Second, the assertion error occurs only in the patched version. This error seems really serious, since it even halts execution of my scripts. For now I take the blame myself, given that Tesseract produces text after more advanced preprocessing. Our processes now watch out for those inglorious 844 byte files, but we didn't have this on our agenda before.

@stweil @amitdo @zdenop I'm fine if you close this issue, but if you'd like to, I can provide more testdata.

stweil commented 4 years ago

The "empty page" message means that Tesseract dropped all text boxes because the internal checks decided that they had coordinates which are out of bounds. This might only be the extreme variant of a general problem: maybe Tesseract also drops parts of other pages where it recognizes text, but not all.

That's why it would be important to run OCR on a larger test set with -c textord_debug_bugs=1 to see whether pages with OCR text also show error messages and whether these error messages correspond to missing text boxes on such pages.
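Such a survey could be scripted roughly as follows. This is only a sketch: it assumes that Tesseract writes its diagnostics to stderr, it matches only the messages quoted in this thread, and the names `count_diagnostics` and `run_page` are made up for illustration.

```python
import re
import subprocess
from collections import Counter

# Messages quoted in this thread; Tesseract may print other diagnostics as well.
PATTERNS = {
    "bad_partition": re.compile(r"^Made partition with bad (left|right) coords?"),
    "margins_invalid": re.compile(r"^Margins invalid"),
    "empty_page": re.compile(r"^Empty page!!"),
}

def count_diagnostics(log_text):
    """Count how often each known diagnostic message occurs in a log."""
    counts = Counter()
    for line in log_text.splitlines():
        for name, pattern in PATTERNS.items():
            if pattern.match(line):
                counts[name] += 1
    return counts

def run_page(image_path, lang="frk", tesseract_bin="tesseract"):
    """Run Tesseract on one image (text goes to stdout) and count diagnostics.

    Assumption: the debug messages arrive on stderr.
    """
    result = subprocess.run(
        [tesseract_bin, image_path, "-", "-l", lang, "-c", "textord_debug_bugs=1"],
        capture_output=True, text=True, check=False)
    return count_diagnostics(result.stderr)
```

Pages that produce OCR text but still show a non-zero `bad_partition` count would be the candidates to inspect for missing text boxes.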

M3ssman commented 4 years ago

@stweil I will run the patch with the 130+ images testset and report back early next week.

M3ssman commented 3 years ago

@stweil Sorry for the delay! Is the patched code in the master branch already?

I'd like to bring this issue to an end. As of now (IMHO) we are facing 2 different problems here:

1) Tesseract produces ALTO files without page content. This is a problem for Tesseract users and applications that consume Tesseract's output.
2) Tesseract makes wrong decisions about data validity. This may also affect the general detection algorithm. This is a problem both for OCR engineers and for any downstream users or applications.

To deal with 1), I would appreciate it if Tesseract wrote no output at all and/or printed a warning to stdout. If these options are not worth the additional effort, please let me know. For now I'm checking the size of the ALTO XML. It works, but it feels like treating symptoms.
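As an alternative to checking the file size, the ALTO output could be inspected for actual words. The following is a minimal sketch using only the Python standard library. It relies on ALTO storing recognized words in String elements with a CONTENT attribute; the function name and the sample documents are made up for illustration, and the namespace is ignored so that the check does not depend on the ALTO version.

```python
import xml.etree.ElementTree as ET

def alto_word_count(root):
    """Count String elements with non-empty CONTENT, ignoring the XML namespace."""
    count = 0
    for element in root.iter():
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name == "String" and element.get("CONTENT", "").strip():
            count += 1
    return count

# Hypothetical miniature ALTO documents, for illustration only.
EMPTY_PAGE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#">
  <Layout><Page><PrintSpace/></Page></Layout>
</alto>"""

PAGE_WITH_TEXT = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#">
  <Layout><Page><PrintSpace><TextBlock><TextLine>
    <String CONTENT="genommen"/><SP/><String CONTENT="werden."/>
  </TextLine></TextBlock></PrintSpace></Page></Layout>
</alto>"""

print(alto_word_count(ET.fromstring(EMPTY_PAGE)))
print(alto_word_count(ET.fromstring(PAGE_WITH_TEXT)))
```

For files on disk, `ET.parse(path).getroot()` yields the root element to pass in. A page with a word count of zero can then be routed to extra preprocessing instead of being accepted silently.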

Number 2 seems to be a really big issue that cannot be solved in total right now.

Thanks for all the investigations, @stweil, @zdenop and @amitdo! All your findings point (IMHO) to the category of data errors, since tuning the image data (binarizing, despeckling, etc.) improves Tesseract's analysis in all cases. Therefore I consider this behavior not an intrinsic problem of Tesseract; it's the data.

amitdo commented 3 years ago

With the code from #3418, when Sauvola binarization is used, I don't get "Empty page!!".

stweil commented 2 years ago

I just finished OCR with Tesseract 5.0.0 for a huge number of newspaper scans.

So using a different binarization helps in most cases, but not always.

amitdo commented 2 years ago

Try to convert the jp2 to png. It does not fail for me with your example and method 2.

stweil commented 2 years ago

Thank you, that's interesting. I can reproduce it, and it seems to be related to the image resolution:

The original JP2 image has 300 dpi and fails:

tesseract 0312.jp2 - -l ubma/frak2021-09 -c thresholding_method=2 -c textord_debug_bugs=1
Made partition with bad left coords, 0 > -8
ColPart: (M0-B-8-B-8/-8,5017/5017)->(3599B-3599B-3607M/3599,5071/5071) w-ok=0, v-ok=0, type=0R1, fc=-1, lc=-1, boxes=1 ts=0 bs=0 ls=0 rs=0
Margins invalid
ColPart:E(M0-B-8-B-8/-8,5017/5017)->(3599B-3599B-3607M/3599,5071/5071) w-ok=0, v-ok=0, type=0T1, fc=-1, lc=-1, boxes=0 ts=0 bs=0 ls=0 rs=0
Empty page!!
Made partition with bad left coords, 0 > -8
ColPart: (M0-B-8-B-8/-8,5017/5017)->(3599B-3599B-3607M/3599,5071/5071) w-ok=0, v-ok=0, type=0R1, fc=-1, lc=-1, boxes=1 ts=0 bs=0 ls=0 rs=0
Margins invalid
ColPart:E(M0-B-8-B-8/-8,5017/5017)->(3599B-3599B-3607M/3599,5071/5071) w-ok=0, v-ok=0, type=0T1, fc=-1, lc=-1, boxes=0 ts=0 bs=0 ls=0 rs=0
Empty page!!

Converting the JP2 to PNG with convert removes the resolution information. Tesseract therefore guesses a resolution of 367 dpi and can process the scan:

tesseract 0312.png - -l ubma/frak2021-09 -c thresholding_method=2 -c textord_debug_bugs=1
Estimating resolution as 367
Made partition with bad right coords, 556 < 577
ColPart: (M184-T204-B207/388,4146/4152)->(577B-1252T-556M/398,4194/4179) w-ok=1, v-ok=1, type=1T4, fc=-1, lc=-1, boxes=24 ts=0 bs=0 ls=0 rs=0
[...]
genommen werden. In Beantwortung verſchledener Anfragen erklärte Amts, Wirklichen Geheimen Rats Der nburg über koloniale J güch die vornebme Aufgabe bringt, ſich des Deutſchen Reiches alt
[...]

Processing the original JP2 with an explicit resolution works, too:

tesseract 0312.jp2 - -l ubma/frak2021-09 -c thresholding_method=2 -c textord_debug_bugs=1 --dpi 400
Made partition with bad right coords, 1232 < 1243
[...]
Zenommen werden. In Beantwortung verſchledener Anfragen erklärte i ĩ ü i
[...]

wollmers commented 2 years ago

Thank you, that's interesting. I can reproduce it, and it seems to be related to the image resolution:

The original JP2 image has 300 dpi and fails:

Is it a JPX with mask layer like this https://archive.org/details/bub_gb_qmZyOar8UHwC/page/n71/mode/2up ?

Then try the mask

bub_gb_7sFnWGI31XcC p0069-002

and negate.

CER 14.23 % is not so bad for the quality of the scan.

stweil commented 2 years ago

Where did you get CER 14.23 %?

amitdo commented 2 years ago

@stweil, GIMP reports '72 ppi' for your jp2, but as you said, Tesseract sees it as 300 ppi. IIRC, when GIMP does not find the ppi in the image metadata, it reports 72 ppi.

wollmers commented 2 years ago

Where did you get CER 14.23 %?

Good question ;-) On logical page 47 of Galileo's book.

My comment was meant as: If your jp2 has a mask layer, as jp2 allows many kinds of compressions, then try the mask layer.

The book exists on archive.org in two versions, scanned from two different specimens in different bad conditions:

$ pdfimages -f 69 -l 69 -list bub_gb_7sFnWGI31XcC.pdf 
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
  69     0 image     600   834  rgb     3   8  jpx    no       338  0   200   201 4818B 0.3%
  69     1 image    1800  2500  rgb     3   8  jpx    no       339  0   600   600 13.7K 0.1%
  69     2 mask     1800  2500  -       1   1  jpx    no       339  0   600   600 13.7K 2.5%    <-- the mask

$ pdfimages -f 69 -l 69 bub_gb_7sFnWGI31XcC.pdf bub_gb_7sFnWGI31XcC.p0069

$ ls -la bub_gb_7sFnWGI31XcC.p0069*
bub_gb_7sFnWGI31XcC.p0069-000.ppm
bub_gb_7sFnWGI31XcC.p0069-001.ppm
bub_gb_7sFnWGI31XcC.p0069-002.pbm   <-- the mask

$ convert bub_gb_7sFnWGI31XcC.p0069-002.pbm -negate -density 600x600 -units PixelsPerInch bub_gb_7sFnWGI31XcC.p0069-002.pos.tiff

$ tesseract bub_gb_7sFnWGI31XcC.p0069-002.pos.tiff ...

If I recorded correctly (should write a script for permutations and recording them):

Latin = -l lat
GT4 = -l GT4Hist
ubma = -l ubma/frak2021_0.905_1587027_9141630

CER     variant
0.0567 bub_gb_7sFnWGI31XcC.p0069-002.pos.nopsm.ubma.txt
0.0841 bub_gb_7sFnWGI31XcC.p0069-002.pos.nopsm.GT4.txt
0.1227 bub_gb_7sFnWGI31XcC.p0069-002.pos.nopsm.Latin.txt

0.1618 bub_gb_qmZyOar8UHwC.p0058-002.pos.nopsm.ubma.txt

stweil commented 2 years ago

GIMP reports '72 ppi'

Obviously GIMP ignores the EXIF metadata. GIMP has a menu entry which shows the metadata and also the EXIF part with x/y resolutions of 300 and the resolution unit "inch". exiftool shows resolutions of 118.1102 and the resolution unit "cm" which gives the same DPI value of 300 (118.1102 * 2.54).
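The unit conversion can be written down explicitly. This is a small sketch with a made-up helper name; the only fact it relies on is that one inch is 2.54 cm.

```python
CM_PER_INCH = 2.54

def to_dpi(resolution, unit):
    """Convert a resolution given in pixels per "inch" or per "cm" to dots per inch."""
    if unit == "inch":
        return resolution
    if unit == "cm":
        return resolution * CM_PER_INCH
    raise ValueError(f"unsupported resolution unit: {unit!r}")

# Value reported by exiftool for the JP2 discussed above.
print(round(to_dpi(118.1102, "cm")))
```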

wollmers commented 2 years ago

AFAIK 72 ppi is the default in some image programs. In GIMP it's AFAIR default only in the GUI Image -> change resolution.

EXIF is the wrong place to specify ppi. convert ... -density 600x600 -units PixelsPerInch ... is reliable, but not all image formats can store it.

aved12 commented 2 years ago

try this code @M3ssman

from PIL import Image, ImageEnhance

# Raw strings keep the backslashes in the Windows paths literal.
im = Image.open(r"C:\Users\user\Documents\Lightshot\stry5.png")
# Sharpen first (factor 2), then brighten (factor 3).
enhancer = ImageEnhance.Sharpness(im)
im = enhancer.enhance(2)
enhancer = ImageEnhance.Brightness(im)
im = enhancer.enhance(3)
im.show()
im.save(r"C:\Users\user\Documents\Lightshot\stry7.png")

it finds blobs for all characters