Open M3ssman opened 4 years ago
The Link from send.firefox.com is active for 1 day. Afterwards it will disappear.
Did you try to follow the documentation?
@zdenop We're scanning from microfilm using the QuantumScan software and do preprocessing with QuantumProcess, which does a good job with contrast and deskewing. Therefore I wonder why 999 images pass, but a few don't, like this one. Have you been able to take a closer look at the image? What additional preprocessing would you suggest, if any?
@M3ssman, the link is already inactive. Please attach a sample image to the issue report here.
I just made a quick test: when I removed the black border and the header, Tesseract produced a result (tested only with English, as I do not have frk installed at the moment). So pre-processing of such images is a must, or you need to implement custom layout detection. For the sake of small size, here is a thumbnail of my test image:
@zdenop Yes, many thanks, this works! But anyway, I wonder why all the other images went fine. They are all created the same way: never cropped, just plain TIF files with some tuning for contrast and rotation. Tesseract shouldn't be bothered by borders, as it (almost) never is.
@stweil Guess what! The original Image works, too, if I scaled it down 1:4!
Sorry, I didn't recognize the link is also off after a single download
-I'm uploading again right now.-
Sorry for the delay, but I ran into trouble with uploads from home. Now, from the office, it is way faster: 0046.zip
@stweil Strange indeed. I cropped only the header of the page and left left, right and bottom margins as they are and this version works fine. 0046-headless.zip
@zdenop Since the original image itself looks dark somehow, I called ImageMagick 6.9 to enhance contrast: convert 0046.tif -brightness-contrast 25x50 -compress none -colorspace Gray 0046-convert.tif
This version, without any treatment of the borders, makes Tesseract produce not an empty page but quite reasonable OCR output.
I expect that size of 0046-convert.tif is lower. Right?
@zdenop This is what exiftool outputs:
original: Megapixels 79.1, Image size 7477x10584, 79151624 bytes file size
convert: Megapixels 79.1, Image size 7477x10584, 79151434 bytes file size
So the file sizes differ only slightly.
@zdenop Did you also run Tesseract with the original file (without cropping or other types of preprocessing)? What was the outcome? 0046-convert.zip
Original finished quickly with "empty page" message.
The original page triggers bugs which can be shown by adding -c textord_debug_bugs=1. Tesseract creates boxes (bounding_box_) with a right margin which exceeds the image dimensions (error message "Made partition with bad right coord"). Those boxes are therefore disregarded. With the following hack the boxes are processed, and text is recognized:
diff --git a/src/textord/colpartition.cpp b/src/textord/colpartition.cpp
index 74f1b1d9..465a1f57 100644
--- a/src/textord/colpartition.cpp
+++ b/src/textord/colpartition.cpp
@@ -353,7 +353,7 @@ bool ColPartition::IsLegal() {
tprintf("Margins invalid\n");
Print();
}
- return false; // Margins invalid.
+// return false; // Margins invalid.
}
if (left_key_ > BoxLeftKey() || right_key_ < BoxRightKey()) {
if (textord_debug_bugs) {
I think that the right solution would be to find out why Tesseract creates bad bounding boxes and fix that. Maybe it would already help to enforce boxes with valid coordinates.
@stweil Many Thanks! By now I've detected already 200+ scans that are considered empty by Tesseract. Therefore I'll try your suggestion in our ULB-Fork and report back hopefully next week!
Please attach the image to this issue.
The image is rather large, too large to be attached. It's available here: https://ub-backup.bib.uni-mannheim.de/~stweil/tesseract/issues/3021/0046.png.
The bounding boxes with illegal coordinates come from rotation:
(gdb)
#1 0x00000000006311f7 in TBOX::rotate (this=0x60e0006ba830, vec=...) at ../../../src/ccstruct/rect.h:206
206 top_right.rotate (vec);
(gdb) p vec
$22 = (const FCOORD &) @0x7fffffff8ea0: {xcoord = 0.999990165, ycoord = -0.00443068426}
(gdb) p top_right
$23 = {xcoord = 7523, ycoord = 10551}
(gdb) p bot_left
$24 = {xcoord = 43, ycoord = 9671}
In this case vec indicates that there is nearly no rotation at all, but because of the very large value of ycoord the function ICOORD::rotate calculates a new xcoord which is clearly outside of the image. It looks like ICOORD::rotate might be wrong and need a better implementation.
The current code rotates the top right and bottom left corners about the fixed point (0,0). Maybe this should be changed to use the top left corner as the fixed point. For small coordinates that does not make a large difference, but here it is essential.
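To make the effect concrete, here is a minimal Python sketch using the values from the gdb session above. It assumes that ICOORD::rotate is the standard 2D rotation by a unit vector with rounding to the nearest integer; that formula is an assumption made for illustration, not copied from the Tesseract sources. The helper names are hypothetical.

```python
import math


def rotate_about_origin(x, y, vx, vy):
    """Rotate the integer point (x, y) by the unit vector (vx, vy) about (0, 0).

    Assumed formula: standard 2D rotation, rounded to the nearest integer.
    """
    new_x = math.floor(x * vx - y * vy + 0.5)
    new_y = math.floor(y * vx + x * vy + 0.5)
    return new_x, new_y


def rotate_about_pivot(x, y, vx, vy, px, py):
    """Rotate (x, y) about the pivot (px, py) instead of the origin."""
    rx, ry = rotate_about_origin(x - px, y - py, vx, vy)
    return rx + px, ry + py


# Values taken from the gdb output in this thread.
VEC = (0.999990165, -0.00443068426)
TOP_RIGHT = (7523, 10551)
BOT_LEFT = (43, 9671)
# Top left corner of the box, used as the alternative fixed point.
TOP_LEFT = (BOT_LEFT[0], TOP_RIGHT[1])

origin_result = rotate_about_origin(*TOP_RIGHT, *VEC)
pivot_result = rotate_about_pivot(*TOP_RIGHT, *VEC, *TOP_LEFT)

print("rotated about (0,0):     ", origin_result)
print("rotated about top left:  ", pivot_result)
print("x shift caused by origin:", origin_result[0] - TOP_RIGHT[0])
```

With the origin as fixed point, the large ycoord is multiplied by the small sine component and pushes xcoord to the right by several dozen pixels, even though the rotation angle is tiny. With the top left corner as fixed point, the x coordinate of the top right corner stays where it was.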
Another command that eliminated the issue:
gm convert 3021.png -bordercolor Black -border 10x10 3021-borderb10.png
It's also sufficient to convert the image to JPEG. The basic issue remains of course and can also result in less obvious problems, for example missing text from smaller parts of a page only. I'd expect that typically in the lower left and right parts of large pages. -c textord_debug_bugs=1 should be the default until that problem is fixed.
I now tried a modified TBOX::rotate. This not only fixes the empty page problem, but also seems to increase the amount of text which is detected at all, so it would be worth trying it on other pages, too. The bad news is that the time for processing a page increases from 56 seconds to 219 seconds. Here is the code:
diff --git a/src/ccstruct/rect.h b/src/ccstruct/rect.h
index 58a867e9..e487c8c1 100644
--- a/src/ccstruct/rect.h
+++ b/src/ccstruct/rect.h
@@ -202,9 +202,13 @@ class DLLSYM TBOX { // bounding box
// and top-right corners. Use rotate_large if you want to guarantee
// that all content is contained within the rotated box.
void rotate(const FCOORD& vec) { // by vector
- bot_left.rotate (vec);
+ ICOORD top_left(bot_left.x(), top_right.y());
+ bot_left -= top_left;
+ bot_left.rotate(vec);
+ bot_left += top_left;
+ top_right -= top_left;
top_right.rotate (vec);
- *this = TBOX (bot_left, top_right);
+ top_right += top_left;
}
// rotate_large constructs the containing bounding box of all 4
// corners after rotating them. It therefore guarantees that all
@M3ssman, we also get "Empty page" errors in our newspaper, see example.
https://github.com/stweil/tesseract/tree/fix contains a patch which seems to fix the problem. Maybe it also gets more texts from other large images, but I am still not sure. For images with large width and height, old and new code can get different results. It would help if you (and others) could try the new code and compare the results with the unpatched Tesseract. If the new code never makes things worse, we could apply it.
@stweil Sorry for the delay! I just took a quick shot at a single page and it did produce text lines, which is good per se, but forget about the quality. Tesseract is definitely not happy with this image.
I'll try to do some more testing, as it affects a remarkable number of images, and report back real soon™.
The fix + textord_debug_bugs=1 produces quite a lot of output. I captured it in a file; maybe you can get some insights.
tesseract-5.0.0-image-1681877805_J_0112_0068.log
Another thing that will make it work is binarization.
For one of the problematic images I got:
/data/ocr-staging/ocr/1667524704_J_0190/0655.tif => 1667524704_J_0190_0655 => /data/ocr-staging/ocr/empty-pages/1667524704_J_0190_0655
Tesseract Open Source OCR Engine v5.0.0-alpha-754-g0838 with Leptonica
Page 1
Detected 7102 diacritics
index >= 0 && index < line_count:Error:Assert failed:in file src/textord/makerow.cpp, line 802
./tesseract-empty-pages.sh: Zeile 34: 29848 Abgebrochen (Speicherabzug geschrieben) ${TESS_BIN} "$tiff_path" "${outpath}" --dpi 470 -l frk alto
I will skip this by now and move on.
With many other problem images, the patched Tesseract yields:
Error in boxClipToRectangle: box outside rectangle
Error in pixScanForForeground: invalid box
Error in boxClipToRectangle: box outside rectangle
Error in pixScanForForeground: invalid box
These error messages are produced by Leptonica.
They are triggered by a call to pixClipBoxToForeground()
https://github.com/tesseract-ocr/tesseract/search?q=pixClipBoxToForeground
I've run some larger tests with the patch @stweil provided, with the following results:
From 133 images
index >= 0 && index < line_count:Error:Assert failed:in file src/textord/makerow.cpp, line 802
).
I ran the 6 problematic pages once more (v4.1.1-rc2-25-g9707 from alex-p with --dpi 470 -l frk alto).
This time I got no assertion errors, but 2 pages with text lines and still 4 empty pages.
After enhancing brightness (+50) and contrast (between +15 and +25), these 4 pages were also processed without errors.
I'm uncertain how to deal with this. I don't think it's a good idea to silence warnings just to let bad material pass. Second, the assertion error occurs only in the patched version. This error seems to be really serious, since it even halts execution of my scripts. For now I take the blame myself, given that Tesseract produces text with more advanced preprocessing. Our processes now watch out for those inglorious 844 byte files, but we didn't have this on our agenda before.
@stweil @amitdo @zdenop I'm fine if you close this issue, but if you'd like to, I can provide more testdata.
The "empty page" message means that Tesseract dropped all text boxes because the internal checks decided that they had coordinates which are out of bounds. This might only be the extreme variant of a general problem: maybe Tesseract also drops parts of other pages where it recognizes text, but not all.
That's why it would be important to run OCR on a larger test set with -c textord_debug_bugs=1 to see whether pages with OCR text also show error messages and whether these error messages correspond to missing text boxes on such pages.
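Such a survey could be scripted roughly as follows. This is only a sketch: it assumes the tesseract binary is on PATH and that the debug messages are written to stderr, and the helper names count_markers and run_with_debug are hypothetical. The marker strings are the messages quoted in this thread.

```python
import subprocess
from collections import Counter

# Debug messages seen in this thread when textord_debug_bugs=1 is set.
MARKERS = ("Made partition with bad", "Margins invalid", "Empty page!!")


def count_markers(stderr_text):
    """Count how many lines of the captured output contain each marker."""
    counts = Counter()
    for line in stderr_text.splitlines():
        for marker in MARKERS:
            if marker in line:
                counts[marker] += 1
    return counts


def run_with_debug(image_path, lang="frk"):
    """Run tesseract on one image and summarize the layout debug messages.

    Assumption: debug messages go to stderr, OCR text goes to stdout ("-").
    """
    cmd = ["tesseract", image_path, "-", "-l", lang,
           "-c", "textord_debug_bugs=1"]
    result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    return count_markers(result.stderr)


# Offline demonstration with lines copied from a log in this thread.
sample_log = "\n".join([
    "Made partition with bad left coords, 0 > -8",
    "Margins invalid",
    "Empty page!!",
])
print(dict(count_markers(sample_log)))
```

Pages with a non-zero "Made partition with bad" count but without "Empty page!!" would be the candidates for silently dropped text boxes.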
@stweil I will run the patch with the 130+ images testset and report back early next week.
@stweil Sorry for the delay! Is the patched code in master branch already?
I'd like to put this issue to an end. By now (IMHO) there are 2 different problems that we're facing here:
1) Tesseract produces ALTO files that are missing page content. This is a problem for Tesseract users and apps that consume Tesseract's output.
2) Tesseract makes wrong decisions about data validity. This may also affect the general detection algorithm. This is a problem both for OCR engineers and for any downstream users or applications.
To deal with 1), I would appreciate it if Tesseract wrote no output at all and/or printed a warning to stdout. If these options are not worth the additional effort, please let me know. For now I'm checking the size of the ALTO XML. It works, but it feels like treating symptoms.
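As an alternative to checking the file size, the ALTO file itself can be inspected for text content. The following is a minimal sketch, not part of Tesseract: the function name count_alto_strings is hypothetical, it relies on the String element with its CONTENT attribute from the ALTO schema, and the two sample documents are hand-written minimal illustrations rather than real Tesseract output.

```python
import xml.etree.ElementTree as ET


def count_alto_strings(alto_xml):
    """Return the number of ALTO String elements with non-empty CONTENT.

    Tags are compared by local name so the check does not depend on the
    ALTO namespace version used in the file.
    """
    root = ET.fromstring(alto_xml)
    count = 0
    for element in root.iter():
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name == "String" and element.get("CONTENT", "").strip():
            count += 1
    return count


# Hand-written minimal samples (assumption: not actual Tesseract output).
EMPTY_PAGE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#">
  <Layout><Page><PrintSpace/></Page></Layout>
</alto>"""

PAGE_WITH_TEXT = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#">
  <Layout><Page><PrintSpace><TextBlock><TextLine>
    <String CONTENT="genommen"/><SP/><String CONTENT="werden."/>
  </TextLine></TextBlock></PrintSpace></Page></Layout>
</alto>"""

for name, doc in (("empty", EMPTY_PAGE), ("text", PAGE_WITH_TEXT)):
    print(name, count_alto_strings(doc))
```

A page would then be flagged as empty when the count is zero, independent of how large the surrounding XML boilerplate happens to be.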
Number 2 seems to be a really big issue that cannot be solved in total right now.
Thanks for all the investigations to @stweil, @zdenop and @amitdo! All your inspections lead (IMHO) to the category "data error", since tuning the image data (binarize, despeckle, etc.) improves Tesseract's analysis in all cases.
Therefore I consider this behavior not an intrinsic problem of Tesseract; it's the data.
With the code from #3418, when Sauvola binarization is used, I don't get "Empty page!!".
I just finished OCR with Tesseract 5.0.0 for a huge number of newspaper scans.
-c thresholding_method=2
.-c thresholding_method=1
(example). So using a different binarization helps in most cases, but not always.
Try to convert the jp2 to png. It does not fail for me with your example and method 2.
Thank you, that's interesting. I can reproduce it, and it seems to be related to the image resolution:
The original JP2 image has 300 dpi and fails:
tesseract 0312.jp2 - -l ubma/frak2021-09 -c thresholding_method=2 -c textord_debug_bugs=1
Made partition with bad left coords, 0 > -8
ColPart: (M0-B-8-B-8/-8,5017/5017)->(3599B-3599B-3607M/3599,5071/5071) w-ok=0, v-ok=0, type=0R1, fc=-1, lc=-1, boxes=1 ts=0 bs=0 ls=0 rs=0
Margins invalid
ColPart:E(M0-B-8-B-8/-8,5017/5017)->(3599B-3599B-3607M/3599,5071/5071) w-ok=0, v-ok=0, type=0T1, fc=-1, lc=-1, boxes=0 ts=0 bs=0 ls=0 rs=0
Empty page!!
Made partition with bad left coords, 0 > -8
ColPart: (M0-B-8-B-8/-8,5017/5017)->(3599B-3599B-3607M/3599,5071/5071) w-ok=0, v-ok=0, type=0R1, fc=-1, lc=-1, boxes=1 ts=0 bs=0 ls=0 rs=0
Margins invalid
ColPart:E(M0-B-8-B-8/-8,5017/5017)->(3599B-3599B-3607M/3599,5071/5071) w-ok=0, v-ok=0, type=0T1, fc=-1, lc=-1, boxes=0 ts=0 bs=0 ls=0 rs=0
Empty page!!
Converting the JP2 to PNG with convert removes the resolution information.
Tesseract therefore guesses a resolution of 367 dpi and can process the scan:
tesseract 0312.png - -l ubma/frak2021-09 -c thresholding_method=2 -c textord_debug_bugs=1
Estimating resolution as 367
Made partition with bad right coords, 556 < 577
ColPart: (M184-T204-B207/388,4146/4152)->(577B-1252T-556M/398,4194/4179) w-ok=1, v-ok=1, type=1T4, fc=-1, lc=-1, boxes=24 ts=0 bs=0 ls=0 rs=0
[...]
genommen werden. In Beantwortung verſchledener Anfragen erklärte Amts, Wirklichen Geheimen Rats Der nburg über koloniale J güch die vornebme Aufgabe bringt, ſich des Deutſchen Reiches alt
[...]
Processing the original JP2 with an explicit resolution works, too:
tesseract 0312.jp2 - -l ubma/frak2021-09 -c thresholding_method=2 -c textord_debug_bugs=1 --dpi 400
Made partition with bad right coords, 1232 < 1243
[...]
Zenommen werden. In Beantwortung verſchledener Anfragen erklärte i ĩ ü i
[...]
> Thank you, that's interesting. I can reproduce it, and it seems to be related to the image resolution:
> The original JP2 image has 300 dpi and fails:
Is it a JPX with mask layer like this https://archive.org/details/bub_gb_qmZyOar8UHwC/page/n71/mode/2up ?
Then try the mask and negate.
CER 14.23 % is not so bad for the quality of the scan.
Where did you get CER 14.23 %?
@stweil, GIMP reports '72 ppi' for your jp2, but as you said Tesseract sees it as 300 ppi. IIRC, when GIMP does not find the ppi in the image metadata, it is reported as 72 ppi.
> Where did you get CER 14.23 %?
Good question ;-) On logical page 47 of Galileo's book.
My comment was meant as: If your jp2 has a mask layer, as jp2 allows many kinds of compressions, then try the mask layer.
The book exists on archive.org in two versions, scanned from two different specimens in different bad conditions:
$ pdfimages -f 69 -l 69 -list bub_gb_7sFnWGI31XcC.pdf
page num type width height color comp bpc enc interp object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
69 0 image 600 834 rgb 3 8 jpx no 338 0 200 201 4818B 0.3%
69 1 image 1800 2500 rgb 3 8 jpx no 339 0 600 600 13.7K 0.1%
69 2 mask 1800 2500 - 1 1 jpx no 339 0 600 600 13.7K 2.5% <-- the mask
$ pdfimages -f 69 -l 69 bub_gb_7sFnWGI31XcC.pdf bub_gb_7sFnWGI31XcC.p0069
$ ls -la bub_gb_7sFnWGI31XcC.p0069*
bub_gb_7sFnWGI31XcC.p0069-000.ppm
bub_gb_7sFnWGI31XcC.p0069-001.ppm
bub_gb_7sFnWGI31XcC.p0069-002.pbm <-- the mask
$ convert bub_gb_7sFnWGI31XcC.p0069-002.pbm -negate -density 600x600 -units PixelsPerInch bub_gb_7sFnWGI31XcC.p0069-002.pos.tiff
$ tesseract bub_gb_7sFnWGI31XcC.p0069-002.pos.tiff ...
If I recorded correctly (should write a script for permutations and recording them):
Latin = -l lat
GT4 = -l GT4Hist
ubma = -l ubma/frak2021_0.905_1587027_9141630
CER variant
0.0567 bub_gb_7sFnWGI31XcC.p0069-002.pos.nopsm.ubma.txt
0.0841 bub_gb_7sFnWGI31XcC.p0069-002.pos.nopsm.GT4.txt
0.1227 bub_gb_7sFnWGI31XcC.p0069-002.pos.nopsm.Latin.txt
0.1618 bub_gb_qmZyOar8UHwC.p0058-002.pos.nopsm.ubma.txt
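For readers unfamiliar with the metric: CER (character error rate) is commonly defined as the character-level edit distance between the OCR output and the ground truth, divided by the length of the ground truth. The sketch below illustrates that common definition only; which tool and which normalization produced the numbers above is not stated in this thread, and the function names are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance (insert, delete, substitute) between strings a and b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution or match
            ))
        previous = current
    return previous[-1]


def cer(reference, hypothesis):
    """Character error rate of hypothesis relative to the reference text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


# Two words quoted from the OCR output earlier in this thread.
print(cer("genommen", "Zenommen"))
```

A value such as 0.0567 in the table would then correspond to roughly 5.7 wrong characters per 100 ground truth characters.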
> GIMP reports '72 ppi'
Obviously GIMP ignores the EXIF metadata. GIMP has a menu entry which shows the metadata and also the EXIF part with x/y resolutions of 300 and the resolution unit "inch". exiftool shows resolutions of 118.1102 and the resolution unit "cm", which gives the same DPI value of 300 (118.1102 * 2.54).
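The unit conversion can be checked in a couple of lines. This is plain arithmetic; the function name is made up for illustration.

```python
CM_PER_INCH = 2.54


def pixels_per_cm_to_dpi(pixels_per_cm):
    """Convert a resolution given in pixels per centimetre to dots per inch."""
    return pixels_per_cm * CM_PER_INCH


dpi = pixels_per_cm_to_dpi(118.1102)
print(round(dpi))
```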
AFAIK 72 ppi is the default in some image programs. In GIMP it is, AFAIR, the default only in the GUI under Image -> change resolution.
EXIF is the wrong place to specify ppi. convert ... -density 600x600 -units PixelsPerInch ... is reliable, but not all image formats can store it.
try this code @M3ssman

from PIL import Image, ImageEnhance

# Raw strings keep the backslashes in the Windows paths literal.
im = Image.open(r"C:\Users\user\Documents\Lightshot\stry5.png")

# Sharpen first, then brighten; a factor of 1.0 leaves the image unchanged.
im = ImageEnhance.Sharpness(im).enhance(2)
im = ImageEnhance.Brightness(im).enhance(3)

im.show()
# PNG is lossless, so no quality setting is needed when saving.
im.save(r"C:\Users\user\Documents\Lightshot\stry7.png")
it finds blobs for all characters
Environment
frk, Fraktur (from tessdata_best), gt4hist_5000k (gt4hist model with 5000k iterations)
Current Behavior:
When using rather large uncompressed TIF files (ca. 80 MB) from the project "Digitalisierung historischer deutscher Zeitschriften", for about 5 pages (or even fewer) out of 1000 images we get ALTO files missing valid OCR data.
When run with
tesseract 0046.tif 0046 -l frk alto
it only reports "Empty page!!" and exits in < 20 seconds.
0046-alto.zip 0046-tif.zip
Generated ALTO file and TIF image included.
Expected Behavior:
Produce ALTO-XML with contents.
Suggested Fix:
No idea.