tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

binarization parameters defaults #3707

Open bertsky opened 2 years ago

bertsky commented 2 years ago

Hi. I noticed this new feature two days ago, and it seemed like a cool way to fix issues we have with, for example, PowerPoint slides that have white-on-dark text on one side and dark-on-white text on the other. With method 0, one side gets the short end of the stick and disappears completely. Both method 1 and method 2 work fine there.

But I also noticed a lot of new random characters for scans, especially those with a greyish/noisy background. (I can probably filter these out using the confidence supplied via hOCR.) I stored the internal images for these files. Method 2 doesn't look much different from 0, but it leaves more grain, which explains the random characters. Method 1 goes completely bonkers: the whole image is filled with grain artefacts. Depending on the DPI there is more or less of it, but always plenty.

Setting thresholding_smooth_kernel_size to 1 fixed this problem. With the right DPI it actually produces the best-looking binary image. I strongly recommend setting this switch to something other than 0 by default.

Update: While this smoothing fixed the grain issue, it introduces other problems, namely destroying OCR for very small font sizes. I am using the ALFA Waffenkatalog from archive.org since it contains densely packed text, and the first 70 pages have been OCRed with ABBYY, which used some tricky compression, so I can test my PDF rendering, OCR and PDF text layout extraction all in one go :). Since I render PDFs for OCR at 300 DPI, the dense text is very small (such small-text documents should probably be rendered at 600 DPI, but I only know that after the OCR). Method 0 never had a problem with this. Method 1 might actually be even better, but if smoothing is enabled, it loses many of the small characters. Bummer.

Originally posted by @gunnar-ifp in https://github.com/tesseract-ocr/tesseract/issues/3083#issuecomment-1004115008

bertsky commented 2 years ago

Hi @gunnar-ifp, thanks for sharing your experience and thoughts – I have opened this as a separate issue (about empirical evidence and parameterization) to avoid bloating the original one (which was about architecture and workflow).

In general, it would really help if you shared or referenced an example picture (as mentioned in the contributor guidelines) – input and whatever output or intermediate images you have.

While Tesseract's OCR can deal with inverse colours (tessedit_do_invert), its OLR can do so only in a very limited way. So I am not sure this even is a binarization issue at all. But of course, binarization could be made more powerful to avoid modifying the OLR (which is already quite complex and hard to change).

If we do accept this as a binarization problem though, we must bear in mind that Tesseract must optimize for an extremely wide range of images at the same time. Not only is there no best parameter set for all at once, but even finding good compromises is hard, since we do not have a GT test set (let alone a representative one). The current choices are more reflective of previous Tesseract and Leptonica default behaviour (and thus, "backwards compatibility") than a true empirical search. (See here for a recent discussion.)

If the binarized results look grainy with methods 1 or 2, then it's more likely a problem of the window size than anything else IMO. And since the default for the latter is tied to the DPI metadata, please check that these are correct first. (Especially if you have multiple inconsistent tags, like PNG headers plus EXIF etc.) Or use the --dpi / -c user_defined_dpi= override.
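
For checking that, here is a minimal sketch (assuming Pillow is available; the file name is just a placeholder) that prints the DPI tags an image actually carries, so you can spot inconsistent PNG/JPEG headers vs. EXIF before relying on them:

```python
from PIL import Image

with Image.open("page.png") as img:
    print("format:", img.format, "size:", img.size)
    # PNG pHYs / JPEG JFIF density end up under "dpi" in Pillow's info dict
    print("header DPI:", img.info.get("dpi"))
    # EXIF tags 282/283 are XResolution/YResolution (interpreted per ResolutionUnit)
    exif = img.getexif()
    print("EXIF resolution:", exif.get(282), exif.get(283))
```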

Smoothing is only a stop-gap countermeasure for that. Also, it's easy to destroy actual text foreground with that (as you have noticed in your edit), and hard to find a general compromise. That's why it is disabled by default. (Like I said, we would need to make systematic experiments on representative data.)

Text lines should have a height of at least 30 px or so for good quality OCR with Tesseract's stock models IMO. That means at least 300 DPI for 7pt fonts, or 220 DPI for 10pt or 180 DPI for 12pt – roughly.
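
To make that rule of thumb concrete (a quick sketch of the arithmetic, nothing Tesseract-specific): nominal text height in pixels is roughly the font size in points divided by 72 and multiplied by the DPI.

```python
def text_height_px(font_pt: float, dpi: float) -> float:
    # 1 pt = 1/72 inch, so height in pixels = pt / 72 * dpi
    return font_pt / 72.0 * dpi

for pt, dpi in [(7, 300), (10, 220), (12, 180)]:
    print(f"{pt}pt at {dpi} DPI ~ {text_height_px(pt, dpi):.0f} px")
# 7pt/300dpi ~ 29 px, 10pt/220dpi ~ 31 px, 12pt/180dpi ~ 30 px
```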

gunnar-ifp commented 2 years ago

I played a bit more with this feature today :)

The raw PNG of the problematic file is 2 MB even with optipng, so I made a smaller JPG out of it; it still exhibits the same symptoms. I also added the slide. I use these as input and then dump the internal image with -c tessedit_write_images=1.
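
For reference, a sketch of that dump step (my assumptions: tesseract 5.x on the PATH, and that the internal image is written as tessinput.tif in the working directory):

```python
import subprocess

# Run tesseract so it writes out the internally thresholded image it actually OCRs.
subprocess.run(
    ["tesseract", "vi_raw.jpg", "out",
     "--dpi", "300",                      # be explicit instead of relying on metadata
     "-c", "tessedit_write_images=1",     # dump the binarized image (tessinput.tif)
     "-c", "thresholding_method=2"],      # 0 = legacy Otsu, 1 = adaptive Otsu, 2 = Sauvola
    check=True,
)
```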

Attachments: slide, vi_raw, vi_contrast

For the slide: it easily demonstrates the benefits of the two new methods. I noticed that thresholding_tile_size=3 (i.e. ~10 times the default) significantly changes the internal image: lines get thicker.

For the grainy image (300 DPI): I played with the thresholding score fraction, and yes, 0.1 is a good value. Going higher doesn't help: it removes noise but makes the text itself noisy.

I increased the tile size; at 1.4 the noise disappeared (at 300 DPI). But the thing is that if this image has its DPI metadata removed and is fed into tesseract, the binarizer works at 72 DPI (i.e. the tile size would have to be 300/72 times higher, i.e. 5.8). Tesseract then detects 207 DPI. So from a user perspective this looks like a chicken-and-egg problem. Smoothing helps, but alas destroys small characters.

As a third file I added a contrast-enhanced version of the grainy picture. I actually feed this into tesseract after rendering the PDF (I only have this active in the PDF code branch; for plain images I use tesseract as-is). It's a very simple algorithm that tries to auto-enhance contrast by counting the levels in the grey-scale picture. The result is much better, though distracting noise remains that leads to spurious characters. There is actually less noise if 72 DPI is assumed (i.e. very small tiles, which do not work on the grainy picture), but with a tile size of 1 the 300 DPI result is very good (i.e. big tiles, which also help with grainy pictures).
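
Roughly along these lines (a simplified Python sketch of the idea, not my actual code; the percentile cutoffs and file names are made up):

```python
import numpy as np
from PIL import Image

gray = np.asarray(Image.open("vi_raw.jpg").convert("L"), dtype=np.float32)
# Collect the gray levels and stretch the 2nd..98th percentile range to 0..255.
lo, hi = np.percentile(gray, (2, 98))
stretched = np.clip((gray - lo) / max(hi - lo, 1.0) * 255.0, 0, 255)
Image.fromarray(stretched.astype(np.uint8)).save("vi_contrast.png")
```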

So my takeaway so far is that a bigger tile size helps against grain, auto contrast helps a lot, and the tile size should not be DPI-dependent (which kind of works for scans in PDFs that are rendered to images, and is the opposite of your findings; but noise tends to be less resolution-dependent than the content, so that is probably just my view).

wollmers commented 2 years ago

@gunnar-ifp You can set the DPI with a heuristic: for A4, (210 mm width / 25.4 mm per inch) * 300 DPI ≈ 2480 pixels wide. If the image is in that range, just set the DPI with ImageMagick. If it is significantly smaller, upsample the image (ImageMagick has good interpolation). If a page was really scanned at 96 DPI or lower, you will get high error rates.
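
In code, the heuristic looks roughly like this (a sketch; the example widths and the upsampling cutoff are assumptions):

```python
A4_WIDTH_MM = 210.0
MM_PER_INCH = 25.4

def estimated_dpi(width_px: int) -> float:
    # An A4 page is 210 / 25.4 ~ 8.27 inches wide.
    return width_px / (A4_WIDTH_MM / MM_PER_INCH)

for width in (2480, 1654, 794):
    print(f"{width} px wide -> ~{estimated_dpi(width):.0f} DPI")
# 2480 px -> ~300 DPI, 1654 px -> ~200 DPI, 794 px -> ~96 DPI;
# below roughly 150 DPI, upsample before OCR rather than just tagging the DPI.
```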

bertsky commented 2 years ago

@gunnar-ifp

I increased the tile size; at 1.4 the noise disappeared (at 300 DPI). But the thing is that if this image has its DPI metadata removed and is fed into tesseract, the binarizer works at 72 DPI (i.e. the tile size would have to be 300/72 times higher, i.e. 5.8). Tesseract then detects 207 DPI. So from a user perspective this looks like a chicken-and-egg problem.

I don't know what you mean by ... the tile size would have to be .... And like I said, the most relevant parameter here is window size, not tile size or smoothing. (The reason is that in locally adaptive thresholding, a window around each pixel is used to collect a local histogram; if no actual text foreground is seen because the window is too small, the statistics are off, and background noise can be amplified into foreground.)

At 72 DPI, the default thresholding_window_size=0.33 will make the problem even more severe, because the windows will be about 4 times smaller. You should rather go in the other direction (not by removing DPI metadata or setting the input DPI to a wrong value, but by directly increasing the window size).
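
To put numbers on that (a quick sketch; per the parameter descriptions, thresholding_window_size and thresholding_tile_size are fractions of an inch that get multiplied by the image DPI):

```python
def size_px(inch_fraction: float, dpi: int) -> int:
    # effective window/tile size in pixels = fraction-of-an-inch * DPI
    return round(inch_fraction * dpi)

for dpi in (72, 207, 300):
    print(f"{dpi:3d} DPI: 0.33 -> {size_px(0.33, dpi)} px")
# 72 DPI -> ~24 px, 207 DPI -> ~68 px, 300 DPI -> ~99 px: at 72 DPI the local
# window may contain no real text at all, so background grain wins the histogram.
```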

Your image is quite noisy. Tesseract's internal preprocessing does not do contrast normalization or raw denoising yet. If you don't want to do that externally, then try to compensate with larger window_size, or larger kfactor for Sauvola or larger score_fraction for Otsu (yes, text fg might become fractured, but the OCR can usually cope with that better than with heavy noise).
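
For example (a hedged sketch; the values are purely illustrative, not recommendations):

```python
import subprocess

subprocess.run(
    ["tesseract", "vi_raw.jpg", "out", "--dpi", "300",
     "-c", "thresholding_method=2",          # Sauvola
     "-c", "thresholding_window_size=0.66",  # wider local window than the 0.33 default
     "-c", "thresholding_kfactor=0.5",       # more aggressive threshold than the default
     # for the adaptive Otsu (method 1), the analogous knobs are
     # thresholding_tile_size and thresholding_score_fraction
     "hocr"],
    check=True,
)
```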

Another thing you might want to try for images like this is external cropping – the black border might negatively influence the adaptive thresholding.
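
A rough sketch of such external cropping (assumptions: Pillow/NumPy, a simple brightness cutoff of 60, and a dark border around a lighter page):

```python
import numpy as np
from PIL import Image

img = Image.open("vi_raw.jpg").convert("L")
arr = np.asarray(img)
mask = arr > 60                                    # pixels that are not "black border"
rows, cols = np.any(mask, axis=1), np.any(mask, axis=0)
top, bottom = np.argmax(rows), len(rows) - np.argmax(rows[::-1])
left, right = np.argmax(cols), len(cols) - np.argmax(cols[::-1])
img.crop((left, top, right, bottom)).save("vi_cropped.png")   # (left, upper, right, lower)
```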

amitdo commented 2 years ago

@bertsky, thresholding_window_size is used for Sauvola, while thresholding_tile_size is used for Otsu. I could have dropped tile_size and reused window_size, but decided not to do that.

gunnar-ifp commented 2 years ago

@bertsky, I know about improving the image before feeding it to tesseract. This was more about the old Otsu working fine where the new Otsu fails with its default settings on grainy images, and also about how the new methods leave more "grain" that creates false characters. I simply tried settings that could mitigate the problem. Smoothing is not really an option. If score_fraction's grainy foreground result is not a problem, I might try that once more. The remaining issue is that the tile size (which works well) is DPI-dependent, which creates a problem:

If I render something to an image, I probably know the DPI and tell tesseract; I also know that the tile size will be multiplied by it, so I can adjust it. If I have an image and I know its resolution, the same applies. But if I have an image and, as a user, simply want to feed it into tesseract, it will either use the DPI stored in the image or estimate the DPI. I would have to open and parse the file myself to find out the DPI in order to adjust the tile size; if there is none, one has to assume 72 DPI for the tile calculations. Basically, adjusting the tile size only works if you know there is a DPI in the input (and which one tesseract will use should there be conflicting metadata), or know that there is no DPI stored in the input. Also, grain might stay similar at different scan resolutions, so the tile size needed to get rid of it might be resolution-independent (i.e. a few hundred pixels or so). Maybe tile size is a bad way to get rid of the grain (though large tiles do that just fine).

Sauvola hasn't had any problems at all with the grainy image; it simply leaves a bit more visible grain, which produces phantom characters.

amitdo commented 2 years ago

With the old Otsu, the tile size is the whole image and smoothing is zero.

gunnar-ifp commented 2 years ago

I have come to the conclusion that a larger tile size (i.e. even the whole image) does not help, since it removes characters from the image.

I attached various processed images of the grainy and contrast-enhanced pictures for all 3 methods, plus extra tile sizes for method 1, both at 300 DPI and at 72 DPI (emulating what would happen to an image without DPI information): vi_processed.zip. It's clearly visible that method 1 cannot be used with grainy images. Here is the worst case: vi_raw processed at 72 DPI with method 1.

bertsky commented 2 years ago

@amitdo

thresholding_window_size is used for Sauvola, while thresholding_tile_size is used for Otsu. I could have dropped tile_size and reused window_size, but decided not to do that.

oh, sorry, I forgot about that (again). Thanks for pointing that out.

@gunnar-ifp that means you should read my previous comments with window size (for Sauvola) equated to tile size (for Otsu).

This was more about the old Otsu working fine where the new Otsu fails with its default settings on grainy images, and also about how the new methods leave more "grain" that creates false characters.

I see. Well, the new methods were not primarily intended for that scenario (but other typical scan/photo challenges like varying brightness or even shadows across the image, varying ink penetration across the image, show-through etc). However, with a suitable (indeed non-default) window/tile size, they should also still cope with noise/grain as well as the old ones. EDIT: but apparently adaptive Otsu does not.

The remaining issue is that the tile size (which works well) is DPI-dependent, which creates a problem:

If I render something to an image, I probably know the DPI and tell tesseract; I also know that the tile size will be multiplied by it, so I can adjust it. If I have an image and I know its resolution, the same applies. But if I have an image and, as a user, simply want to feed it into tesseract, it will either use the DPI stored in the image or estimate the DPI. I would have to open and parse the file myself to find out the DPI in order to adjust the tile size; if there is none, one has to assume 72 DPI for the tile calculations.

Yes, that's inconvenient, but inevitable IMO: Tesseract's DPI plausibility check/estimation depends on textline segmentation, which in turn depends on binarization – so if we want binarization to utilize DPI estimates, we will have a chicken-and-egg problem.

Thus, if you have bad or no DPI metadata, the new binarization methods will likely need manual parameter tweaking – either by DPI overrides or by window/tile size settings.

Basically, adjusting the tile size only works if you know there is a DPI in the input (and which one tesseract will use should there be conflicting metadata), or know that there is no DPI stored in the input.

Well, if you don't even know in advance whether you have DPI or not, or whether it's correct or not, then indeed it will be hard to find good parameters. I'd recommend either using the old binarizer in that case, or going fully customized by preprocessing the data externally and scripting the Tesseract calls.

Also, grain might stay similar at different scan resolutions, so the tile size needed to get rid of it might be resolution-independent (i.e. a few hundred pixels or so). Maybe tile size is a bad way to get rid of the grain (though large tiles do that just fine).

I agree, but again, locally adaptive binarization tries to solve a different problem. (For camera noise or paper grain you might want to try raw denoising externally.)

Perhaps we should offer options for contrast normalization and raw denoising prior to binarization, and then binary denoising posterior in Tesseract (via Leptonica)?
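
As a crude stand-in for what that could look like (a sketch using Pillow only; the kernel sizes and the tessinput.tif path are assumptions): median-filter the raw grayscale before binarization, and open the binary result afterwards to drop isolated black specks.

```python
from PIL import Image, ImageFilter

# Raw denoising before binarization: a small median filter removes salt-and-pepper grain.
gray = Image.open("vi_raw.jpg").convert("L")
gray.filter(ImageFilter.MedianFilter(size=3)).save("vi_denoised.png")

# Binary despeckling afterwards: MaxFilter erodes small dark specks, MinFilter restores strokes.
binary = Image.open("tessinput.tif").convert("L")
binary.filter(ImageFilter.MaxFilter(3)).filter(ImageFilter.MinFilter(3)).save("vi_despeckled.png")
```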

I have come to the conclusion that a larger tile size (i.e. even the whole image) does not help, since it removes characters from the image. I attached various processed images of the grainy and contrast-enhanced pictures for all 3 methods, plus extra tile sizes for method 1, both at 300 DPI and at 72 DPI (emulating what would happen to an image without DPI information): vi_processed.zip. It's clearly visible that method 1 cannot be used with grainy images.

Indeed, that's a severe problem – thanks for compiling the images!

It's not too surprising though: In adaptive Otsu's "all-pixels-in-tile" approach the shallow fg in a tile can disappear just as the noisy bg can survive. In contrast, Sauvola's "pixel-vs-vicinity" approach is truly localized.
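
For reference, the standard Sauvola threshold is computed per pixel from the mean m and standard deviation s of its local window, with k corresponding to thresholding_kfactor and R the dynamic range of the standard deviation (typically 128 for 8-bit images):

```latex
T(x,y) = m(x,y)\,\left(1 + k\left(\frac{s(x,y)}{R} - 1\right)\right)
```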

I would also say that we could increase the default window/tile sizes from 0.33 to 0.66 or even 1.0 (i.e. one inch worth of page). In the original discussion about defaults we slowly moved away from Leptonica's window size for Sauvola (which is even smaller and expressed in absolute pixels) towards a larger, DPI-relative value (but still conservative w.r.t. the literature), and then parameterized the adaptive Otsu in a similar fashion (it had no default in Leptonica, which uses the number of splits of the total image for parameterization).