tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

RFC: allow flexible or better binarization #3083

Open bertsky opened 4 years ago

bertsky commented 4 years ago

Tesseract has always included its own, internal binarization – which is not based on Leptonica and is of rather bad quality (custom global Otsu implementation without normalization). Leptonica does have lots of nice adaptive local normalization and thresholding implementations, but they are not utilized.

Since Tesseract 4.0, recognition does not (under normal circumstances) use that binarized image, but uses the greyscale-converted raw image. If the input image was already bitonal, then it still works. However, what the LSTM model expects depends on what data it was trained on. (Original tessdata models were all trained on artificial/clean greyscale IIRC.)

But the binarized image is still needed for all segmentation and layout analysis (OSD, separator detection, picture detection). And to make matters worse, segmentation also makes use of the greyscale image and threshold values from the binarizer – in order to get a better approximation/interpolation of blob outlines (ComputeEdgeOffsets). If the input image was already bitonal, then a fallback method is used which is not as accurate (ComputeBinaryOffsets).

Now the user (both CLI and API) is in a dilemma:

  1. Present the original/raw image:
    • Segmentation on some images may be suboptimal, because good binarization is hard.
    • Recognition might expect a (colour/contrast-) normalized or even binarized (e.g. from tesstrain) image and thus be suboptimal.
  2. Present an externally binarized image:
    • Segmentation on other images may be suboptimal, because the blob outlines are inaccurate.
    • Recognition might expect a greyscale image and thus be suboptimal.

So what do we do? Allow delegating to Leptonica's pixContrastNorm, pixSauvolaBinarizeTiled etc. methods via parameter variable? Or extend the API to allow passing the threshold values of an external binarization? How do we inform/document this and encapsulate an LSTM model's expectations?

There are many more aspects, but I just wanted to open the discussion.

If requested, I can provide example images (of bad segmentation due to internal Otsu or bad segmentation due to inaccurate interpolation from bitonal input; of recognition perplexed by background masked to white), as well as pointers to code.

wrznr commented 4 years ago

@bertsky Many thanks for bringing up this important issue. It might indeed be helpful to illustrate the matter with sample images (in order to get the wave of comments rolling).

amitdo commented 4 years ago

Thanks for opening this RFC. Very nice analysis.

I will add my comments later.

amitdo commented 4 years ago

Regarding ComputeEdgeOffsets, I assume the more accurate outlines only matter for the legacy engine and not for the nn engine. Correct? The nn engine does not act on segmented glyphs like the legacy engine does.

amitdo commented 4 years ago

Segmentation on some images may be suboptimal, because good binarization is hard.

Image binarization based on a neural net might solve it. With regular 'dumb' methods, you can use a brute-force approach. Any other ideas?

amitdo commented 4 years ago

Recognition might expect a (colour/contrast-) normalized or even binarized (e.g. from tesstrain) image and thus be suboptimal.

In the short term we should document somewhere what kind of images our models were trained on. Currently, it's grayscale (with 0 and 255 as values). In the long term, we should make it easy to query the model for this information via the API.

MerlijnWajer commented 3 years ago

(Not a core dev, but attempting to become a core user and help out some with dev)

So what do we do? Allow delegating to Leptonica's pixContrastNorm, pixSauvolaBinarizeTiled etc. methods via parameter variable? Or extend the API to allow passing the threshold values of an external binarization? How do we inform/document this and encapsulate an LSTM model's expectations?

Perhaps the API could be passed a callback function that takes the to-be-thresholded image and returns the thresholded result, rather than passing values in via the API (but perhaps that's what you meant). This could then be utilised by the command-line frontend to potentially even invoke external scripts/executables from within that function (not great for performance, but flexible).

Allowing various (predefined) thresholding methods could also be done through some plugin structure, perhaps?
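
To make that concrete, a rough sketch of what such a hook could look like on the API side. SetThresholder and its signature are purely hypothetical (no such call exists); the Leptonica call in the commented usage is just one possible delegate:

#include <leptonica/allheaders.h>   // Pix
#include <functional>

// Hypothetical hook: the caller supplies a function mapping the grayscale page
// to a 1-bpp image, which Tesseract would use instead of its internal Otsu
// wherever it needs the bitonal image.
using Thresholder = std::function<Pix*(Pix* grayscale_page)>;

// Example of what a caller might register, delegating to Leptonica's Sauvola:
// api.SetThresholder([](Pix* gray) {
//   Pix* binary = nullptr;
//   pixSauvolaBinarizeTiled(gray, 25, 0.40f, 1, 1, nullptr, &binary);
//   return binary;
// });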

amitdo commented 3 years ago

https://groups.google.com/g/tesseract-ocr/c/bNk9lYa-xmw

Robyer commented 3 years ago

What about adding a new method to provide both the "original" and "binarized" images? That way the client can use its own binarization method, while different parts of Tesseract can still use the specific image variant they need.
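
Purely for illustration, such an extension could look roughly like this; SetBinaryImage is hypothetical (it does not exist in the current TessBaseAPI, while Init and SetImage do), and gray_pix/binary_pix stand for Leptonica Pix* images the client has already prepared:

tesseract::TessBaseAPI api;
api.Init(nullptr, "eng");
api.SetImage(gray_pix);          // existing call: raw/grayscale image, fed to
                                 // the LSTM recognizer
api.SetBinaryImage(binary_pix);  // hypothetical call: externally binarized
                                 // 1-bpp image, used for layout analysis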

amitdo commented 3 years ago

@bertsky, @stweil

Suggested plan:

amitdo commented 3 years ago

5.0 now has two alternative binarization options: adaptive Otsu and tiled Sauvola.

Both use Leptonica.

Usage: tesseract in.png out -c thresholding_method=2

This will use Sauvola; 1 will use adaptive Otsu. Currently, the default is 0, which uses Tesseract's legacy (non-adaptive) Otsu.

MerlijnWajer commented 3 years ago

Awesome work @amitdo! (Sorry, I missed the PR)

I'll give this a test and see how it fares. Would it be worth attempting an experiment to see whether a certain binarisation algorithm (always?) yields better results from a quality point of view? I could also benchmark the speed at the same time, perhaps.

By the way: what does "Tiled Sauvola" mean? Is that an implementation detail, or is that somehow different from "normal" Sauvola? (Leptonica docs seem to suggest it's a way to deal with large images, but it's not clear to me if that somehow affects the results)

amitdo commented 3 years ago

Hi Merlijn,

Tiled means it splits the image into tiles (currently with a fixed size of 300x300), does the thresholding separately for each tile, and in the final stage combines the thresholded tiles. This should minimize memory consumption.

https://github.com/DanBloomberg/leptonica/blob/08fa22d6e83f/src/binarize.c#L447
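
For reference, a minimal sketch of calling that function directly, assuming an 8-bpp grayscale input without a colormap; the nx/ny tile counts below just approximate the 300x300 tiling mentioned above and are not necessarily what Tesseract's own code computes:

#include <algorithm>
#include <leptonica/allheaders.h>

Pix* SauvolaTiled(Pix* gray) {
  l_int32 w = pixGetWidth(gray), h = pixGetHeight(gray);
  l_int32 nx = std::max<l_int32>(1, w / 300);   // number of tiles horizontally
  l_int32 ny = std::max<l_int32>(1, h / 300);   // number of tiles vertically
  Pix* thresholds = nullptr;                    // 8-bpp image of threshold values
  Pix* binary = nullptr;                        // 1-bpp result
  pixSauvolaBinarizeTiled(gray, 25 /* window half-width */, 0.40f /* k */,
                          nx, ny, &thresholds, &binary);
  pixDestroy(&thresholds);                      // only the bitonal image kept here
  return binary;
}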

OCR quality is obviously very important, but speed and memory consumption are also important factors.

Please also benchmark the new adaptive Otsu against the legacy Otsu in terms of OCR quality, speed and memory consumption. I hope we can get rid of the legacy Otsu (it uses custom code, not Leptonica).

mdecerbo commented 3 years ago

@MerlijnWajer Dan Bloomberg writes: The Sauvola method for local binarization does quite well, and we implement it with tiling for efficiency. 64 bit floating point arrays [...] are expensive for large images. Consequently, we give a tiled version. This gives the identical results as the non-tiled method, but only requires accumulator arrays to be in memory for each tile separately.

@amitdo thanks for adding Sauvola binarization to tesseract!

Something closely related: the fourth method in leptonica's prog/binarize_set.c does contrast normalization (pixContrastNorm(NULL, pixg, 20, 20, 130, 2, 2);) before calling pixSauvolaBinarizeTiled(). I've had very good results with a modification to Tesseract that copies that.

If it would be in scope for Tesseract to optionally allow contrast normalization before Sauvola binarization, what should the interface look like? I don't mind working on a patch so I can get rid of my locally hacked version.
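
For context, the sequence from prog/binarize_set.c that this refers to boils down to roughly the following; the pixContrastNorm parameters are the ones quoted above, the Sauvola window/k values are just the defaults discussed in this thread, and none of this is an existing Tesseract option:

#include <leptonica/allheaders.h>

Pix* ContrastNormThenSauvola(Pix* gray) {
  // stretch local contrast first, then binarize adaptively
  Pix* norm = pixContrastNorm(nullptr, gray, 20, 20, 130, 2, 2);
  Pix* binary = nullptr;
  pixSauvolaBinarizeTiled(norm, 25, 0.40f, 1, 1, nullptr, &binary);
  pixDestroy(&norm);
  return binary;
}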

MerlijnWajer commented 3 years ago

I've spent a few days adding support (to our cluster that OCRs a few million pages every day with Tesseract) to capture the maximum memory used by Tesseract and also log the time for different binarisation algorithms. I hope to report back this week with data on memory usage and speed on a decent sample size.

One other thing I was wondering about reading @amitdo 's comment here: https://github.com/tesseract-ocr/tesseract/issues/3433#issuecomment-843870876 -- it might be worth checking out whether other Sauvola parameters perform better/differently for script detection. We seem to currently use a window size of 25 and a factor of 0.4. I don't know if the window size should perhaps be dependent on the DPI of the document, and if a different factor (I use 0.3 in another project) performs better.

amitdo commented 3 years ago

Merlijn,

The parameters 25 and 0.4 are what Leptonica uses in a demo program. I can expose them as Tesseract parameters and/or try to set the window size relative to the DPI. I don't want to submit it to this repo until it is proven useful. Will you test it if I do it in my own fork on GitHub?

In general, the DPI info is not reliable. Very often its value is 0 or a wrong low value.
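
As a strawman for "window size related to the DPI": the 0.33 factor, the 16 px floor and the 300 DPI fallback below are illustrative choices only, not settled defaults:

#include <algorithm>

int SauvolaWindowForDpi(int dpi) {
  if (dpi <= 70) dpi = 300;                            // missing/implausible DPI
  return std::max(16, static_cast<int>(0.33 * dpi));   // ~100 px at 300 DPI
}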

MerlijnWajer commented 3 years ago

@amitdo - I'd be happy to try it out. I am not sure if the best one-size-fits-all exists, but I think it makes sense to try to see if we can find something that works well.

Regarding detected DPI, makes sense. I've found that at times Tesseract picks up the DPI information embedded in the image metadata, so maybe it could be used then (again, I don't even know if it makes sense to scale according to DPI).

wollmers commented 3 years ago

@MerlijnWajer As far as I understand it, "tiled" is an implementation detail to save memory. Leptonica uses state-of-the-art integral images, which are nearly as fast as Otsu, and the speed does not depend on the window size. But the mean and standard deviation need to be stored for each pixel, so "tiled" saves a lot of memory.

Regarding the threshold k, it should be between 0.2 and 0.5. Some authors experimented with the value of k and found 0.34 to be optimal, though not critical.
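
For readers following along, k enters the standard Sauvola threshold as

T(x, y) = m(x, y) * (1 + k * (s(x, y) / R - 1))

where m and s are the local mean and standard deviation inside the window and R is the maximum dynamic range of the standard deviation (128 for 8-bit images); a larger k lowers the threshold and thus marks fewer pixels as foreground.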

I found no papers about the influence of the window size on quality. Small sizes like 15 tend to be noisier, while large ones like 50 behave more like global binarization. It would be nice if text printed in light grey between normal black text survived the binarization. In most cases 40 will be near the line height (baseline to baseline). E.g. in a book printed around 1830 and scanned at 300 dpi, most lines are 45 pixels high, which corresponds to 10 points.

Scaling images to a reasonable resolution of 150 or 300 dpi gives better recognition results in any case. I tested this for a recruiter on a sample of 100 CVs, many with a low resolution of 96 dpi. Just scaling with convert (ImageMagick) into the range of 2000 x 3000 pixels improves the CER. The explanation is simple: at 96 dpi the stems of regular-weight letters are narrower than or close to 1 pixel. Scaling them up bicubically by 300% leads to stems 2-4 pixels wide, which gives the binarization and recognition a better chance. Difference in CER: >5%.

If you use scanned old books from archive.org, which is what I do 90% of the time, use the original jp2 images (300 dpi in most cases) if available. The images in the PDFs are highly compressed and smooth shades are lost, which also limits the chances of binarization. Difference in CER: >3%.

MerlijnWajer commented 3 years ago

@MerlijnWajer As far as I understand it, "tiled" is an implementation detail to save memory. Leptonica uses state-of-the-art integral images, which are nearly as fast as Otsu, and the speed does not depend on the window size. But the mean and standard deviation need to be stored for each pixel, so "tiled" saves a lot of memory.

That matches the explanation above, thanks.

Regarding the threshold k, it should be between 0.2 and 0.5. Some authors experimented with the value of k and found 0.34 to be optimal, though not critical.

I found no papers about the influence of the window size on quality. Small sizes like 15 tend to be noisier, while large ones like 50 behave more like global binarization. It would be nice if text printed in light grey between normal black text survived the binarization. In most cases 40 will be near the line height (baseline to baseline). E.g. in a book printed around 1830 and scanned at 300 dpi, most lines are 45 pixels high, which corresponds to 10 points.

Scaling images to a reasonable resolution of 150 or 300 dpi gives better recognition results in any case. I tested this for a recruiter on a sample of 100 CVs, many with a low resolution of 96 dpi. Just scaling with convert (ImageMagick) into the range of 2000 x 3000 pixels improves the CER. The explanation is simple: at 96 dpi the stems of regular-weight letters are narrower than or close to 1 pixel. Scaling them up bicubically by 300% leads to stems 2-4 pixels wide, which gives the binarization and recognition a better chance. Difference in CER: >5%.

Could you share some more information about your tests? I am a little surprised that scaling would help that much.

If you use scanned old books from archive.org, which is what I do 90% of the time, use the original jp2 images (300 dpi in most cases) if available. The images in the PDFs are highly compressed and smooth shades are lost, which also limits the chances of binarization. Difference in CER: >3%.

I wrote the current archive.org OCR stack that is based on Tesseract (as well as the new PDF compression), and we do not scale images before we OCR them. I'd love to chat about how you OCR the content yourself and about your suggested improvements, but this issue on the bugtracker might not be the place (you can mail me merlijn at archive.org).

wollmers commented 3 years ago

Tried some examples and compared the results. It's a trial and not a systematic test.

Usage:
tesseract in.png out -c thresholding_method=<thresh>

0 - legacy (non-adaptive) Otsu (default)
1 - adaptive Otsu (Leptonica)
2 - tiled Sauvola (Leptonica)

Version: compiled from github, Tesseract v5.0.0-alpha-20210401 (git 38f0fdc) on Linux/Debian

@w4:~$ tesseract --version
tesseract 5.0.0-alpha-20210401
 leptonica-1.78.0
  libgif 5.1.4 : libjpeg 6b (libjpeg-turbo 1.5.2) : libpng 1.6.36 : libtiff 4.1.0 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
 Found AVX2
 Found AVX
 Found FMA
 Found SSE4.1
 Found OpenMP 201511

RESULTS

Scanned historic Books

Samples: https://github.com/wollmers/ocr-tess-issues/tree/main/issues/issue_3083_binarisation

Sauvola is an improvement compared to (legacy) global binarisation.

See example https://github.com/wollmers/ocr-tess-issues/blob/main/issues/issue_3083_binarisation/naturgeschichte00gt_0014.png and the resulting files.

CER (Character Error Rate) of Sauvola is 0.0045 compared to global 0.0063.

Adaptive Otsu (Leptonica) is a problem on pages with large areas of empty but noisy background, e. g. title pages.

See black blocks https://github.com/wollmers/ocr-tess-issues/blob/main/issues/issue_3083_binarisation/isisvonoken1826oken_0007.tessinput.thresh1.tif

or emphasised shine-through letters https://github.com/wollmers/ocr-tess-issues/blob/main/issues/issue_3083_binarisation/isisvonoken1826oken_0137.tessinput.thresh1.tif

CER of the last example

thresholding_method=0  => 0.1355
thresholding_method=1  => 0.0793
thresholding_method=2  => 0.2113

These high CERs and the big differences cannot be explained by the binarisation methods, but some small differences between the binarised images seem to intensify or reduce the impact of other issues.

In any case cutting page images into single lines before OCR will improve the results.

Clean computer generated pages

https://github.com/wollmers/ocr-tess-issues/tree/main/issues/issue_3083_binarisation_grey

Selected some problematic corner cases like low contrast and light on dark.

Low contrast parts disappear with all methods. Cutting the part of the image with light grey on white text out for OCR gives perfect results.

Light on dark is damaged by thresholding_method=1 and thresholding_method=2.

Legacy has stable results. Adaptive Otsu and Sauvola both damage some areas with text.

SPEED

On a 2190 × 2802 jp2 the times (incl. OCR) in seconds are

thresholding_method=0  => 3.838s
thresholding_method=1  => 3.086s # fastest
thresholding_method=2  => 3.907s

Sauvola has nearly the same speed as legacy. Adaptive Otsu (Leptonica) is significantly faster (~25%, ~1 CPU-second).

CONCLUSION

Legacy method is not so bad and should be kept in Tesseract as long as the other methods are not better in all cases.

Some behaviours of Leptonica's binarisation methods look like issues – possibly in the parameters, the algorithm, or the implementation.

MerlijnWajer commented 3 years ago

@wollmers - great report (I got sidetracked with some other stuff and didn't deliver on mine yet :-(), I wonder if it is possible that tessdata is trained with the built-in binarisation, and that affects accuracy, giving us skewed results?

amitdo commented 3 years ago

AFAIK, the training was done by generating images from ground truth without adding any background.

Tesseract can be trained on real images (RGB or grayscale).

wollmers commented 3 years ago

@MerlijnWajer AFAIK tessdata was trained from generated images using text and fonts. A part of the training images is artificially degraded (unsharpened) to train the models on low image quality too.

Training from scans of old books will and should contain degraded images (broken letters, over- or under-inked, speckles etc.). The models should be trained on degraded quality up to some limit, because scans of old books are always degraded.

MerlijnWajer commented 3 years ago

A few other thoughts:

  1. There are faster sauvola binarisation methods out there than what Leptonica uses (I think). I took code from here https://github.com/chungkwong/binarizer and ported it to Cython for some project (https://git.archive.org/merlijn/archive-pdf-tools/-/blob/master/cython/sauvola.pyx) - could try to port that to C (and change the license) and see if it is faster than the Leptonica method. But the quality matters more right now, I think. (EDIT: thresholding the below image takes 0.11s on my laptop with the cython implementation, so that might indicate it's faster, looking at the difference in time you reported)
  2. On the provided example (https://github.com/wollmers/ocr-tess-issues/blob/main/issues/issue_3083_binarisation_grey/en_cv_0025.part.jpg), Sauvola performs much more like the other two methods if the input image is inverted first. Worth noting that with the other binarisation methods, too, part of the text comes out white (the top) and part comes out black (the bottom part). I don't know what Tesseract does with these results. See https://archive.org/~merlijn/en_cv_0025.part_sauvola_inverted_k0.3_windowsize51.png
  3. As I understand it (please correct me if I am wrong, I might misremember), binarisation is used for segmentation, but not for actual recognition, so it might not actually "damage" text as long as the segmenter picks it up?
bertsky commented 3 years ago

Sorry, I'm late to the party, so I need to go back a little:

@MerlijnWajer

it might be worth checking out whether other Sauvola parameters perform better/differently for script detection. We seem to currently use a window size of 25 and a factor of 0.4. I don't know if the window size should perhaps be dependent on the DPI of the document, and if a different factor (I use 0.3 in another project) performs better.

Indeed, and IMHO that's true not just for OSD but for all use-cases. Sauvola window size needs to be consistent with pixel density, and the k parameter should be adaptable by the user (to increase or decrease fg weight depending on material and application). I have just created a hindsight review of the PR, arguing the details.

@amitdo

In general, the DPI info is not reliable. Very often its value is 0 or a wrong low value.

That's true of course, unfortunately. However, we still do have estimated_res_, which is only a stopgap / rule of thumb (multiplying the median line height by 10), but serves its purpose IMHO.

@wollmers

Regarding the threshold k, it should be between 0.2 and 0.5. Some authors experimented with the value of k and found 0.34 to be optimal, though not critical.

From my own experiments (with GT against OCR pipelines in different binarizations) I would say it really depends on the material – if the letters are very thin and there is little ink, then I would use 0.2 or even lower. If on the other hand there is a lot of ink plus bleeding or shine-through, then going 0.5 or higher might do better. So it should be up to the user (but I would also recommend the standard 0.34 default).

I found no papers about the influence of the window size on quality. Small sizes like 15 tend to be noisier, while large ones like 50 behave more like global binarization. It would be nice if text printed in light grey between normal black text survived the binarization. In most cases 40 will be near the line height (baseline to baseline). E.g. in a book printed around 1830 and scanned at 300 dpi, most lines are 45 pixels high, which corresponds to 10 points.

I can say (from above experience) that window size plays a large role if you want to optimise for different kinds of material at the same time. In your scenario with light grey letters between black lines, yes, you would need a smaller window size for them to survive. But on the other hand a very common case with historic materials is shine-through, which of course needs a larger window size (for the same reason). In your 300 DPI calculation (i.e. assuming average text is 10pt and thus gets 42px) I would argue that the window should still be larger (say 80 or 100px) because

  1. type size is not equal to line height (you still have the leading)
  2. it should be large enough to encompass a sufficient amount of background (to better balance the statistics in the Sauvola denominator)

Scaling images to a reasonable resolution of 150 or 300 dpi gives better recognition results in any case. I tested this for a recruiter on a sample of 100 CVs, many with a low resolution of 96 dpi. Just scaling with convert (ImageMagick) into the range of 2000 x 3000 pixels improves the CER. The explanation is simple: at 96 dpi the stems of regular-weight letters are narrower than or close to 1 pixel. Scaling them up bicubically by 300% leads to stems 2-4 pixels wide, which gives the binarization and recognition a better chance. Difference in CER: >5%.

I concur this is the best pragmatic approach for low-resolution input, but we should strive to improve Tesseract in a way that this becomes unnecessary. For segmentation, a better choice of window size and weight with adaptive algorithms could already deal with this at lower resolution (preserving thin lines or even reinforcing them). For recognition (at least LSTM engine/models) IIRC – please cmiiw – this should not apply anymore, because input is prescaled to the network's input size (typically 36px including padding).

Tried some examples and compared the results. It's a trial and not a systematic test.

[...]

Adaptive Otsu (Leptonica) is a problem on pages with large areas of empty but noisy background, e. g. title pages. See black blocks or emphasised shine-through letters

This speaks to the interpretation in my above-mentioned review – the current parameters for tiling are fixed and suboptimal even for 300 DPI input.

Low contrast parts disappear with all methods. Cutting the part of the image with light grey on white text out for OCR gives perfect results.

Also a matter of tile size (and window size for Sauvola) IMHO. But contrast normalization could of course help here, as @mdecerbo pointed out. (Sauvola is especially sensitive to the dynamic range of the histogram.)

But cropping and contrast stretching do not always help; there are pathological cases where a large black margin needs to stay in the histogram in order to keep shine-through at a distance from the foreground, see here.

@MerlijnWajer

I wonder if it is possible that tessdata is trained with the built-in binarisation, and that affects accuracy, giving us skewed results?

I don't think so, not quite like that. (As the others have already noted) the stock models were trained on all-white bg plus artificial noise, and for the tesstrained models it depends on the training data. And since at runtime only the grayscale image is fed, there is a soft dependency on the input colorspace.

But the internal binarization is never used for the LSTMs. (See here for call-stack and here for CER comparison of raw vs. bin vs. nrm runtime input for one stock and one tesstrain model. EDIT the latter link shows how external binarization does heavily influence recognition.)

  • There are faster sauvola binarisation methods out there than what Leptonica uses (I think). I took code from here and ported it to Cython for some project - could try to port that to C (and change the license) and see if it is faster than the Leptonica method. But the quality matters more right now, I think.

For another fast (and versatile) C++ implementation, allow me to point you to Olena/Scribo. It contains implementations for various binarization algorithms:

scribo-cli --help

(See here for the paper on multi-scale Sauvola.)

Sauvola performs much more like the other two methods if the input image is inverted first.

Yes, that's because that image is partly inverse, and Sauvola's formula is sensitive to the dynamic range. Improvements like Wolf and Jolion 2004 or Lazzara 2013 (multiscale) are robust to that (while still being locally adaptive).

Worth noting that with the other binarisation methods, too, part of the text comes out white (the top) and part comes out black (the bottom part). I don't know what Tesseract does with these results. See https://archive.org/~merlijn/en_cv_0025.part_sauvola_inverted_k0.3_windowsize51.png

Tesseract should still be fine, because during layout analysis, it checks whether a segment is inverse (with a histogram heuristic IIRC), and compensates before passing to the recognition.

  • As I understand it (please correct me if I am wrong, I might misremember), binarisation is used for segmentation, but not for actual recognition, so it might not actually "damage" text as long as the segmenter picks it up?

Yes, exactly (see above for call-tree). And layout analysis in turn is most sensitive to horizontal and vertical separator line detection...

MerlijnWajer commented 3 years ago

Here is an example of some images where the built-in binarisation fails to find anything: https://archive.org/~merlijn/tesseract-binarisation/ - both binarisation methods (1) and (2) do find text, but they find different text.

It sounds to me like @bertsky's suggestions above would be worth looking into, in particular perhaps the multi-scale Sauvola binarisation. Additionally, tweaking the current Sauvola parameters to the suggested defaults (0.34, and a window size tied to the detected DPI) would make sense as well. I don't know if the contrast normalisation still matters if we look into newer algorithms, but perhaps that's not a bad idea either...

For the sake of experimentation, it might be useful to add some code that provides a simple way to invoke an external script/binary that takes a grayscaled image and binarises it for Tesseract?

wollmers commented 3 years ago

@MerlijnWajer

A few other thoughts:

  1. There are faster sauvola binarisation methods out there than what Leptonica uses (I think). I took code from here https://github.com/chungkwong/binarizer and ported it to Cython for some project (https://git.archive.org/merlijn/archive-pdf-tools/-/blob/master/cython/sauvola.pyx) - could try to port that to C (and change the license) and see if it is faster than the Leptonica method. But the quality matters more right now, I think. (EDIT: thresholding the below image takes 0.11s on my laptop with the cython implementation, so that might indicate it's faster, looking at the difference in time you reported)

The claimed innovation of Chungkwong 2019 is only a reduction in space complexity as it doesn't need full size integral images. It provides the same result as Sauvola. Using less space can have an impact on runtime (alloc is expensive, cache misses).

What needs investigation:

"The proposed implementation is about 30% faster than the approach using integral images, but still approximately six times slower than Otsu’s method."

Compared to the claim of Shafait 2008, p. 16, Fig. 2.5: Otsu 2.7 sec, Sauvola integral images 2.9 sec.

The same author, Shafait et al. 2008, writes for 2530x3300 images: "Mean running time for the Otsu's binarization method was 2.0 secs whereas our algorithm took a mean running time of 2.8 secs."

Lazzara 2015, p. 16, Table 6 then compares the runtimes in seconds for 125 images (A4, 300 dpi).

In my experience the performance measures in scientific papers do not always hold up. The priority of academic scientists is to publish, and the focus is on complexity in terms of O(x) for the core of the method. You never know how carefully the code is tuned and whether the compared methods are tuned to the same level. In the end you can only compare the speed embedded in the whole process, because data structures can have an impact on other parts.

For quality and robustness Lazzara (Multi Scale Sauvola, MSxk) looks promising.

amitdo commented 3 years ago

About multi-scale Sauvola: the license of the reference implementation is GPL. They say it is only better than Sauvola for scans of magazines.

About k=0.34: Leptonica uses 0.4 in its example programs. I also read a paper that mentions 0.34 but says 0.4 is the optimal value for OCR, based on actual testing and evaluation with OCR software.

MerlijnWajer commented 3 years ago

@wollmers

A few other thoughts:

  1. There are faster sauvola binarisation methods out there than what Leptonica uses (I think). I took code from here https://github.com/chungkwong/binarizer and ported it to Cython for some project (https://git.archive.org/merlijn/archive-pdf-tools/-/blob/master/cython/sauvola.pyx) - could try to port that to C (and change the license) and see if it is faster than the Leptonica method. But the quality matters more right now, I think. (EDIT: thresholding the below image takes 0.11s on my laptop with the cython implementation, so that might indicate it's faster, looking at the difference in time you reported)

The claimed innovation of Chungkwong 2019 is only a reduction in space complexity as it doesn't need full size integral images. It provides the same result as Sauvola. Using less space can have an impact on runtime (alloc is expensive, cache misses).

Yes, that's what I was alluding to. I wasn't trying to suggest it was better than regular Sauvola, just faster.

On that note, there are a few other things to take into account when measuring speed in Tesseract, too. Maybe it's too obvious, but I'd just like to mention this in particular: I've had Tesseract run for hours on a newspaper page when Otsu was used, but with Sauvola thresholding it takes a mere ~40 seconds to OCR (great for a full-size newspaper page). So the quality of the binarisation can also cause slowdowns later on, when the results need to be interpreted / segmented, etc. (That was documented here #3377)

@amitdo

About multi-scale Sauvola: the license of the reference implementation is GPL. They say it is only better than Sauvola for scans of magazines.

Which implementation are you talking about specifically? Perhaps re-implementing it might not be incredibly hard, or maybe the authors are willing to re-license that specific part under a different license. (And if there's a way to have Tesseract invoke a script/binary, then that is less of a problem.)

MerlijnWajer commented 3 years ago

@bertsky - this might make it easier to run some sbb_binarization or scribo algorithms...

I have written up a quick hack to run binarisation with an external binary/script, the patch is here [1], maybe this can help experimenting with external thresholding and binarisation tooling. I am not recommending merging this, or even necessarily pursuing this strategy, but providing this for the purpose of making experimentation easier. (I might add the estimated dpi as param, perhaps pass the full colour image (if avail) as opposed to the grayscaled image in a future revision)

With that patch in place, one can use a tool like this [2] to perform binarisation. The example tool is just like the sauvola code in Tesseract, but with 0.34 as k parameter. Interestingly enough, this also results in the following runtime warning (I see now the discussion on the binarisation PR, so this is known) that I don't see when running with Tesseract (maybe Tesseract silences/redirects them):

Warning in pixSauvolaBinarizeTiled: tile width too small; nx reduced to 151
Warning in pixSauvolaBinarizeTiled: tile height too small; ny reduced to 151

Run like this:

tesseract -c thresholding_method=3 -c binarise_tool=/home/merlijn/archive/tesseract-src/simple-bin/main ~/Downloads/45_burnin-for-you_blue-yster-cult_jp2/45_burnin-for-you_blue-yster-cult_0003.jp2 -

One thing to note is that the tool needs to output both the binary and thresholded image, since Tesseract seems to use that later on in TextordPage. It is possible to make the thresholded image the same as the binary image (in case some method you are using does not output thresholds), but I don't know if/how that affects the results; it seems to "work" in a simple case. The binary image must be 1bpp.

[1] https://archive.org/~merlijn/tesseract-binarisation/0001-HACK-WIP-allow-binarisation-with-external-tool.patch [2] https://archive.org/~merlijn/tesseract-binarisation/main.c
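
For anyone who wants to try the same route, here is a sketch of such an external tool; the command-line contract (an input path plus two output paths) is only a guess for illustration, the real interface is whatever the patch [1] defines, and [2] remains the authoritative example:

#include <leptonica/allheaders.h>

// usage (hypothetical): binarise <input image> <binary out> <thresholds out>
int main(int argc, char** argv) {
  if (argc != 4) return 1;
  Pix* input = pixRead(argv[1]);
  if (!input) return 1;
  Pix* gray = pixConvertTo8(input, 0);      // ensure 8-bpp grayscale
  Pix* thresholds = nullptr;
  Pix* binary = nullptr;                    // must end up 1 bpp for Tesseract
  pixSauvolaBinarizeTiled(gray, 25, 0.34f, 1, 1, &thresholds, &binary);
  pixWrite(argv[2], binary, IFF_PNG);
  pixWrite(argv[3], thresholds, IFF_PNG);
  pixDestroy(&input);
  pixDestroy(&gray);
  pixDestroy(&thresholds);
  pixDestroy(&binary);
  return 0;
}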

stweil commented 3 years ago

Instead of extending Tesseract to use an external tool for binarization, we could also add a parameter which either provides the filename of the binarized image or which forces Tesseract to look for a binarized version of its input image (same filename, but with an added .bin before the filename extension).

Robyer commented 3 years ago

Instead of extending Tesseract to use an external tool for binarization, we could also add a parameter which either provides the filename of the binarized image or which forces Tesseract to look for a binarized version of its input image (same filename, but with an added .bin before the filename extension).

It must be possible to provide binarised image directly when working without direct access to files (e.g. on Android). So just extend API and add method to provide both color (grayscale) image and binarized one (as I suggested previously https://github.com/tesseract-ocr/tesseract/issues/3083#issuecomment-762537869 ).

MerlijnWajer commented 3 years ago

Instead of extending Tesseract to use an external tool for binarization, we could also add a parameter which either provides the filename of the binarized image or which forces Tesseract to look for a binarized version of its input image (same filename, but with an added .bin before the filename extension).

It must be possible to provide binarised image directly when working without direct access to files (e.g. on Android). So just extend API and add method to provide both color (grayscale) image and binarized one (as I suggested previously #3083 (comment) ).

This would not provide Tesseract with the threshold values from the binarisation that it appears to use, and it also does not provide the binarisation tool(s) with the Tesseract-estimated DPI, which could be useful to some, though.

amitdo commented 3 years ago

@stweil, I was planning to improve this feature after the release of 5.0.0. Do you think we should do it for 5.0.0 ?

stweil commented 3 years ago

If it is possible to improve it later either in a bug fix release or in a minor update 5.1.0 without breaking the 5.0.0 API, that's fine, too.

amitdo commented 3 years ago

There is the API and there is the ABI. In any case, I'm not sure we can avoid breaking them with all the changes people have requested here.

bertsky commented 3 years ago

It must be possible to provide binarised image directly when working without direct access to files (e.g. on Android). So just extend API and add method to provide both color (grayscale) image and binarized one (as I suggested previously #3083 (comment) ).

This would not provide Tesseract with the threshold values from the binarisation that it appears to use, and it also does not provide the binarisation tool(s) with the Tesseract-estimated DPI, which could be useful to some, though.

True, but we could extend the API to allow providing both the binarized image and the threshold levels, if available (perhaps with arity polymorphism). As to the concrete identifiers, we would have to decide between the terminology of TessBaseAPI (which already contains GetThresholdedImage) and PageIterator (which already contains GetBinaryImage)...

And regarding estimated DPI as input to binarization, I agree this is important. But we could just allow querying for that. (Tesseract should really have exposed its internal estimated_res_ via the API all along. Even if the current DPI plausibility check and fallback estimation is not perfect, it is already useful in itself and can still be improved further.)
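
Sketching that combination (none of these declarations exist in the current API; the names just follow the GetThresholdedImage convention mentioned above):

// Hypothetical additions to tesseract::TessBaseAPI, for discussion only:
//
//   // externally binarized 1-bpp image, plus optional threshold levels so
//   // that ComputeEdgeOffsets can still refine blob outlines against the
//   // grayscale image
//   void SetThresholdedImage(Pix* binary_pix, Pix* threshold_levels = nullptr);
//
//   // expose the internal estimate (estimated_res_) to external binarizers
//   int GetEstimatedResolution() const;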

amitdo commented 3 years ago

The parameters for the new Leptonica based binarization methods are now exposed to the users. The *_size parameters are relative to the DPI of the input image.

tesseract --print-parameters | grep thresholding_
thresholding_method 0   Thresholding method: 0 = Otsu, 1 = LeptonicaOtsu, 2 = Sauvola
thresholding_debug  0   Debug the thresholding process
thresholding_window_size    0.33    Window size for measuring local statistics (to be multiplied by image DPI). This parameter is used by the Sauvola thresolding method
thresholding_kfactor    0.34    Factor for reducing threshold due to variance. This parameter is used by the Sauvola thresolding method. Normal range: 0.2-0.5
thresholding_tile_size  0.33    Desired tile size (to be multiplied by image DPI). This parameter is used by the LeptonicaOtsu thresolding method
thresholding_smooth_kernel_size 0   Size of convolution kernel applied to threshold array (to be multiplied by image DPI). Use 0 for no smoothing. This parameter is used by the LeptonicaOtsu thresolding method
thresholding_score_fraction 0.1 Fraction of the max Otsu score. This parameter is used by the LeptonicaOtsu thresolding method. For standard Otsu use 0.0, otherwise 0.1 is recommended
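
For example, with these exposed, a run that keeps Sauvola but raises k for heavily inked material could look like this (same syntax as the earlier examples):

tesseract in.png out -c thresholding_method=2 -c thresholding_kfactor=0.5
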
rmast commented 3 years ago

I've seen an early DjVu binarization algorithm in https://github.com/hsnr-gamera/gamera-4/blob/master/include/plugins/threshold.hpp. The patent for better DjVu binarization is about to expire: https://github.com/jwilk/didjvu/issues/21. However, the result is still not stable for inverted (light text on dark background), inkjet-printed text that arrived folded in snail mail.

So I wondered whether the confidence of a found word could help improve the binarization result, by locally moving the binarization threshold to try to find a (part of a) missing or incomplete letter that should be there according to the dictionary. That strategy, however, would not allow a separate binarization in advance, but would require an integrated binarization.

With the scanner as a source you could even rescan a detail at a higher resolution to improve the confidence.

bertsky commented 3 years ago

@rmast, I was not aware DjVu contained its own thresholding algorithm, but I don't see why we should care as long as it's patented or otherwise unfree – we have plenty of other algorithms to compare and integrate. Or defer to external binarization (which would also be able to cover newer data-driven methods).

So I wondered whether the confidence of a found word could help to even improve the binarization-result, by locally moving the binarization threshold to try to find a (part of a) missing or incomplete letter that should be there according to the dictionary. That strategy however would not allow a separate binarization in advance, but an integrated binarization.

That's a great idea. It makes sense for very difficult material and more complex OCR pipelines, for example by doing page-level binarization at different thresholds, followed by layout analysis on one of them and text recognition on each of them, and then picking the results from the threshold with the highest text confidence – per word/line/region. But Tesseract (like most state-of-the-art engines) does not require (or use) bitonal input – the text recognition will always take the grayscale image (except of course in the case of external binarization); only the layout analysis uses the bitonal image. And we do not have a clear quality signal / confidence measure for layout, I'm afraid.
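
As a rough illustration of the page-level variant of that idea (binarize at a few global thresholds, recognize each, keep the most confident page), using only existing TessBaseAPI calls and assuming api has already been initialised; picking per word/line/region would need iterator-level bookkeeping on top of this:

#include <string>
#include <leptonica/allheaders.h>
#include <tesseract/baseapi.h>

std::string BestOfThresholds(tesseract::TessBaseAPI& api, Pix* gray /* 8 bpp */) {
  std::string best_text;
  int best_conf = -1;
  for (int threshold : {96, 128, 160}) {   // arbitrary sample thresholds
    Pix* binary = pixThresholdToBinary(gray, threshold);
    api.SetImage(binary);                  // bitonal input, used as-is
    api.Recognize(nullptr);
    int conf = api.MeanTextConf();         // mean word confidence, 0-100
    if (conf > best_conf) {
      best_conf = conf;
      char* text = api.GetUTF8Text();
      best_text = text ? text : "";
      delete[] text;
    }
    pixDestroy(&binary);
  }
  return best_text;
}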

rmast commented 3 years ago

If Tesseract is trained on greylevel content then indeed binarization doesn't matter. You're right there is a difference between the confidence of the original picture and the confidence of the intermediate picture saved by --oem 1 -c tessedit_write_images=1 -c tessedit_create_hocr=1

gunnar-ifp commented 2 years ago

Hi. I noticed this new feature 2 days ago and it seemed like a cool way to fix issues we have with, for example, PowerPoint slides that have white-on-dark text on one side and dark-on-white text on the other. With method 0, one side gets the short end of the stick and completely disappears. Both method 1 and 2 work fine there.

But I also noticed a lot of new random characters for scans, especially those with a more greyish/noisy background. (I can probably filter these out using the confidence supplied via hOCR.) I stored the internal images for these files. Method 2 doesn't look much different from 0; it leaves more grain, which explains the random characters. Method 1 goes completely bonkers: the whole image is filled with grain artefacts. Depending on the DPI there is more or less of it, but plenty.

Setting thresholding_smooth_kernel_size to 1 fixed this problem. With the right DPI it actually produces the best-looking binary image. I strongly recommend setting this switch to something other than 0 by default.

Update: While this smoothing fixed the grain issue, it introduces other problems, namely destroying OCR for very small font sizes. I am using the ALFA Waffenkatalog from archive.org since it contains densely packed text, and the first 70 pages have been OCRed with ABBYY, which used some tricky compression, so I can test my PDF rendering, OCR and PDF text layout extraction all in one go :). Since I render PDFs for OCR at 300 DPI, the dense text is very small (such small-text documents should probably be rendered at 600 DPI, but I only know that after the OCR). Method 0 never had a problem with this. Method 1 might actually be even better, but if smoothing is enabled, it loses many of the small characters. Bummer.