trufanov-nok / scantailor-universal

ScanTailor Universal - a fork based on Enhanced+Featured+Master versions of ST
http://scantailor.org

Mathematics? #88

Closed · zvezdochiot closed this 3 years ago

zvezdochiot commented 3 years ago

I see the need to publish the applied binarization algorithm in a pseudo-mathematical, algorithmic form.

The point is that this algorithm determines the restrictions on any further classification of individual parts of the image. At the moment the classification is the simplest possible: text / illustration. But this classification can be expanded, and to do that one must see the limitations of the binarization algorithm.

trufanov-nok commented 3 years ago

Hi,

Afaik, there is no math description of the algorithms used in ST. STU is a fork of ST and inherits all image processing algorithms from it. So the author of the original ST (Joseph Artsimovich) is the one who knows them in detail. You may try to ask him.

There are some references to scientific papers in the References tab of ST's About dialog that might be helpful:

[1] Young, Ian T., and Lucas J. Van Vliet. "Recursive implementation of the Gaussian filter." Signal processing 44.2 (1995): 139-151.

[2] Triggs, Bill, and Michaël Sdika. "Boundary conditions for Young-van Vliet recursive filtering." Signal Processing, IEEE Transactions on 54.6 (2006): 2365-2367.

[3] Geusebroek, Jan-Mark, Arnold WM Smeulders, and Joost Van de Weijer. "Fast anisotropic gauss filtering." Image Processing, IEEE Transactions on 12.8 (2003): 938-943.

[4] Bischof, Christian, George Corliss, and Andreas Griewank. "Structured second- and higher-order derivatives through univariate Taylor series." Optimization Methods and Software 2.3-4 (1993): 211-232.

[5] Pottmann, Helmut, and Stefan Leopoldseder. "A concept for parametric surface fitting which avoids the parametrization problem." Computer Aided Geometric Design 20.6 (2003): 343-362.

[6] Wang, Wenping, Helmut Pottmann, and Yang Liu. "Fitting b-spline curves to point clouds by squared distance minimization." Preprint, University of Hong Kong (2004).

[7] Otsu, Nobuyuki. "A threshold selection method from gray-level histograms." IEEE Transactions on Systems, Man, and Cybernetics 9.1 (1979): 62-66.

[8] Mokji, Musa Mohd, and S. A. R. Abu-Bakar. "Adaptive thresholding based on co-occurrence matrix edge information." Modelling & Simulation, 2007. AMS'07. First Asia International Conference on. IEEE, 2007.

[9] Niblack, Wayne. An Introduction to Digital Image Processing. Englewood Cliffs, NJ: Prentice Hall, 1986. 115-116.

[10] Gatos, Basilios, Ioannis Pratikakis, and Stavros J. Perantonis. "An adaptive binarization technique for low quality historical documents." Document Analysis Systems VI. Springer Berlin Heidelberg, 2004. 102-113.

[11] Sauvola, Jaakko, and Matti Pietikäinen. "Adaptive document image binarization." Pattern recognition 33.2 (2000): 225-236.

[12] Wolf, Christian, Jean-Michel Jolion, and Francoise Chassaing. "Text localization, enhancement and binarization in multimedia documents." Pattern Recognition, 2002. Proceedings. 16th International Conference on. Vol. 2. IEEE, 2002.

[13] Van Herk, Marcel. "A fast algorithm for local minimum and maximum filters on rectangular and octagonal kernels." Pattern Recognition Letters 13.7 (1992): 517-521.

[14] Breuel, Thomas M. "Finding lines under bounded error." Pattern recognition 29.1 (1996): 167-178.

[15] Vincent, Luc. "Morphological grayscale reconstruction in image analysis: applications and efficient algorithms." Image Processing, IEEE Transactions on 2.2 (1993): 176-201.

[16] Lu, Shijian, and Chew Lim Tan. "Binarization of badly illuminated document images through shading estimation and compensation." Document Analysis and Recognition, 2007. ICDAR 2007. Ninth International Conference on. Vol. 1. IEEE, 2007.

[17] Cao, Huaigu, Xiaoqing Ding, and Changsong Liu. "Rectifying the bound document image captured by the camera: A model based approach." Document Analysis and Recognition, 2003. Proceedings. Seventh International Conference on. IEEE, 2003.

[18] Blanc, Carole, and Christophe Schlick. "X-splines: A spline model designed for the end-user." Proceedings of the 22nd annual conference on Computer graphics and interactive techniques. ACM, 1995.

[19] Danielsson, Per-Erik. "Euclidean distance mapping." Computer Graphics and Image Processing 14.3 (1980): 227-248.

[20] Meijster, Arnold, Jos BTM Roerdink, and Wim H. Hesselink. "A general algorithm for computing distance transforms in linear time." Mathematical Morphology and its applications to image and signal processing. Springer US, 2002. 331-340.

Probably some details were explained by the author (Tulon) on the ru-board forum here or here. There are 250+ pages of discussions.

Another way is to deconstruct the algorithm from the code, I guess starting from here. I would be able to explain the code, but my knowledge of image processing isn't enough to explain it in terms of math or to recognize in it the use of well-known approaches from the research papers above.

zvezdochiot commented 3 years ago

A pseudo-mathematical description should be made from the ready-made algorithm; otherwise it will be of little use. The main thing is to understand that such a description is necessary to understand the limitations of the segmenter. At this stage the segmenter is able to pick out most of the illustrations, but it does not distinguish between solid illustrations and patterns. And this does not allow using different color correction algorithms for these fundamentally different types of illustrations.

trufanov-nok commented 3 years ago

Could you provide an example of what you mean by a solid illustration?

zvezdochiot commented 3 years ago
trufanov-nok commented 3 years ago

How about this brief explanation:

If you enable "Tools/Settings/General/Debug mode", then select an image at the Output processing stage that's in Mixed mode (has both text to binarize and illustrations) and choose "Regenerate" in the context menu, you'll get additional "Debug images" (tabs on top of the result preview panel). These are, roughly: to_be_normalized, downscaled, preproc_method1, preproc_method2, raw_diff, approx_diff, raw_vs_approx_diff, use_method2, area_to_consider, mask, eroded, lines_extended, background, normalized_illumination, smoothed, stretched, eroded, dilated, gray_gradient, marker, reconstructed, reconstructed_inverted, holes_filled, bw_mask, bw_mask with zones, binarized_and_cropped, edges_smoothed, big_components_unified, voronoi, despeckled. All of these are results of some processing step or sub-step. I'll refer to these debug images below.

Ok, let's consider the simplest case: without dewarping (i.e. OutputGenerator::processWithoutDewarping()).

Consider we have the original scan (usually 2 pages on one image) and processing instructions. The instructions are: a transformation matrix (the result of the page orientation and deskew processing steps), a pre-crop area (the rect that splits the scan into pages at the Split Pages processing step), a post-crop area (the rect of content inside the pre-crop area, defined at the Select Content step), and margins set at the Page Layout processing step.

/* OutputGenerator::normalizeIlluminationGray() */

  1. ST takes the part of the original scan corresponding to the content area with some fixed (required) margins around it, converts it to grayscale and then applies the transformation matrix. The result is saved to the to_be_normalized debug image.

Image color conversion is implemented in the GreyImage class. For RGB images it's done pixel by pixel with the help of Qt's qGray() function, which returns a gray value in 0..255 via the formula (r*11+g*16+b*5)/32.
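For illustration, here is a minimal sketch of that conversion, using Qt's QImage as a stand-in for ST's GreyImage (the function name is mine, not ST's):

    #include <QImage>

    // Pixel-by-pixel RGB -> grayscale conversion with Qt's qGray(),
    // i.e. (r*11 + g*16 + b*5) / 32. ST's GreyImage stores one byte
    // per pixel; an 8-bit QImage stands in for it here.
    QImage toGrayscale(QImage const& rgb)
    {
        QImage gray(rgb.size(), QImage::Format_Grayscale8);
        for (int y = 0; y < rgb.height(); ++y) {
            uchar* line = gray.scanLine(y);
            for (int x = 0; x < rgb.width(); ++x) {
                line[x] = static_cast<uchar>(qGray(rgb.pixel(x, y)));
            }
        }
        return gray;
    }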

It needs to be highlighted that although ST uses the standard Qt class (QTransform) for storing the transformation matrix, and uses standard Qt methods to apply it to objects such as points and regions, the standard Qt image rendering isn't used. That is, ST does not invoke QImage's transformed() method but implements its own rendering of a piece of the original image into a new coordinate system. I suspect that's because Qt by default doesn't support scaling with a filter other than bilinear (no bicubic!), and ST often renders images at a DPI lower or higher than the original. So ST uses its own image renderer, implemented in the method transformGeneric(). As I understand it, the implementation uses some variation of the "Superscaling" algorithm there instead of filters.
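Without claiming this is what transformGeneric() actually does, the core idea of area-averaging ("supersampling") rendering can be illustrated for the simplest case, an integer-factor downscale of a grayscale image (function name hypothetical):

    #include <QImage>

    // Each target pixel becomes the average of all source pixels that
    // map into it. ST's transformGeneric() is far more general
    // (arbitrary affine transforms, non-integer scales); this is only
    // the simplest special case of the idea.
    QImage downscaleByAveraging(QImage const& gray, int factor)
    {
        QImage dst(gray.width() / factor, gray.height() / factor,
                   QImage::Format_Grayscale8);
        for (int y = 0; y < dst.height(); ++y) {
            for (int x = 0; x < dst.width(); ++x) {
                int sum = 0;
                for (int dy = 0; dy < factor; ++dy) {
                    uchar const* line = gray.constScanLine(y * factor + dy);
                    for (int dx = 0; dx < factor; ++dx) {
                        sum += line[x * factor + dx];
                    }
                }
                dst.scanLine(y)[x] = static_cast<uchar>(sum / (factor * factor));
            }
        }
        return dst;
    }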

  2. Using the to_be_normalized image, ST constructs a "PolynomialSurface" for it. This surface is calculated not on the original image but on a sample resized to 300x300 pixels (keeping the aspect ratio), as the fit requires a lot of calculation. The result of downscaling may be found in the downscaled debug image (note: ST's own transformation implementation is used here too). I'll omit the details of the "PolynomialSurface" calculation; it looks like we won't need them. This processing produces the next 10 debug images, up to background.

  3. The obtained "PolynomialSurface" is rendered in grayscale at the original image size. It may be seen in the background debug image.

  4. The to_be_normalized image's illumination is normalized with the help of the PolynomialSurface, and the result may be seen in the normalized_illumination debug image. Note: the result is grayscale.
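The exact normalization formula would have to be checked in the code, but a common way to flatten illumination given a background estimate is to divide by it and rescale; a hedged sketch (assumes both inputs are 8-bit grayscale of equal size):

    #include <QImage>
    #include <algorithm>  // std::min, std::max

    // Divide each pixel by the rendered background and rescale to 0..255,
    // so areas under weak illumination are brightened. An illustration of
    // the idea only, not necessarily ST's exact formula.
    QImage normalizeIllumination(QImage const& gray, QImage const& background)
    {
        QImage out(gray.size(), QImage::Format_Grayscale8);
        for (int y = 0; y < gray.height(); ++y) {
            uchar const* g = gray.constScanLine(y);
            uchar const* b = background.constScanLine(y);
            uchar* o = out.scanLine(y);
            for (int x = 0; x < gray.width(); ++x) {
                int const bg = std::max<int>(b[x], 1);  // avoid division by zero
                o[x] = static_cast<uchar>(std::min(255, g[x] * 255 / bg));
            }
        }
        return out;
    }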

Next:

/* OutputGenerator::smoothToGrayscale() */

  1. The normalized_illumination image (grayscale) is smoothed with savGolFilter (a Savitzky-Golay filter). Author's commentary:
    * The Savitzky-Golay method is equivalent to fitting a small neighborhood
    * around each pixel to a polynomial, and then recalculating the pixel
    * value from it.  In practice, it achieves the same results without fitting
    * a polynomial for every pixel, so it performs quite well.
    ... <my cut>
    * Good results for 300 dpi scans are achieved with 7x7 window and 4x4 degree.

    (SavGolFilter.h)

The window size and polynomial degrees are hardcoded per band of the minimal (hor/vert) DPI: 200, 400, 800 and 800+.

The result may be seen in the smoothed debug image.
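For intuition, here is the 1-D form of Savitzky-Golay smoothing with the classic 5-point quadratic kernel (-3, 12, 17, 12, -3)/35; convolving with it is equivalent to fitting a quadratic to each 5-sample window and taking the fitted center value. ST's savGolFilter is the 2-D analogue with window and degree chosen by DPI; savGol5 is a hypothetical name:

    #include <vector>

    // 1-D Savitzky-Golay smoothing with the fixed 5-point quadratic
    // kernel. Border samples (first/last two) are left unsmoothed.
    std::vector<double> savGol5(std::vector<double> const& in)
    {
        static double const k[5] = { -3, 12, 17, 12, -3 };
        std::vector<double> out(in);
        for (int i = 2; i + 2 < static_cast<int>(in.size()); ++i) {
            double sum = 0.0;
            for (int j = -2; j <= 2; ++j) {
                sum += k[j + 2] * in[i + j];
            }
            out[i] = sum / 35.0;
        }
        return out;
    }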

/* OutputGenerator::estimateBinarizationMask() */

  1. The smoothed image is going to be binarized, and to do that we need to create a mask of areas (illustrations) that should be excluded from this process. The resulting mask can be seen in the bw_mask debug image. The 8 debug images between smoothed and bw_mask are sub-steps of this process.

  2. The mask is estimated on a downscaled version of the smoothed image. It's downscaled to 300 DPI if needed.

  3. The pictures are detected with OutputGenerator::detectPictures() in the following way:

3.1. The downscaled image is "stretched to gray range" with stretchGrayRange. Author's commentary:

    // We stretch the range of gray levels to cover the whole
    // range of [0, 255].  We do it because we want text
    // and background to be equally far from the center
    // of the whole range.  Otherwise text printed with a big
    // font will be considered a picture.

    The result can be seen as the stretched debug image. Technically, it constructs a gray-level histogram, then finds the min and max levels of gray (scanning from pure black towards white and from pure white towards black) that cumulatively contain less than 1% of dark and light pixels (1% of the image area), excludes these gray levels from the color table, and constructs a new image where all non-excluded levels of gray are evenly redistributed over the 255 gray levels again. So some dark pixels may become darker and some light pixels lighter (a sketch of the idea follows below).
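A sketch of that idea, with 1% clipping and a linear remap (not ST's literal stretchGrayRange(); the function body is mine):

    #include <QImage>
    #include <algorithm>  // std::max
    #include <array>

    // Find the darkest and lightest gray levels such that at most ~1% of
    // pixels lie beyond each of them, then linearly remap [lo, hi] onto
    // the full [0, 255] range.
    QImage stretchGrayRange(QImage const& gray)
    {
        std::array<long long, 256> hist = {};
        long long const total = (long long)gray.width() * gray.height();
        for (int y = 0; y < gray.height(); ++y) {
            uchar const* line = gray.constScanLine(y);
            for (int x = 0; x < gray.width(); ++x) {
                ++hist[line[x]];
            }
        }
        long long const clip = total / 100;  // 1% of the image area
        long long acc = 0;
        int lo = 0;
        while (lo < 255 && acc + hist[lo] < clip) { acc += hist[lo]; ++lo; }
        acc = 0;
        int hi = 255;
        while (hi > 0 && acc + hist[hi] < clip) { acc += hist[hi]; --hi; }

        QImage out(gray.size(), QImage::Format_Grayscale8);
        int const range = std::max(1, hi - lo);
        for (int y = 0; y < gray.height(); ++y) {
            uchar const* src = gray.constScanLine(y);
            uchar* dst = out.scanLine(y);
            for (int x = 0; x < gray.width(); ++x) {
                int const v = (src[x] - lo) * 255 / range;
                dst[x] = static_cast<uchar>(qBound(0, v, 255));
            }
        }
        return out;
    }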

3.2. Then the stretched debug image is eroded, and the result can be seen in the eroded debug image. The stretched image is also dilated, and the result can be seen in the dilated debug image. Both morphological operations are made with a "brick" of size 3x3. Note: the dilate and erode algorithms are implemented to operate on a grayscale image, not a b/w one, so they spread lighter or darker pixels, not just pure black or white ones. Probably with some gradient; I didn't check.

3.3. Both the eroded and dilated versions of the stretched image are combined together (the dilated version is inverted first), and the result can be seen in the gray_gradient debug image. We don't need the stretched, eroded and dilated versions anymore.

3.4. The gray_gradient version is eroded (as grayscale) again, but now with a brick of size 35x35. The result can be seen as the marker debug image. (A naive sketch of grayscale erosion follows below.)
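A naive sketch of grayscale erosion with a w x h brick; dilation is the same with a maximum, and the gray_gradient image of 3.3 is built from the two. ST's real implementation is optimized (cf. van Herk [13] in the references), so this only shows the definition:

    #include <QImage>
    #include <algorithm>  // std::min

    // Every output pixel becomes the minimum gray value inside the
    // w x h window centered on it, so dark regions grow. Out-of-image
    // coordinates are clamped to the nearest edge pixel here.
    QImage erodeGray(QImage const& gray, int w, int h)
    {
        QImage out(gray.size(), QImage::Format_Grayscale8);
        for (int y = 0; y < gray.height(); ++y) {
            for (int x = 0; x < gray.width(); ++x) {
                int minVal = 255;
                for (int dy = -h / 2; dy <= h / 2; ++dy) {
                    int const yy = qBound(0, y + dy, gray.height() - 1);
                    uchar const* line = gray.constScanLine(yy);
                    for (int dx = -w / 2; dx <= w / 2; ++dx) {
                        int const xx = qBound(0, x + dx, gray.width() - 1);
                        minVal = std::min(minVal, (int)line[xx]);
                    }
                }
                out.scanLine(y)[x] = static_cast<uchar>(minVal);
            }
        }
        return out;
    }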

3.5. The gray_gradient and marker versions are combined to make the reconstructed debug image. The combination is made with seedFillGrayInPlace(), which is the in-place version of seedFillGray(). The latter has the following commentary from the author:

/**
 * \brief Spread darker colors from seed as long as mask allows it.
 *
 * The result of this operation is an image where some areas are lighter
 * than in \p mask, because there were no dark paths linking them to dark
 * areas in \p seed.
 * \par
 * \p seed is allowed to contain pixels darker than the corresponding pixels
 * in \p mask.  Such pixels will be made equal to their mask values.
 * \par
 * The underlying code implements Luc Vincent's hybrid seed-fill algorithm:
 * http://www.vincent-net.com/luc/papers/93ieeeip_recons.pdf
 */

Values are spread to 8 immediate neighbors.
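A naive iterative equivalent of this operation, i.e. grayscale morphological reconstruction; ST implements Vincent's much faster hybrid algorithm [15], but the fixed point reached is the same. Darker means a smaller gray value here, and the function name is mine:

    #include <QImage>
    #include <algorithm>  // std::min, std::max

    // "Spread darker colors from seed as long as mask allows it":
    // darkness propagates to the 8 immediate neighbors, but a pixel
    // never becomes darker than its mask value. Naive iteration to a
    // fixed point, for illustration only.
    void seedFillGrayNaive(QImage& seed, QImage const& mask)
    {
        int const w = seed.width(), h = seed.height();
        // Seed pixels darker than the mask are raised to the mask value.
        for (int y = 0; y < h; ++y) {
            for (int x = 0; x < w; ++x) {
                seed.scanLine(y)[x] =
                    std::max(seed.constScanLine(y)[x], mask.constScanLine(y)[x]);
            }
        }
        bool changed = true;
        while (changed) {  // iterate until nothing propagates anymore
            changed = false;
            for (int y = 0; y < h; ++y) {
                for (int x = 0; x < w; ++x) {
                    int darkest = seed.constScanLine(y)[x];
                    for (int dy = -1; dy <= 1; ++dy) {
                        for (int dx = -1; dx <= 1; ++dx) {
                            int const yy = y + dy, xx = x + dx;
                            if (yy < 0 || yy >= h || xx < 0 || xx >= w) continue;
                            darkest = std::min(darkest, (int)seed.constScanLine(yy)[xx]);
                        }
                    }
                    int const v = std::max((int)mask.constScanLine(y)[x], darkest);
                    if (v != seed.constScanLine(y)[x]) {
                        seed.scanLine(y)[x] = static_cast<uchar>(v);
                        changed = true;
                    }
                }
            }
        }
    }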

3.6. Then the reconstructed grayscale image is inverted. The result can be seen as the reconstructed_inverted debug image.

3.7. Then ST creates a frame image of the same size as reconstructed_inverted. It's a grayscale image consisting of a 1-pixel frame (black) while the inner area is white. The reconstructed_inverted image is applied to the frame image in the same way reconstructed was combined with marker at 3.5. The result is stored as the holes_filled debug image.

  4. Then the holes_filled mask (which is now referred to in the code as the picture_areas image) is scaled back from 300 DPI to the original size and binarized. The binarization threshold is hardcoded to 48.

Note: the original ST as well as Tulon's Experimental use BinaryThreshold::mokjiThreshold(picture_areas, 5, 26) instead of the hardcoded value 48. But STU and ST Advanced have 48. I think this is a change that both STU and STA inherited from Petr Kovář's Scan Tailor Enhanced. Commits: https://github.com/scantailor/scantailor/commit/b938f6c54171fb84c0a64c20a9327f4b8840bb0b https://github.com/scantailor/scantailor/commit/df34c32bf98b50f13d4256aa6c5ecb2aed895370

Was this an improvement? I hope so.

The binarization operation itself is pretty straightforward.

The result is stored as the bw_mask debug image.
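That straightforward operation, as a sketch (using 0/255 grayscale values in place of ST's BinaryImage; the function name is mine):

    #include <QImage>

    // Fixed-threshold binarization of the picture mask: pixels darker
    // than the threshold (hardcoded to 48 in STU) become black, the
    // rest white.
    QImage binarizeFixed(QImage const& gray, int const threshold = 48)
    {
        QImage bw(gray.size(), QImage::Format_Grayscale8);
        for (int y = 0; y < gray.height(); ++y) {
            uchar const* src = gray.constScanLine(y);
            uchar* dst = bw.scanLine(y);
            for (int x = 0; x < gray.width(); ++x) {
                dst[x] = (src[x] < threshold) ? 0 : 255;
            }
        }
        return bw;
    }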

/* continue processing in the parent function */

  1. At this moment bw_mask may be used to search for rectangular "picture zones" in it, to allow the user to modify them manually later in the Layers tab. This is optional and performed only if no such search has been done before.

  2. I'll omit the explanation of norm_illum_color, as it appears only in case "Equalize illumination" was disabled in Mixed mode, to roll back the changes made at step 4 of normalizeIlluminationGray.

  3. All picture zones (auto-found and user-defined) are drawn on top of bw_mask, and the result is available via the bw_mask with zones debug image.

/* OutputGenerator::binarize() */

  1. Now we binarize our smoothed image (the result of OutputGenerator::smoothToGrayscale()) with the help of the mask bw_mask with zones (the result of OutputGenerator::estimateBinarizationMask() and the few further steps) and store it as the binarized_and_cropped debug image.

  2. Binarization is pretty straightforward. Parts covered by the mask are excluded from the result. The threshold is calculated (in STU; I didn't check the others) as the Otsu threshold of the image's grayscale histogram, plus a value adjustment that may be set by the user in page options (the binarization threshold slider), bounded to [30, 225]. Author's notes:

    // Hard-bounding threshold values is necessary for example
    // if all the content went into the picture mask.

    I think this is for the case when the Otsu threshold returns an extreme value.

The result is stored as the binarized_and_cropped debug image.
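A sketch of that threshold selection: plain Otsu [7] over a 256-bin histogram, plus the user's slider adjustment, clamped to [30, 225]. The function and its plumbing are illustrative, not STU's exact code:

    #include <QtGlobal>  // qBound
    #include <array>

    // Otsu's method: choose the threshold that maximizes the
    // between-class variance of the (background, foreground) split.
    int selectThreshold(std::array<long long, 256> const& hist, int userAdjustment)
    {
        long long total = 0, sumAll = 0;
        for (int i = 0; i < 256; ++i) {
            total += hist[i];
            sumAll += (long long)i * hist[i];
        }
        long long wB = 0, sumB = 0;
        double bestVar = -1.0;
        int otsu = 127;
        for (int t = 0; t < 256; ++t) {
            wB += hist[t];                    // background weight
            if (wB == 0) continue;
            long long const wF = total - wB;  // foreground weight
            if (wF == 0) break;
            sumB += (long long)t * hist[t];
            double const mB = (double)sumB / wB;            // background mean
            double const mF = (double)(sumAll - sumB) / wF; // foreground mean
            double const var = (double)wB * wF * (mB - mF) * (mB - mF);
            if (var > bestVar) { bestVar = var; otsu = t; }
        }
        return qBound(30, otsu + userAdjustment, 225);
    }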

/* continue processing in the parent function */

  1. So now we have a binarized image without illustrations in binarized_and_cropped. Note: it's b/w now, not grayscale.

  2. Unless smoothing is disabled via "STU/Tools/Settings/Output/Black & White Mode/Disable smoothing", the binarized_and_cropped image is smoothed and stored in the edges_smoothed debug image. This smoothing was developed specifically for letter edges, and its implementation may be found in OutputGenerator::morphologicalSmoothInPlace(). As I understand it, 6 patterns (and their rotations) are applied to the edges of the letters.

  3. The resulting binarized image (either smoothed or not) is despeckled. The result is shown in the despeckled debug image. The big_components_unified and voronoi debug images are part of despeckling, but I won't go into details on it.

  4. The norm_illum_color image is combined with the b/w mask. Now we have illustrations and binarized content on a single image. At this point the true black and true white pixels are reserved (a hedged sketch follows after this list). Author's commentary:

    /**
    * In picture areas we make sure we don't use pure black and pure white colors.
    * These are reserved for text areas.  This behavior makes it possible to
    * detect those picture areas later and treat them differently, for example
    * encoding them as a background layer in DjVu format.
    */
  5. Fill zones are applied.
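As an illustration of the black/white reservation mentioned in item 4 above (a hypothetical sketch, not ST's literal code): inside picture areas, gray levels 0 and 255 are nudged to 1 and 254, so any pixel that is pure black or pure white afterwards is known to belong to the binarized text layer:

    #include <QImage>

    // Clamp picture-area pixels away from pure black/white so those two
    // values stay reserved for text. pictureMask is assumed to be an
    // 8-bit image where nonzero marks picture pixels.
    void reserveBlackAndWhite(QImage& gray, QImage const& pictureMask)
    {
        for (int y = 0; y < gray.height(); ++y) {
            uchar* g = gray.scanLine(y);
            uchar const* m = pictureMask.constScanLine(y);
            for (int x = 0; x < gray.width(); ++x) {
                if (m[x] == 0) continue;           // not a picture pixel
                if (g[x] == 0) g[x] = 1;           // pure black -> almost black
                else if (g[x] == 255) g[x] = 254;  // pure white -> almost white
            }
        }
    }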

...

PROFIT!!!

zvezdochiot commented 3 years ago

Hi @trufanov-nok .

Yes. This is a decent description. I did not understand the purpose of some of the listed procedures in the code. You can already work with such a description. The rest of the details are in the code and don't need to be written. Move this description into a separate text file and put it in the repo.

trufanov-nok commented 3 years ago

I don't think this description should be part of the source code as a file. I may move it to Discussions.

zvezdochiot commented 3 years ago

@trufanov-nok says:

I don't think this description should be a part of sourcecode as a file.

Why? It's excellent reference material for both issues and discussions, a foundation. Why leave everything hanging in the air again?