Closed by zvezdochiot 3 years ago
Hi,
As far as I know, there is no mathematical description of the algorithms used in ST. STU is a fork of ST and inherits all of its image processing algorithms, so the author of the original ST (Joseph Artsimovich) is the one who knows them in detail. You may try to ask him.
There are some references to scientific papers in the References tab of ST's About dialog that might be helpful:
[1] Young, Ian T., and Lucas J. Van Vliet. "Recursive implementation of the Gaussian filter." Signal processing 44.2 (1995): 139-151.
[2] Triggs, Bill, and Michaël Sdika. "Boundary conditions for Young-van Vliet recursive filtering." Signal Processing, IEEE Transactions on 54.6 (2006): 2365-2367.
[3] Geusebroek, Jan-Mark, Arnold WM Smeulders, and Joost Van de Weijer. "Fast anisotropic gauss filtering." Image Processing, IEEE Transactions on 12.8 (2003): 938-943.
[4] Bischof, Christian, George Corliss, and Andreas Griewank. "Structured second-and higher-order derivatives through univariate Taylor series." Optimization Methods and Software 2.3-4 (1993): 211-232.
[5] Pottmann, Helmut, and Stefan Leopoldseder. "A concept for parametric surface fitting which avoids the parametrization problem." Computer Aided Geometric Design 20.6 (2003): 343-362.
[6] Wang, Wenping, Helmut Pottmann, and Yang Liu. "Fitting b-spline curves to point clouds by squared distance minimization." Preprint, University of Hong Kong (2004).
[7] Otsu, Nobuyuki. "A threshold selection method from gray-level histograms." Automatica 11.285-296 (1975): 23-27.
[8] Mokji, Musa Mohd, and S. A. R. A. Bakar. "Adaptive thresholding based on co-occurrence matrix edge information." Modelling & Simulation, 2007. AMS'07. First Asia International Conference on. IEEE, 2007.
[9] Niblack, Wayne. An introduction to digital image processing. Englewood Cliffs, N. J., Prentice Hall (1986) 115-116
[10] Gatos, Basilios, Ioannis Pratikakis, and Stavros J. Perantonis. "An adaptive binarization technique for low quality historical documents." Document Analysis Systems VI. Springer Berlin Heidelberg, 2004. 102-113.
[11] Sauvola, Jaakko, and Matti Pietikäinen. "Adaptive document image binarization." Pattern recognition 33.2 (2000): 225-236.
[12] Wolf, Christian, Jean-Michel Jolion, and Francoise Chassaing. "Text localization, enhancement and binarization in multimedia documents." Pattern Recognition, 2002. Proceedings. 16th International Conference on. Vol. 2. IEEE, 2002.
[13] Van Herk, Marcel. "A fast algorithm for local minimum and maximum filters on rectangular and octagonal kernels." Pattern Recognition Letters 13.7 (1992): 517-521.
[14] Breuel, Thomas M. "Finding lines under bounded error." Pattern recognition 29.1 (1996): 167-178.
[15] Vincent, Luc. "Morphological grayscale reconstruction in image analysis: applications and efficient algorithms." Image Processing, IEEE Transactions on 2.2 (1993): 176-201.
[16] Lu, Shijian, and Chew Lim Tan. "Binarization of badly illuminated document images through shading estimation and compensation." Document Analysis and Recognition, 2007. ICDAR 2007. Ninth International Conference on. Vol. 1. IEEE, 2007.
[17] Cao, Huaigu, Xiaoqing Ding, and Changsong Liu. "Rectifying the bound document image captured by the camera: A model based approach." Document Analysis and Recognition, 2003. Proceedings. Seventh International Conference on. IEEE, 2003.
[18] Blanc, Carole, and Christophe Schlick. "X-splines: A spline model designed for the end-user." Proceedings of the 22nd annual conference on Computer graphics and interactive techniques. ACM, 1995.
[19] Danielsson, Per-Erik. "Euclidean distance mapping." Computer Graphics and image processing 14.3 (1980): 227-248.
[20] Meijster, Arnold, Jos BTM Roerdink, and Wim H. Hesselink. "A general algorithm for computing distance transforms in linear time." Mathematical Morphology and its applications to image and signal processing. Springer US, 2002. 331-340.
Probably some details were explained by the author (Tulon) on the ru-board forum here or here. There are 250+ pages of discussion.
Another way is to reconstruct the algorithm from the code; I guess starting from here. I would be able to explain the code, but my knowledge of image processing isn't enough to explain it in mathematical terms or to recognize in it the use of the well-known approaches from the research papers above.
A pseudo-mathematical description should be made from the ready-made algorithm; otherwise it will be of little use. The main thing to understand is that such a description is necessary to see the limitations of the segmenter. At this stage the segmenter is able to pick out most of the illustrations, but it does not distinguish between solid illustrations and patterns, and this does not allow using different color correction algorithms for these fundamentally different types of illustrations.
Could you provide an example of what you mean by a solid illustration?
How about this brief explanation:
If you enable "Tools/Settings/General/Debug mode", then select an image at the Output processing stage that's in Mixed mode (has both text to binarize and illustrations) and choose "Regenerate" in the context menu, you'll get additional "Debug images" (tabs on top of the result preview panel).
These are roughly: to_be_normalized, downscaled, preproc_method1, preproc_method2, raw_diff, approx_diff, raw_vs_approx_diff, use_method2, area_to_consider, mask, eroded, lines_extended, background, normalized_illumination, smoothed, stretched, eroded, dilated, gray_gradient, marker, reconstructed, reconstructed_inverted, holes_filled, bw_mask, bw_mask with zones, binarized_and_cropped, edges_smoothed, big_components_unified, voronoi, despeckled.
All of these are results of some processing step or sub-step. Let's refer to these debug images below.
Okay, let's consider the simplest case: without dewarping (note: OutputGenerator::processWithoutDewarping()).
/// OutputGenerator::normalizeIlluminationGray() ///
See the to_be_normalized debug image. Image color conversion is implemented in the GreyImage class. For RGB images it's done pixel by pixel with the help of Qt's qGray() method, which returns a grey value in 0..255 via the formula (r*11 + g*16 + b*5) / 32.
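For illustration, the conversion formula can be reproduced in a few lines (qGrayLike is my own name for this sketch; the real conversion lives in Qt's qGray() and ST's GreyImage):

```cpp
#include <cassert>

// Qt's qGray() luminance approximation: integer-only weights that roughly
// follow the classic 0.30R + 0.59G + 0.11B luma formula.
// Since 11 + 16 + 5 = 32, the result always stays within 0..255.
inline int qGrayLike(int r, int g, int b) {
    return (r * 11 + g * 16 + b * 5) / 32;
}
```

The integer weights avoid floating-point work per pixel, which matters when converting whole scans.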
It needs to be highlighted that although ST uses the standard Qt class (QTransform) for storing the transformation matrix, and uses standard Qt methods to apply it to objects such as points and regions, the standard Qt image rendering isn't used. I mean that ST does not invoke the transform() method of QImage but implements its own rendering of a piece of the original image into a new coordinate system. I suspect that's because Qt by default doesn't support scaling with any filter other than bilinear (no bicubic!), and ST often renders images at a dpi lower or higher than the original. So ST uses its own image renderer, implemented in the method transformGeneric(). As I understand it, the implementation uses some variation of the "Superscaling" algorithm there instead of filters.
Using the to_be_normalized image, ST constructs a "PolynomialSurface" for it. This surface is calculated not from the original image but from a sample resized to 300x300 pixels (keeping the aspect ratio), as it requires a lot of calculation. The result of downscaling may be found in the downscaled debug image (note: ST's own implementation of transformation is used here too). I would omit the details of the "PolynomialSurface" calculation; it looks like we won't need them. This processing produces the next 10 debug images, up to background.
The obtained "PolynomialSurface" is rendered in grayscale at the original image size; it may be seen in the background debug image. The illumination of the to_be_normalized image is then normalized with the help of the PolynomialSurface, and the result may be seen in the normalized_illumination debug image. Note: the result is grayscale.
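ST's exact arithmetic lives in OutputGenerator::normalizeIlluminationGray() and may differ in detail, but the usual scheme for flattening uneven lighting, given an estimated background surface, is to divide each pixel by the local background brightness so the background itself maps to white. A minimal sketch under that assumption (normalizeIllumination and the flat vectors are my own simplification, not ST's API):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hedged sketch: rescale each pixel so that the estimated background level
// at that point maps to 255. Darker-than-background content (text) stays
// dark; shading gradients are flattened out.
std::vector<std::uint8_t> normalizeIllumination(
        const std::vector<std::uint8_t>& gray,
        const std::vector<std::uint8_t>& background) {
    std::vector<std::uint8_t> out(gray.size());
    for (std::size_t i = 0; i < gray.size(); ++i) {
        const int bg = std::max<int>(background[i], 1); // avoid division by zero
        out[i] = static_cast<std::uint8_t>(std::min(255, gray[i] * 255 / bg));
    }
    return out;
}
```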
Next:
/// OutputGenerator::smoothToGrayscale() ///
The normalized_illumination image (grayscale) is smoothed with savGolFilter (a Savitzky-Golay filter). Author's commentary:
* The Savitzky-Golay method is equivalent to fitting a small neighborhood
* around each pixel to a polynomial, and then recalculating the pixel
* value from it. In practice, it achieves the same results without fitting
* a polynomial for every pixel, so it performs quite well.
... <my cut>
* Good results for 300 dpi scans are achieved with 7x7 window and 4x4 degree.
(SavGolFilter.h)
The window size and polynomial degrees are hardcoded per minimal (horizontal/vertical) dpi bracket: 200, 400, 800 and 800+.
The result may be seen in the smoothed debug image.
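To make the quoted idea concrete, here is a 1-D sketch of Savitzky-Golay smoothing with the classic 5-point, degree-2 coefficients. ST's SavGolFilter works in 2-D with the window/degree choices mentioned above; savgol5 is my own illustrative function, not ST's code:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Savitzky-Golay smoothing as a fixed convolution: the coefficients below
// are exactly equivalent to least-squares-fitting a degree-2 polynomial to
// each 5-point neighborhood and re-evaluating it at the center point.
// In particular, any quadratic signal passes through unchanged.
std::vector<double> savgol5(const std::vector<double>& y) {
    static const double c[5] = {-3, 12, 17, 12, -3}; // divided by 35 below
    std::vector<double> out(y);                      // edges left unfiltered
    for (std::size_t i = 2; i + 2 < y.size(); ++i) {
        double s = 0;
        for (int k = -2; k <= 2; ++k) s += c[k + 2] * y[i + k];
        out[i] = s / 35.0;
    }
    return out;
}
```

This is why the filter preserves peak shapes better than a plain moving average while still suppressing noise.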
/// OutputGenerator::estimateBinarizationMask() ///
The smoothed image is going to be binarized, and to do that we need to create a mask of areas (illustrations) that will be excluded from this process. The resulting mask can be seen in the bw_mask debug image. The 8 debug images between smoothed and bw_mask are sub-steps of this process. The mask is estimated on a downscaled version of the smoothed image; it's downscaled to 300 dpi if needed.
The pictures are detected with OutputGenerator::detectPictures() in the following way:
2.1. The downscaled image is "stretched to gray range" with stretchGrayRange. Author's commentary:
// We stretch the range of gray levels to cover the whole
// range of [0, 255]. We do it because we want text
// and background to be equally far from the center
// of the whole range. Otherwise text printed with a big
// font will be considered a picture.
The result can be seen in the stretched debug image. Technically, it constructs a color histogram, then finds the min and max levels of gray (scanning from pure black towards white and from pure white towards black) that cumulatively contain less than 1% of dark and light pixels (1% of the image area). These grayscale levels are excluded from the color table, and a new image is constructed in which all non-excluded levels of gray are evenly redistributed over the 255 levels of gray again. So some dark pixels may become darker and some light pixels lighter.
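A hedged sketch of this step, following my reading of the description above (this is my own simplification over a flat pixel array, not ST's stretchGrayRange() itself):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Clip the darkest and lightest gray levels that each hold under 1% of the
// pixels, then linearly re-spread the surviving range over 0..255.
std::vector<std::uint8_t> stretchGrayRange(const std::vector<std::uint8_t>& img) {
    std::vector<std::size_t> hist(256, 0);
    for (auto px : img) ++hist[px];

    const std::size_t budget = img.size() / 100; // 1% of the image area
    std::size_t acc = 0;
    int lo = 0, hi = 255;
    while (lo < 255 && acc + hist[lo] <= budget) acc += hist[lo++];
    acc = 0;
    while (hi > 0 && acc + hist[hi] <= budget) acc += hist[hi--];
    if (hi <= lo) return img; // degenerate histogram: nothing to stretch

    std::vector<std::uint8_t> out(img.size());
    for (std::size_t i = 0; i < img.size(); ++i) {
        const int v = std::max(lo, std::min<int>(img[i], hi)); // clamp to [lo, hi]
        out[i] = static_cast<std::uint8_t>((v - lo) * 255 / (hi - lo));
    }
    return out;
}
```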
2.2. Then the stretched image is eroded, and the result can be seen in the eroded debug image. The stretched image is also dilated, and the result can be seen in the dilated debug image. Both morphological operations are made with a 3x3 "Brick". Note: the dilate and erode algorithms are implemented to operate on a grayscale image, not a b/w one, so they spread lighter or darker pixels, not only pure black or white ones. Probably with some gradient (I didn't check).
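For reference, grayscale erosion and dilation with a 3x3 brick boil down to taking the darkest or lightest value in each pixel's neighborhood. The sketch below is my own simplification (flat array instead of ST's image classes; the edge-replication border handling is an assumption):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Grayscale morphology with a 3x3 structuring element: erosion takes the
// minimum (darkest) value in the neighborhood, dilation the maximum
// (lightest). Out-of-bounds neighbors are clamped to the nearest edge pixel.
std::vector<std::uint8_t> morph3x3(const std::vector<std::uint8_t>& img,
                                   int w, int h, bool erode) {
    std::vector<std::uint8_t> out(img.size());
    for (int y = 0; y < h; ++y) {
        for (int x = 0; x < w; ++x) {
            int best = erode ? 255 : 0;
            for (int dy = -1; dy <= 1; ++dy) {
                for (int dx = -1; dx <= 1; ++dx) {
                    const int nx = std::max(0, std::min(x + dx, w - 1));
                    const int ny = std::max(0, std::min(y + dy, h - 1));
                    const int v = img[ny * w + nx];
                    best = erode ? std::min(best, v) : std::max(best, v);
                }
            }
            out[y * w + x] = static_cast<std::uint8_t>(best);
        }
    }
    return out;
}
```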
2.3. Both the eroded and dilated versions of the stretched image are combined together (the dilated version is inverted first), and the result can be seen in the gray_gradient debug image. We don't need the stretched, eroded and dilated versions anymore.
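The thread doesn't say which operator "combines" the two images, so the following is an assumption on my part: if combining means saturated addition, then eroded plus the inverted dilated image equals 255 - (dilated - eroded), i.e. an inverted morphological gradient, which would match the gray_gradient name (flat areas stay light, edges become dark). Treat this sketch as a guess, not ST's actual code:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Assumption: "combined" = saturated per-pixel addition of the eroded image
// and the inverted dilated image. In flat regions eroded ~= dilated, so the
// sum saturates to white; at edges dilated >> eroded, so the result is dark.
std::vector<std::uint8_t> invertedGradient(const std::vector<std::uint8_t>& eroded,
                                           const std::vector<std::uint8_t>& dilated) {
    std::vector<std::uint8_t> out(eroded.size());
    for (std::size_t i = 0; i < eroded.size(); ++i) {
        const int sum = eroded[i] + (255 - dilated[i]); // dilated inverted first
        out[i] = static_cast<std::uint8_t>(sum > 255 ? 255 : sum);
    }
    return out;
}
```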
2.4. The gray_gradient version is eroded (as grayscale) again, but now with a 35x35 Brick. The result can be seen in the marker debug image.
2.5. The gray_gradient and marker versions are combined to make the reconstructed debug image. The combination is made with seedFillGrayInPlace(), which is an in-place version of seedFillGray(). The latter has the following commentary from the author:
/**
* \brief Spread darker colors from seed as long as mask allows it.
*
* The result of this operation is an image where some areas are lighter
* than in \p mask, because there were no dark paths linking them to dark
* areas in \p seed.
* \par
* \p seed is allowed to contain pixels darker than the corresponding pixels
* in \p mask. Such pixels will be made equal to their mask values.
* \par
* The underlying code implements Luc Vincent's hybrid seed-fill algorithm:
* http://www.vincent-net.com/luc/papers/93ieeeip_recons.pdf
*/
Values are spread to the 8 immediate neighbors.
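The quoted semantics can be implemented naively as a fixed-point iteration. This is a didactic sketch of grayscale morphological reconstruction, not ST's code; ST implements Luc Vincent's much faster hybrid algorithm from the paper linked above:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Darker (smaller) values spread from seed to the 8-connected neighbors,
// but no pixel may ever become darker than the mask allows. Iterating to a
// fixed point yields grayscale reconstruction (by erosion).
std::vector<std::uint8_t> seedFillGray(std::vector<std::uint8_t> seed,
                                       const std::vector<std::uint8_t>& mask,
                                       int w, int h) {
    // Seed pixels darker than the mask are raised to their mask values.
    for (std::size_t i = 0; i < seed.size(); ++i)
        seed[i] = std::max(seed[i], mask[i]);

    bool changed = true;
    while (changed) {
        changed = false;
        for (int y = 0; y < h; ++y) {
            for (int x = 0; x < w; ++x) {
                int darkest = seed[y * w + x];
                for (int dy = -1; dy <= 1; ++dy) {       // 8 neighbors + self
                    for (int dx = -1; dx <= 1; ++dx) {
                        const int nx = x + dx, ny = y + dy;
                        if (nx < 0 || ny < 0 || nx >= w || ny >= h) continue;
                        darkest = std::min<int>(darkest, seed[ny * w + nx]);
                    }
                }
                const std::uint8_t v = std::max<std::uint8_t>(
                    mask[y * w + x], static_cast<std::uint8_t>(darkest));
                if (v != seed[y * w + x]) { seed[y * w + x] = v; changed = true; }
            }
        }
    }
    return seed;
}
```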
2.6. Then the reconstructed grayscale image is inverted. The result can be seen in the reconstructed_inverted debug image.
2.7. Then ST creates a frame image of the same size as reconstructed_inverted. It's a grayscale image consisting of a 1-pixel black frame with a white interior. The reconstructed_inverted image is applied to the frame image in the same way reconstructed was combined with marker in 2.5. The result is stored as the holes_filled debug image.
The holes_filled mask (now referred to in the code as the picture_areas image) is scaled back from 300 dpi to the original size, and binarized. The binarization level is hardcoded to 48. Note: the original ST, as well as Tulon's Experimental, uses BinaryThreshold::mokjiThreshold(picture_areas, 5, 26) instead of the hardcoded value of 48, but STU and ST Advanced have 48. I think this is a change that both STU and STA inherited from Petr Kovář's Scan Tailor Enhanced. Commits:
https://github.com/scantailor/scantailor/commit/b938f6c54171fb84c0a64c20a9327f4b8840bb0b
https://github.com/scantailor/scantailor/commit/df34c32bf98b50f13d4256aa6c5ecb2aed895370
Was this an improvement? I hope so.
The binarization operation itself is pretty straightforward. The result is stored as the bw_mask debug image.
/// continue processing in a parent function ///
At this moment bw_mask may be used to search for rectangular "picture zones" in it, to allow the user to modify them manually later in the Layers tab. This is optional and performed only if no such search was done before.
I'll omit the explanation of norm_illum_color, as it appears only in case "Equalize illumination" was disabled in mixed mode, to roll back the changes made at step 4 of normalizeIlluminationGray.
All picture zones (auto-found and user-defined) are drawn on top of bw_mask, and the result is available via the bw_mask with zones debug image.
/// OutputGenerator::binarize() ///
Now we binarize our smoothed image (the result of OutputGenerator::smoothToGrayscale()) with the help of the mask bw_mask with zones (the result of OutputGenerator::estimateBinarizationMask() and a few further steps) and store it as the binarized_and_cropped debug image.
Binarization is pretty straightforward. Parts under the mask are excluded from the result. The threshold is calculated (in STU; I didn't check the others) as the Otsu threshold of the grayscale histogram of the image, plus a value adjustment that may be set by the user in the page options (binarization threshold slider), bounded to [30, 225]. Author's notes:
// Hard-bounding threshold values is necessary for example
// if all the content went into the picture mask.
I think this is for the case where the Otsu threshold returns an extreme value.
The result is stored as the binarized_and_cropped debug image.
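For completeness, a standard histogram-based Otsu threshold looks like the sketch below: it picks the gray level that maximizes the between-class variance of the two resulting pixel classes. This is my own sketch, not ST's implementation (which may differ in details); the user's slider adjustment and the [30, 225] clamping described above are applied on top of a value like this.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Otsu's method over a 256-bin histogram: for each candidate threshold t,
// split pixels into background (levels <= t) and foreground (levels > t)
// and maximize wBg * wFg * (meanBg - meanFg)^2, the between-class variance.
int otsuThreshold(const std::vector<std::size_t>& hist) {
    std::size_t total = 0;
    double sumAll = 0;
    for (int i = 0; i < 256; ++i) {
        total += hist[i];
        sumAll += double(i) * hist[i];
    }

    double sumBg = 0, bestVar = -1;
    std::size_t wBg = 0;
    int best = 0;
    for (int t = 0; t < 256; ++t) {
        wBg += hist[t];
        if (wBg == 0) continue;
        const std::size_t wFg = total - wBg;
        if (wFg == 0) break;
        sumBg += double(t) * hist[t];
        const double meanBg = sumBg / wBg;
        const double meanFg = (sumAll - sumBg) / wFg;
        const double var =
            double(wBg) * double(wFg) * (meanBg - meanFg) * (meanBg - meanFg);
        if (var > bestVar) { bestVar = var; best = t; }
    }
    return best;
}
```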
/// continue processing in a parent function ///
So now we have a binarized image without illustrations in binarized_and_cropped. Note: it's b/w now, not grayscale.
In case smoothing isn't disabled in "STU/Tools/Settings/Output/Black & White Mode/Disable smoothing", the binarized_and_cropped image is smoothed and stored in the edges_smoothed debug image. This smoothing was developed specifically for smoothing letter edges, and its implementation can be found in OutputGenerator::morphologicalSmoothInPlace(). As I understand it, 6 patterns (and their rotations) are applied to the edges of the letters.
The resulting binarized image (whether smoothed or not) is despeckled. The result is shown in the despeckled debug image. The big_components_unified and voronoi debug images are part of despeckling, but I won't go into its details.
The norm_illum_color image is combined with the bw mask. Now we have illustrations and binarized content in a single image. At this point the true black and true white pixels are reserved. Author's commentary:
/**
* In picture areas we make sure we don't use pure black and pure white colors.
* These are reserved for text areas. This behavior makes it possible to
* detect those picture areas later and treat them differently, for example
* encoding them as a background layer in DjVu format.
*/
Fill zones are applied.
...
PROFIT!!!
Hi @trufanov-nok .
Yes, this is a decent description. I did not understand the purpose of some of the listed procedures in the code; one can already work with such a description. The rest of the details are in the code and don't need to be written down. Move this description into a separate text file and put it in the repo.
I don't think this description should be part of the source code as a file. I may move it to Discussions.
@trufanov-nok say:
I don't think this description should be a part of sourcecode as a file.
Why? It's excellent reference material for both issues and discussions. A foundation. Why leave everything hanging in the air again?
I see the need to publish the applied binarization algorithm in a pseudo-mathematical, algorithmic form.
The fact is that this algorithm determines the restrictions on further classification of individual parts of the image. At the moment the classification is the simplest: text / illustration. But this classification can be expanded, and to do that one must see the limitations of the binarization algorithm.