wanghaisheng / awesome-ocr

A curated list of promising OCR resources
http://wanghaisheng.github.io/ocr-arxiv-daily/
MIT License
1.66k stars 351 forks

Text-line normalization #85

Closed wanghaisheng closed 5 years ago

wanghaisheng commented 6 years ago

Text-Line Normalization

The relative position and scale of individual characters in a text-line are important features for Latin and many other scripts. Normalizing text-lines makes this information consistent across all text-lines in a given database. Many normalization methods have been proposed in the literature; those used for the experiments reported in this thesis are described in the sections below.

B.1 Image Rescaling

Image rescaling is the simplest way to make the heights of all images in a database equal. For a desired image height, a scale is calculated as follows:

scale = target_height / actual_height

The width of the “normalized” image is then obtained by multiplying this scale by the width of the actual image:

target_width = scale × actual_width

This normalization is used in the current thesis for some of the OCR experiments reported for Urdu Nastaleeq script.

B.2 Zone-Based Normalization

Characters in many scripts, such as Latin, Greek and Devanagari, follow certain typographic rules. A text-line in such scripts can be divided into three zones. The baseline passes through the bottom of the majority of the characters, and the mean-line lies at the middle height between the baseline and the top edge of the text-line. Most small characters, like ‘x’, ‘s’, and ‘o’, lie between these two lines. The portion of a character that extends above the mean-line is termed the ‘ascender’, and the portion that extends below the baseline is termed the ‘descender’. The zone between the baseline and the mean-line is the middle-zone, the zone below the baseline is the bottom-zone, and the zone above the mean-line is the top-zone. A sample text-line in Devanagari script with these three zones is shown in Figure B.1.

Rashid et al. [Ras14] proposed a text-line normalization method which uses the above-mentioned three zones.
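Returning briefly to the rescaling rule of B.1, the two formulas can be written out as a minimal sketch. The function name `rescaled_size` and the rounding of the width to a whole pixel are assumptions made for illustration; the text only gives the two equations.

```python
def rescaled_size(actual_height, actual_width, target_height):
    """Return (scale, target_width) for the B.1 rescaling rule.

    Rounding the width to an integer pixel count is an assumption;
    the source only states target_width = scale * actual_width.
    """
    if actual_height <= 0 or actual_width <= 0 or target_height <= 0:
        raise ValueError("all sizes must be positive")
    scale = target_height / actual_height
    target_width = max(1, round(scale * actual_width))
    return scale, target_width


# A 32 x 400 pixel line rescaled to a height of 48 pixels
scale, width = rescaled_size(32, 400, 48)
print(scale, width)
```

Because one scale is applied to both axes, the aspect ratio of the text-line is preserved.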
Statistical analysis is carried out to estimate these zones in an image, and each zone is then rescaled to a specific height using the simple rescaling described in the previous section. This normalization method has been employed for the experiments reported for Devanagari script in this thesis.

B.3 Token-Dictionary based Normalization

This text-line normalization method is based on a dictionary composed of connected component shapes and the associated baseline and x-height information. The dictionary is pre-computed from a large sample of text-lines, with baselines and x-heights derived from the alignment of the text-line images with textual ground-truth, together with information about the position of Latin characters relative to the baseline and x-height. Note that for some shapes (e.g., p/P, o/O), the baseline and x-height information may be ambiguous; the information is therefore stored in the form of probability densities given a connected component shape. The connected components do not need to correspond to characters; they might be ligatures or frequently touching character pairs like “oo” or “as”.

To measure the baseline and x-height of a new text-line, the connected components are extracted from the text-line and the associated probability densities for the baseline and x-height locations are retrieved. These densities are then mapped and locally averaged across the entire line, resulting in a probability map for the baseline and x-height across the entire text-line. Maps of the x-height and baseline of an example text-line (Figure B.2-(a)) are shown in Figure B.2-(b) and (c) respectively. The resulting densities are then fitted with curves, which are used as the baseline and x-height for line size normalization. In line size normalization, the (possibly curved) baseline and x-height lines are mapped to two straight lines in a fixed-size output text-line image, with the pixels in between them rescaled using a spline transformation.
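The final mapping step can be illustrated with a small NumPy sketch. It assumes the fitted baseline and x-height curves are already available as one row coordinate per column, and it swaps the spline transformation for plain linear interpolation between neighbouring rows to stay short. The function name `straighten_line` and the output rows `out_xline` and `out_baseline` are hypothetical choices, not values from the text, and the width is left unchanged.

```python
import numpy as np

def straighten_line(img, baseline, xline, out_height=48,
                    out_xline=16, out_baseline=32):
    """Map curved x-height/baseline curves onto two straight output rows.

    img      : 2-D array (rows x columns), background = 0
    baseline : per-column row coordinate of the baseline
    xline    : per-column row coordinate of the x-height line
    Linear interpolation is used here instead of a spline (simplification).
    """
    img = np.asarray(img, dtype=float)
    h, w = img.shape
    baseline = np.asarray(baseline, dtype=float)
    xline = np.asarray(xline, dtype=float)

    # For every output row, the fraction of the way from x-line to baseline
    rows = np.arange(out_height, dtype=float)[:, None]
    t = (rows - out_xline) / (out_baseline - out_xline)

    # Source row for each output pixel: out_xline -> xline, out_baseline -> baseline
    src = xline[None, :] + t * (baseline - xline)[None, :]

    lo = np.floor(src).astype(int)
    frac = src - lo
    cols = np.arange(w)[None, :]

    def sample(r):
        # Rows outside the input image are treated as background (0)
        valid = (r >= 0) & (r < h)
        return np.where(valid, img[np.clip(r, 0, h - 1), cols], 0.0)

    return (1.0 - frac) * sample(lo) + frac * sample(lo + 1)
```

Each column is stretched independently, so a baseline that rises and falls across the line ends up on a single output row.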
This method of text-line normalization has been used in the experiments for English and Fraktur.

B.4 Filter-based Normalization

The zone-based and token-dictionary methods work satisfactorily for scripts where either the baseline and x-height information is easily estimated or where segmentation can be used to extract individual characters. They fail to perform reasonably for Urdu Nastaleeq script, where neither the baseline nor the segmentation is trivial to estimate. The filter-based normalization method does not depend on estimating the baseline or on extracting individual characters. It is based on simple filter operations and an affine transformation, which makes it a script-independent normalization method, in contrast to the normalization process described in the previous section, which was based on the shapes of the Latin alphabet.

The complete normalization process is shown in Figure B.3. The input text-line image is first inverted and smoothed with a large Gaussian filter, which captures the global structure of the underlying contents. As shown in Figure B.3-(a), the smoothed image has its maximum values near the center of the image along the vertical axis. These points are then fitted with a straight line (in practice, we smooth the line passing through these points as well). This is the line around which the whole text-line is rescaled using an affine transformation. First, a zone is found according to the difference between the height of the input image and the center line. Next, to make sure that the final normalized image contains all the contents without clipping, the image is expanded above and below the center line by an amount equal to the height of the image. This padded image is then cropped using the zone measurement found previously. Finally, the image is scaled to the required height using an affine transformation.
The width of the final image is calculated by multiplying the original width by the ratio of the “target” height to the height of the dewarped image. The only tunable parameter in this method is the target height; all other parameters are calculated from the given image itself. This text-line normalization is used for the work reported on Urdu Nastaleeq, historical Latin, and multilingual documents. Some images normalized with this method are shown in Figure B.4.
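The filter-based steps above can be sketched in NumPy under several assumptions that the text does not fix. The Gaussian widths are tied to the line height. The zone is taken as `zone_factor` times the mean vertical distance of ink pixels from the center line, which is only one possible reading of "the difference between the height of the input image and the center line". The final scaling uses nearest-neighbour sampling rather than a full affine transform. The names `blur_1d`, `filter_normalize` and `zone_factor` are hypothetical.

```python
import numpy as np

def blur_1d(v, sigma):
    """Gaussian-blur a 1-D signal (zero padding, kernel cut at 3 sigma)."""
    radius = max(1, int(3 * sigma))
    x = np.arange(-radius, radius + 1)
    k = np.exp(-0.5 * (x / sigma) ** 2)
    k /= k.sum()
    # Full convolution, then keep the part aligned with the input
    return np.convolve(v, k, mode="full")[radius:radius + len(v)]

def filter_normalize(img, target_height=48, zone_factor=4.0):
    """img: 2-D array in [0, 1], dark ink on a light background, with at
    least two columns and some ink. Returns the line with ink = high values."""
    line = 1.0 - np.asarray(img, dtype=float)   # invert: ink becomes bright
    h, w = line.shape

    # Smooth with a large Gaussian (sigmas are assumed, not from the text)
    smoothed = np.apply_along_axis(blur_1d, 0, line, 0.5 * h)
    smoothed = np.apply_along_axis(blur_1d, 1, smoothed, 1.0 * h)

    # Per-column maximum, then a straight line fitted through those points
    peaks = np.argmax(smoothed, axis=0)
    cols = np.arange(w)
    slope, intercept = np.polyfit(cols, peaks, 1)
    center = np.clip(np.rint(slope * cols + intercept), 0, h - 1).astype(int)

    # Zone half-height (assumed rule), capped at the image height
    dist = np.abs(np.arange(h)[:, None] - center[None, :])
    ink = line > 0.5 * line.max()
    r = min(h, int(1 + zone_factor * dist[ink].mean()))

    # Pad by the image height above and below, then crop around the center
    padded = np.vstack([np.zeros((h, w)), line, np.zeros((h, w))])
    c = center + h
    dewarped = np.stack([padded[c[i] - r:c[i] + r, i] for i in range(w)], axis=1)

    # Scale to the target height; the width follows from the same ratio
    scale = target_height / dewarped.shape[0]
    target_width = max(1, int(round(scale * w)))
    rows = np.minimum((np.arange(target_height) / scale).astype(int), 2 * r - 1)
    cols_out = np.minimum((np.arange(target_width) / scale).astype(int), w - 1)
    return dewarped[np.ix_(rows, cols_out)]
```

As in the description, the target height is the only parameter a caller has to choose; `zone_factor` exists only because this sketch had to pick a concrete zone rule.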