
Adnan Ul-Hasan's PhD thesis - Chapter 4: Training Data #7

Closed wanghaisheng closed 7 years ago

wanghaisheng commented 8 years ago

Benchmark Datasets for OCR

Numerous character recognition algorithms require sizable ground-truthed real-world data for training and benchmarking. The quantity and quality of training data directly affect the generalization accuracy of a trainable OCR model. However, developing GT data manually is overwhelmingly laborious, as it takes a great deal of effort to produce a reasonable database that covers all possible words of a language. Transcribing historical documents is even more gruelling, as it requires language expertise in addition to manual labelling effort. The increased human effort raises the financial cost of developing such datasets and can restrict the development of large-scale annotated databases for OCR. It has been pointed out in the previous chapter that scarcity of training data is one of the limiting factors in developing reliable OCR systems for many historical as well as for some modern scripts. The challenge of limited training data has been addressed by the following contributions of this thesis:

• A semi-automated methodology to generate the GT database for cursive scripts at ligature level has been proposed. This methodology can equally be applied to produce character-level GT data. Section 4.2 reports the specifics of this method for cursive Nabataean scripts.

• Synthetically generated text-line databases have been developed to support OCR research. These datasets include a database for Devanagari script (Deva-DB), a subset of printed polytonic Greek script (Polyton-DB), and three datasets for Multilingual OCR (MOCR) tasks. Section 4.3 details this process and describes the fine points of these datasets.

4.1 Related Work

There are basically two types of methodologies that have been proposed in the literature. The first is to extract identifiable symbols from the document image and apply clustering methods to create representative prototypes. These prototypes are then assigned text labels. The second approach is to synthesize the document images from the textual data. These images are degraded using various image defect models to reflect scanning artifacts. These degradation models [Bai92] include resolution, blur, threshold, sensitivity, jitter, skew, size, baseline, and kerning. Some of these artifacts are discussed in Section 4.3, where they are used to generate text-line images from text.

The use of synthesized training data is increasing and many datasets reported in the literature use this methodology. One prominent dataset of this type is the Arabic Printed Text Images (APTI) database, proposed by Slimane et al. [SIK+09]. This database is synthetically generated, covering ten different Arabic fonts and as many font sizes (ranging from 6 to 24). It is generated from various Arabic sources and contains over 1 million words. The number increases to over 45 million words when rendered using ten fonts, four styles and ten font sizes. Another example of a synthetic text-line image database is the Urdu Printed Text Images (UPTI) database, published by Sabbour and Shafait [SS13]. This dataset consists of over 10 thousand unique text-lines selected from various sources. Each text-line is rendered synthetically with various degradation parameters, so the actual size of the database is quite large. The database contains GT information at both text-line and ligature levels.
The second approach to automating the process of generating an OCR database from scanned document images is to find the alignment of the transcription of the text-lines with the document image. Kanungo et al. [KH99] presented a method for generating character GT automatically for scanned documents. A document is first created electronically using any typesetting system. It is then printed out and scanned. Next, the corresponding feature points from both versions of the same document are found and the parameters of the transformation are estimated. The ideal GT information is transformed accordingly using these estimates. An improvement to this method was proposed by Kim and Kanungo [KK02] using an attributed branch-and-bound algorithm. Van Beusekom et al. [vBSB08] proposed a robust and pixel-accurate alignment method. In the first step, the global transformation parameters are estimated in a similar manner as in [KK02]. In the second step, the adaptation of smaller regions is carried out.

Pechwitz et al. [PMM+02] presented the IfN/ENIT database of handwritten Arabic names of cities along with their postal codes. A projection-profile method is used to extract the words and postal codes automatically. Mozaffari et al. [MAM+08] developed a similar database (IfN/Farsi-database) for handwritten Farsi (Persian) names of cities. Sagheer et al. [SHNS09] also proposed a similar methodology for generating an Urdu database for handwriting recognition.

Vamvakas et al. [VGSP08] proposed that a character database for historical documents may be constructed by choosing a small subset of images and then using character segmentation and clustering techniques. This work is similar to our approach; the main differences are the use of a different segmentation technique for Urdu ligatures and the use of a different clustering algorithm.

wanghaisheng commented 8 years ago

4.2 Semi-automated Database Generation

This section describes an approach to automate the process of OCR database preparation from scanned documents. It is believed that the proposed automation will greatly reduce the manual effort required to develop OCR databases for cursive scripts. The basic idea is to apply ligature clustering prior to manual labeling. Two prototype datasets for Urdu Nastaleeq script have been developed using the proposed technique.

Urdu belongs to the family of cursive scripts in which words mainly consist of ligatures. Ligatures are formed by joining individual characters, and the shape of a character in a ligature depends on its position within the word. Moreover, there are dots and diacritics that are associated with certain characters. Each ligature in Urdu is separated from other ligatures or from its own diacritics by vertical, horizontal or diagonal (slanted) space. The properties of this script, along with the challenges it poses for OCR, have been described in Section 3.6. It is assumed in the context of the present section that the smallest unit of the script is a ligature, which may be either a combination of several characters or a single non-joinable character. There are around 26,000 ligatures in Urdu Nastaleeq script, and a reasonable database must cover all of them.

The method to semi-automatically generate a database for Urdu Nastaleeq starts with binarization as the preprocessing step. Urdu ligatures are then extracted from the text images. These ligatures are then clustered prior to manual labeling of the correct ligatures. The following sub-sections present a detailed view of the proposed method.

wanghaisheng commented 8 years ago

4.2.1 Preprocessing

Binarization is the only preprocessing step in the proposed method; however, skew detection and correction may be included as further preprocessing steps. A local thresholding technique [SP00] is used for binarization. The fast implementation of this algorithm proposed by Shafait et al. [SKB08] has been used to speed up the process. Two parameters, namely the local window size and the k-parameter, need to be set empirically according to the documents. The local window size is taken to be 70 × 70 and the k-parameter is set to 0.3.

4.2.2 Ligature Extraction

Ligature extraction may be carried out in two ways: one is to apply the ligature extraction algorithm directly on the binarized image, while the second is to extract text-lines before applying ligature extraction. The former is suitable where documents are clean and have well-defined text-line spacing (see Figure 4.1-(a)); the latter is suitable when text-lines are not well separated in the documents (Figure 4.1-(b)), and in the case of degraded historical documents. Narrow line separation results in poor connected component analysis, so that many ligatures from the text-lines above and below merge together. The decision to apply text-line segmentation is taken on the basis of line spacing in a particular document.

Ligature extraction starts by applying connected component analysis. The list of connected components is first divided into two parts: base components and diacritics (including dots). This division is based on the connected components' height, width, and the ratio of the two. In the context of this chapter, font variations are not considered and the primary focus is to cover the fonts typically used in Urdu books and magazines. Therefore, the thresholds for separating main ligatures and diacritics are set empirically on the primary font size and remain the same for all the document images in our dataset.

It is not possible to separate Urdu ligatures by a single threshold value. Therefore, different thresholds have been employed according to the properties of a particular ligature. For a ligature consisting of a single alif, the average height-to-width ratio is 4.0 and the average width of this ligature is around 6 pixels. For ligatures like bay, tay and say, the average height-to-width ratio is 0.4 and the average width is around 30 pixels. For all other ligatures, it is sufficient to employ a width greater than 10 pixels.

If there are no diacritics in a ligature, no further processing is needed. However, if one or more diacritics are present, they must be associated with the base component to completely extract a ligature. Diacritics are searched for in the neighborhood of a base component by extending the bounding box of the base connected component. This window size depends on the font size; since only documents with the dominant font size have been used, the window is set according to that font size. Presently, the bounding box of the base component is extended by 15 pixels on the top and bottom and by 10 pixels on the right and left. Ligatures extracted in this manner are then saved to a database file for further clustering and labeling.
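Below is a minimal sketch of the preprocessing and ligature-extraction stage, assuming scikit-image is available. The thesis uses the fast Sauvola implementation of Shafait et al. inside OCRopus; `threshold_sauvola` here is only a stand-in (window 70 × 70 and k = 0.3 as stated above; scikit-image needs an odd window, hence 71), and the base-versus-diacritic rule is one possible reading of the thresholds quoted above.

```python
# Sketch only: not the thesis implementation.
import numpy as np
from skimage.filters import threshold_sauvola
from skimage.measure import label, regionprops


def binarize(gray):
    """Return a boolean image with text (ink) pixels set to True."""
    thresh = threshold_sauvola(gray, window_size=71, k=0.3)
    return gray < thresh  # dark ink on light paper


def split_components(binary):
    """Separate connected components into base (main-ligature) components and
    diacritics/dots, using width/height heuristics like those quoted above."""
    bases, diacritics = [], []
    for region in regionprops(label(binary)):
        top, left, bottom, right = region.bbox
        h, w = bottom - top, right - left
        ratio = h / max(w, 1)
        # A lone alif is tall and narrow, flat ligatures (bay/tay/say) are wide,
        # everything else counts as a base if it is wider than 10 pixels.
        is_base = (ratio >= 4.0 and w >= 6) or (ratio <= 0.4 and w >= 30) or w > 10
        (bases if is_base else diacritics).append(region.bbox)
    return bases, diacritics


def attach_diacritics(base_bbox, diacritics, dy=15, dx=10):
    """Collect diacritics whose boxes fall inside the base box grown by
    15 px vertically and 10 px horizontally (the window used in the thesis)."""
    top, left, bottom, right = base_bbox
    window = (top - dy, left - dx, bottom + dy, right + dx)
    return [d for d in diacritics
            if d[0] >= window[0] and d[1] >= window[1]
            and d[2] <= window[2] and d[3] <= window[3]]
```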
4.2.3 Clustering

As mentioned above, due to the huge number of ligatures present in a cursive script, labeling individual ligatures becomes highly impractical. Hence, it is proposed that the extracted ligatures be clustered according to similar shapes. For this purpose, the epsilon-net clustering technique is employed. By simply changing the value of epsilon, the number of clusters can be controlled. The value of epsilon is set empirically to obtain a moderate number of clusters, so that they can be managed easily during the manual validation step. The features used for the epsilon clustering are the bitmaps of the ligatures. Moreover, this method is considerably faster than k-means clustering.
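A minimal sketch of the epsilon-net idea over ligature bitmaps is given below (a hypothetical helper, not the OCRopus implementation); it assumes all ligature images have already been scaled to a common size and flattened into float vectors.

```python
import numpy as np


def epsilon_net_cluster(bitmaps, epsilon):
    """Greedy epsilon-net: each bitmap joins the first cluster whose
    representative is within `epsilon` (Euclidean distance); otherwise it
    becomes the representative of a new cluster."""
    representatives = []   # one prototype bitmap per cluster
    clusters = []          # lists of member indices
    for idx, bmp in enumerate(bitmaps):
        assigned = False
        for c, rep in enumerate(representatives):
            if np.linalg.norm(bmp - rep) <= epsilon:
                clusters[c].append(idx)
                assigned = True
                break
        if not assigned:
            representatives.append(bmp)
            clusters.append([idx])
    return representatives, clusters
```

Unlike k-means, the number of clusters is not fixed in advance but falls out of the chosen epsilon, and a single greedy pass over the data suffices, which is why this method is faster here.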

4.2.4 Ligature Labeling

The next step is to verify the clusters and modify them manually, if needed. The OCRopus framework provides a proficient graphical user interface to do this without much hassle. It is also possible that the clustering divides a single ligature into more than one cluster (see Figure 4.2-(a)); hence one needs to merge different clusters to save time at a later stage of labeling. Moreover, one can also modify the cluster-division step so as to retain only valid members (those with the same label as the representative), assign a null class to incorrect members, and then apply further iterations of clustering on the null class. In the current work, merging of same-ligature clusters precedes the manual labeling and only a single clustering iteration is employed. After this verification step, each cluster is examined individually to identify invalid clusters, which are then discarded. Again, the OCRopus framework is used for this purpose (see Figure 4.2-(b)). At the end of this labeling process, we have a database whose entries carry the following information about a ligature:

• The image file name from which the ligature was originally extracted.
• The bounding box giving the location of the ligature in the document image.
• The Unicode string corresponding to the characters forming the ligature.
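As a minimal sketch, one entry of the resulting database could be represented as below; the field names are illustrative, and the actual on-disk format produced with OCRopus may differ.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class LigatureEntry:
    image_file: str                   # page image the ligature was cut from
    bbox: Tuple[int, int, int, int]   # location of the ligature on that page
    text: str                         # Unicode string for the ligature


# Hypothetical example entry (lam + alif).
entry = LigatureEntry("page_0001.png", (120, 340, 168, 395), "\u0644\u0627")
```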

wanghaisheng commented 8 years ago

4.2.5 Experiments and Results

This section describes the experimental setup and the evaluation of the results. Two prototype datasets for Urdu script have been developed using the proposed technique. One dataset consists of clean documents such as that shown in Figure 4.1-(a). At present, only 20 such document images have been used. Here, this dataset is referred to as DB-I. The second dataset (referred to as DB-II) consists of 15 documents written by a calligrapher, such as that shown in Figure 4.1-(b). An important property of calligraphic documents is that the shape of a ligature does not remain identical within the document, and minor differences in shape may persist throughout the document. GT information about DB-II is available and is used to gauge the accuracy of the line segmentation algorithm. These two datasets were chosen to evaluate upper and lower bounds on the performance of the proposed algorithm.

The performance evaluation metric used in the present work is ligature coverage, which refers to the number of ligatures in the dataset that are correctly labeled by the clustering step, followed by the manual validation step.

The ligature extraction algorithm is able to find 16,857 ligatures in the DB-I database. The epsilon-net based clustering then groups these ligatures into 778 clusters. Each individual cluster is subsequently examined to verify the clustering, and Unicode values are assigned to the clusters. Invalid ligatures are discarded at this point. The ligature coverage achieved by this process is 82.3%. This high ligature coverage is due to sufficient line spacing and non-touching ligatures.

The inherent difficulty with any method based on connected-component analysis is its poor accuracy in the case of overlapping lines and touching ligatures. To address this problem in the DB-II dataset, a line segmentation algorithm [BSB11] is employed. The segmentation accuracy of this algorithm is over 90%. The second problem, touching ligatures, may be mitigated by using more sophisticated techniques. However, fine separation of individual ligatures is not essential here, as errors may be corrected at the later manual labeling stage; hence this problem is not tackled in this work.

From the DB-II database, a total of 18,914 ligatures are extracted. The clustering of these ligatures resulted in 1,132 clusters. After the labeling process, the total ligature coverage is around 62.7%. Inconsistency in the shapes of ligatures due to the calligrapher's handwriting results in poor clustering accuracy for the DB-II dataset. In this case, simple shape-based clustering methods do not work sufficiently well and other methods need to be explored.
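For clarity, the ligature coverage reported above is simply the share of extracted ligatures that survive clustering and manual validation with a correct label, e.g.:

```python
def ligature_coverage(correctly_labelled: int, extracted: int) -> float:
    """Percentage of extracted ligatures correctly labeled after clustering
    and manual validation."""
    return 100.0 * correctly_labelled / extracted

# Reported figures: DB-I  -> 16,857 extracted, 778 clusters,   82.3% coverage
#                   DB-II -> 18,914 extracted, 1,132 clusters, ~62.7% coverage
```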

wanghaisheng commented 8 years ago

4.2.6 Conclusion

A semi-automated methodology is proposed to generate a GT database from scanned documents for cursive scripts at ligature level. The same methodology can be used for rapid generation of character-level datasets for other scripts as well. One unsatisfactory aspect of this methodology is the use of heuristics to extract ligatures from the document images. These heuristics need to be adapted accordingly for other scripts. It is also observed that the performance of this method is directly affected by the choice of segmentation method and the quality of the document images. The second approach to developing a large-scale GT OCR database is to use image degradation models. This approach is described in the following section.

4.3 Synthetic Text-Line Generation

The use of artificial data is becoming popular in the computer vision domain for object recognition. A similar path is taken in this thesis to address the issue of limited GT data. Baird [Bai92] proposed several degradation models to generate artificial data from text (ASCII) form. There are many parameters that can be altered to make the artificially generated text-line images closely resemble those obtained from a scanning process. Some of the significant parameters are:

Blur: the pixel-wise spread in the output image, modeled as a circular Gaussian filter.
Threshold: used to distort the image by randomly removing text pixels. If a pixel value is greater than this threshold, it is treated as a background pixel.
Size: the height and width of individual characters in the image, modeled by image scaling operations.
Skew: the rotation angle of the output symbol. The resulting image is skewed to the right or left by specifying the 'skew' parameter.

In this thesis, a utility based on these degradation models from OCRopus [OCR15] (an open-source OCR framework) is used to generate the artificial data. This OCRopus utility requires UTF-8-encoded text-lines, along with TTF font files, to generate the corresponding text-line images. The process of line-image generation is shown in Figure 4.3. The user can specify the parameter values or use the default values.
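A minimal sketch of this kind of synthetic line generation, assuming Pillow and NumPy and using only a few of the parameters listed above (blur, threshold, skew), is shown below. The thesis uses the OCRopus line-generation utility itself; this version only illustrates the idea, and the font path is a placeholder.

```python
import numpy as np
from PIL import Image, ImageDraw, ImageFont, ImageFilter


def render_degraded_line(text, font_path, font_size=32,
                         blur_radius=1.0, threshold=0.5, skew_deg=0.5):
    """Render a UTF-8 text-line with a TTF font and apply simple degradations."""
    font = ImageFont.truetype(font_path, font_size)
    # Render black text on a white canvas with some padding.
    width = int(font.getbbox(text)[2]) + 20
    img = Image.new("L", (width, font_size * 2), 255)
    ImageDraw.Draw(img).text((10, font_size // 2), text, font=font, fill=0)
    # Blur: circular Gaussian spread of the ink.
    img = img.filter(ImageFilter.GaussianBlur(blur_radius))
    # Skew: small rotation of the whole line.
    img = img.rotate(skew_deg, expand=True, fillcolor=255)
    # Threshold: pixels brighter than the threshold become background.
    arr = np.asarray(img, dtype=np.float32) / 255.0
    return (arr > threshold).astype(np.uint8) * 255


# line = render_degraded_line("some UTF-8 text", "/path/to/font.ttf")  # font path is a placeholder
```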

wanghaisheng commented 8 years ago

4.4 OCR Databases

This section lists various datasets that have been developed using the synthetic generation process described in the previous section. These datasets are freely available for research purposes and can be obtained from the author.

4.4.1 Deva-DB – OCR Database for Devanagari Script

A new database is presented to advance OCR research on Devanagari script. It can provide a common platform for researchers in this domain to benchmark their algorithms. This database, named Deva-DB, consists of two parts. The first part comprises text-lines taken from the work of Setlur et al. [KNSG05]. In that work, the GT information is represented in transliterated form; we have transcribed 621 text-lines manually in standard Unicode form. This data is used for evaluation purposes only, that is, to test the LSTM model trained with the artificial data. The second part of this database consists of synthetically generated text-line images produced with the OCRopus framework. These text-lines were chosen from various online resources covering the fields of current affairs, religion, classical literature and science. Some sample images from this database are shown in Figure 4.4.

To check the quality of the training set, a comparison is made between the character and word statistics of this set and those published by Chaudhuri et al. [CP97]. The ten most frequently used characters in Hindi, based on three million Hindi words, are shown in Table 4.1. We collected similar statistics for our training data, based on approximately one million words. As seen from Table 4.1, the relative frequency of the characters in the proposed training set is similar to that reported by Chaudhuri et al. [CP97]. IIIT Hyderabad, India [III14] has published word frequencies from a Hindi language corpus containing three million words. The proposed training set also matches the top ten frequent words of that work.
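A minimal sketch of how such character statistics can be collected from a UTF-8 corpus of text-lines is given below (the corpus file name is a placeholder); the resulting top-ten list is what is compared against the Chaudhuri et al. figures in Table 4.1.

```python
from collections import Counter


def char_frequencies(corpus_path, top_n=10):
    """Return the top_n characters and their relative frequencies (percent)
    from a UTF-8 text file containing one text-line per line."""
    counts = Counter()
    with open(corpus_path, encoding="utf-8") as f:
        for line in f:
            counts.update(ch for ch in line.strip() if not ch.isspace())
    total = sum(counts.values())
    return [(ch, 100.0 * n / total) for ch, n in counts.most_common(top_n)]


# print(char_frequencies("deva_train_lines.txt"))  # corpus file name is a placeholder
```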

wanghaisheng commented 8 years ago

4.4.2 Polyton-DB – Greek Polytonic Database

A collection of transcribed data is available under the OldDocPro project [GSL+] for the recognition of machine-printed and handwritten polytonic documents. The problem of obtaining a large amount of transcribed training data is overcome by the use of synthetic data (generated using the OCRopus framework). The contribution of this thesis is the introduction of synthetic text-line images as part of the freely available Greek Polytonic Database, called Polyton-DB, which includes printed polytonic Greek documents from three main sources:

Greek Parliament Proceedings: This part (a sample is shown in Figure 4.5-a) contains 3,203 text-line images taken from 33 scanned pages of the Greek Parliament proceedings. The documents correspond to speeches of Greek politicians of various eras in the 19th and 20th centuries.

Greek Official Government Gazette: This part consists of 687 text-line images (a sample is shown in Figure 4.5-b), picked from five scanned pages of the Greek Official Government Gazette.

Appian's Roman History: 315 documents from Appian's Roman History (written in Greek before AD 165) are used for this part (a sample is shown in Figure 4.5-c). It contains 11,799 text-line images. The Appian's Roman History scans are clean images, which is not the usual case with historical documents. Moreover, the writing style is different from the other two sources. To better train the OCR models for historical documents, synthetically degraded text-lines are generated from this corpus.

wanghaisheng commented 8 years ago

4.4.3 Databases for Multilingual OCR (MOCR)

MOCR is a relatively new field of research and there are not many publicly available databases for testing OCR algorithms in this regard. Moreover, efforts so far have focused on script identification rather than complete processing of such documents. Most of the reported works have utilized datasets that are either private or no longer available. This thesis makes three main contributions for testing MOCR algorithms.

• A database to gauge the cross-language performance of any OCR algorithm. The text-lines in this corpus are generated synthetically using the procedure described in Section 4.3. This database covers three European languages, namely English, German and French, plus a mixed set combining all three. It consists of 370,799 text-line images (the number of text-lines in each language is given in Table 4.2). Some sample text-lines from this versatile database are shown in Figure 4.6.

• A large set of text-lines used to measure the performance of the script identification methodology (reported in Chapter 7) has been generated artificially. Different text corpora have been used to develop separate training and test data. There are 90,000 text-line images in the training set and 9,500 text-line images in the test set. The GT information is available in the form of both the actual text and the assigned class labels.

• To validate the hypothesis of the generalized OCR framework (reported in Chapter 8), a database from an English-Greek bilingual document has been created using the synthetic text-line image protocol. A total of 90,000 text-line images for the training phase and 9,900 text-line images for evaluation purposes have been generated.