wanghaisheng / awesome-ocr

A curated list of promising OCR resources
http://wanghaisheng.github.io/ocr-arxiv-daily/

Adnan Ul-Hasan's PhD Thesis, Chapter 6: OCR of Historical Documents #82

Closed wanghaisheng closed 5 years ago

wanghaisheng commented 6 years ago

Preserving the literary heritage is important to gain valuable insights into human history and knowledge about the different aspects of our ancestors' lives. Documents, whether written on leaves, stones, textiles, animal skin, or paper, have remained the eyes to the history of mankind. Old documents are very delicate and need extreme care. With the advancement in capturing technology, the cost of preserving old documents has decreased manyfold. Libraries and institutes around the globe have made valuable efforts in making digital copies of historical documents. However, navigating through these documents is still very difficult, as the scanned images are not suitable for searching and indexing. Automatic recognition of historical documents can help a paleographer¹ by indexing the old documents in a digital library. But one must contemplate that this is very challenging for the following reasons:

• These documents are usually found in very bad condition, so the quality of the text is not very high and is often highly degraded, with torn pages.
• The documents may have been written in an ancient script with archaic orthography. These primitive scripts render the documents very difficult to recognize by computer programs.
• Many modern text recognition algorithms, based on supervised ML, require a lot of transcribed training data. Transcribing documents by a human is very laborious and costly due to the involvement of language experts; therefore, it is not feasible to transcribe large enough data for historical scripts.

¹ Paleography is the study and scholarly interpretation of ancient writings and their various forms.

This chapter contributes in the following two ways to enhance research in the field of historical document digitization.

• Firstly, the LSTM-based OCR methodology described in the previous chapter is successfully applied to the Fraktur and Polytonic Greek scripts. On both scripts, the LSTM networks have been trained using synthetically generated data, and the results on scanned documents show that the LSTM-based methodology outperforms both the Tesseract and ABBYY OCR systems.
• Secondly, a novel approach is proposed to OCR scripts that lack large amounts of training data. This approach combines the segmentation-based and segmentation-free OCR paradigms to achieve this goal. Although only meagre GT data was available, the proposed approach has been shown to produce excellent OCR results without using any language models or other techniques.

The first part of this chapter, Section 6.1, reports the results of applying LSTM-based OCR to the Fraktur and Polytonic Greek scripts. The second half, Section 6.2, describes a novel framework that combines segmentation-based and segmentation-free approaches for 15th-century Latin script.

6.1 Fraktur and Polytonic Greek Scripts

Fraktur script, also known as Black-Letter or Gothic, was the script of writing in central Europe, especially in Germany, from the 16th to the 20th century. Polytonic Greek, on the other hand, remained the main script in Greece until the 1980s. A rich literary heritage of both countries is available in these scripts, and OCR research is actively being pursued under various digitization projects. This section describes the results of the experimental evaluation of the LSTM-based OCR methodology (see Section 5.1) for the Fraktur and Polytonic Greek scripts.

6.1.1 Related Work

There is not much literature available regarding the digitization of the Fraktur and Polytonic Greek scripts.
ABBYY has provided support for Fraktur OCR for a long time. Its core OCR module is based on a segmentation-based approach [Fuc], where a combination of classifiers recognizes a single letter. Furrer and Volk [FV11] improved both the character and word error rates using language modeling. Tesseract [Tes14] also provides OCR capability for the Fraktur script. White [Whi13] recently adapted the Tesseract OCR system and proposed many improvements to better train it for the Polytonic Greek script, showing how to incorporate language-related hints to improve performance. Boschetti et al. [BRB+09] compared three OCR engines, ABBYY FineReader 9.0, OCRopus 0.3 (based on Tesseract) and Anagnostis 4.1, for Polytonic Greek documents. They aligned the outputs of the three OCR systems using a progressive multiple alignment method to improve their results. Gatos et al. [GLS11] proposed a Polytonic Greek recognition system consisting of five modules. One module was dedicated to accent recognition, while the remaining four modules recognized characters belonging to various horizontal zones. In the first step, accents were identified and separated. The remaining characters were segmented using skeleton features of foreground and background pixels. The segmented characters were assigned to one of four zones identified by the baseline and the mean-line; in this manner, characters were divided into four categories. A normalization step was carried out on all five categories (modules) before using a k-NN classifier with k=3. They reported a character recognition accuracy of 90% on various Polytonic Greek documents belonging to the period between 1950 and 1965.

6.1.2 Database

Both Fraktur and Polytonic Greek lack standardized datasets to evaluate the performance of OCR algorithms. For OCR of the Fraktur script, LSTM training is carried out on 20,000 synthetically generated text-lines collected from various Fraktur sources. For ancient Greek, training is done using a combination of scanned and artificial data from Polyton-DB (see Section 4.4.2). To evaluate the performance of the LSTM models on Fraktur, sample images are taken from two scanned books, namely (i) Theodor Fontane's Wanderungen durch die Mark Brandenburg (a clean, high-resolution scan), and (ii) the Ersch-Gruber encyclopedia (a noisy, lower-resolution scan). The LSTM-based OCR models for old Greek are evaluated on various combinations of datasets; these combinations are listed in Section 6.1.4.

6.1.3 Experimental Evaluation for Fraktur Script

Tests have been performed on two scanned documents of Fraktur script; text in Antiqua (Latin) font was excluded from the evaluation. Since error rates were so low, error rates and Ground-Truth (GT) data could be determined quickly using a spell checker and verifying any flagged words against the source image. The text contains few digits and few punctuation marks, which yields good error estimates. On randomly selected pages from Fontane representing 8,988 Fraktur characters, the CER is 0.15%. On Ersch-Gruber, the CER is 1.37% on randomly selected page images representing 10,881 Fraktur characters. These results are obtained without a language model and without adaptation to the fonts found in these documents. Recognition results for Fraktur (for both Fontane and Ersch-Gruber) are also compared with other OCR systems (see Figure 6.1). The Tesseract system applied to these inputs yields CERs of 0.898% (Fontane) and 1.47% (Ersch-Gruber), using a German dictionary and font adaptations. The ABBYY commercial OCR system yields CERs of 1.23% on Fontane and 0.85% on Ersch-Gruber.
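The spell-checker-based verification described above can be illustrated with a short script. This is only a sketch of the idea, not code from the thesis: it assumes a plain-text word list (`german_wordlist.txt`) and a per-page OCR output file (`fontane_page_ocr.txt`), both hypothetical file names, and simply flags out-of-vocabulary words for manual comparison against the page image.

```python
# Flag OCR output words that are missing from a reference word list so a human
# can verify them against the scanned source image (illustrative sketch only).
import re
from pathlib import Path

def load_wordlist(path: str) -> set[str]:
    return {w.strip().lower()
            for w in Path(path).read_text(encoding="utf-8").splitlines()
            if w.strip()}

def flag_suspicious_words(ocr_text_path: str, wordlist: set[str]):
    """Yield (line_number, word) pairs for words not found in the word list."""
    for lineno, line in enumerate(Path(ocr_text_path).read_text(encoding="utf-8").splitlines(), start=1):
        for word in re.findall(r"[^\W\d_]+", line):      # runs of letters only
            if word.lower() not in wordlist:
                yield lineno, word

if __name__ == "__main__":
    words = load_wordlist("german_wordlist.txt")          # assumed dictionary file
    for lineno, word in flag_suspicious_words("fontane_page_ocr.txt", words):
        print(f"line {lineno}: check '{word}' against the page image")
```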
6.1.4 Experimental Evaluation for Polytonic Greek Script

To evaluate the LSTM-based line recognizer, three types of experiments have been carried out using different combinations of documents in Polyton-DB. As for Fraktur, the results have also been compared with Tesseract and ABBYY FineReader.

In the first experiment, the synthetic data of Appian's Roman History and the 687 images of the Greek Official Government Gazette have been used to train the LSTM engine (a total of 12,486 text-lines), while the text-lines of the Greek Parliament Proceedings are used as the test set (3,303 text-line images). It is important to note that in this setting the fonts in the training set are different from the four fonts of the test set. The LSTM recognizer yields a CER of 14.68% after 125,000 training iterations, as detailed in Figure 6.2.

In the second configuration, the training set includes the synthetic data of Appian's Roman History, the text-line images of the Greek Official Government Gazette, and the text-line images of three subsets of the Greek Parliament Proceedings (a total of 15,167 text-lines), while the text-lines of one subset of each of the above-mentioned sources are the test images (522 text-lines). In this experiment, the training data contain text written in five different fonts, while the test set includes one font that is unseen during training. A CER of 5.67% is observed on the test data (see Figure 6.3).

In the last experiment, the LSTM line recognizer is compared with the Tesseract and ABBYY OCR systems. For Tesseract, the training model for Greek polytonic script proposed by White [Whi13] is used. Regarding the ABBYY FineReader engine, we adapted it to the recognition of Greek polytonic scripts by adopting the procedure described in [SUHP+15]. The 367 text-lines from the Greek Official Government Gazette and the datasets of Appian's Roman History comprise the training data for the LSTM-based recognizer, while the remaining 2,836 text-lines of the Greek Parliament Proceedings were included in the test data. The results are presented in Figure 6.4. It is worth mentioning that the poor performance of Tesseract is mainly explained by the fact that the character degradation in the test set is very high, and the character segmentation introduces many mistakes that are propagated to the recognition stage. The results of Tesseract further strengthen the observation that many modern-day OCR systems do not generalize well on unseen data, which may be totally different from the data used during training. Regarding the LSTM-based recognizer, the training model with the lowest training error rate (0.16%) is produced after 138,000 iterations; this model gives a CER of 6.04% on the test set. The total number of characters in the test set is 169,568. Using the training model produced after 148,000 iterations, with a corresponding training error of 0.35%, reduces the CER on the test set from 6.05% to 5.51%. The most frequent errors of the LSTM recognizer are illustrated in Table 6.1. In particular, there are 318 deletion errors and 273 insertion errors out of 9,351 errors in total. Furthermore, there are a great number of errors where a letter is misclassified as the same letter with a different accent; for example, 94 occurrences of the letter Ἐ are erroneously classified as the letter Ἑ.

6.1.5 Conclusion

The results obtained on both Fraktur and ancient Greek demonstrate that LSTM-based approaches generalize much better to unseen data than previous ML approaches. In addition, during LSTM training, very low error rates on test data are observed, often long before one epoch of training has been completed, meaning that there has likely been no opportunity to "overtrain". These results also show that training from artificially generated data is highly successful. One does, however, need to take care that the generated artificial data resembles the scanned data well; otherwise, the performance will suffer. However, to synthetically generate training data in a particular script, one requirement is to possess some text in that script. This requirement cannot be fulfilled for ancient scripts where no such data is available. The next section reports the contribution of this thesis in dealing with the scarcity of GT training data for historical documents.
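As a rough illustration of how such artificial training lines can be produced, the sketch below renders a line of text in a Fraktur typeface and degrades it to look more like a scan. It is not the thesis' actual data generator; the font file name and the blur and noise parameters are placeholder assumptions, and Pillow and NumPy are assumed to be installed.

```python
# Render a text-line in a Fraktur font, normalize its height, and degrade it
# so the clean rendering resembles a scanned page (illustrative sketch only).
import numpy as np
from PIL import Image, ImageDraw, ImageFilter, ImageFont, ImageOps

def render_line(text: str, font_path: str, target_height: int = 48) -> Image.Image:
    """Render a text-line and scale it to a fixed height."""
    font = ImageFont.truetype(font_path, size=int(target_height * 0.75))
    canvas = Image.new("L", (40 * max(len(text), 1) + 40, 3 * target_height), color=255)
    ImageDraw.Draw(canvas).text((20, target_height), text, fill=0, font=font)
    box = ImageOps.invert(canvas).getbbox()            # tight box around the dark ink
    line = canvas.crop(box)
    scale = target_height / line.height
    return line.resize((max(1, int(line.width * scale)), target_height))

def degrade(line: Image.Image, blur: float = 0.8, noise_std: float = 12.0) -> Image.Image:
    """Blur and add Gaussian noise to imitate scanning artifacts."""
    arr = np.asarray(line.filter(ImageFilter.GaussianBlur(blur)), dtype=np.float32)
    arr += np.random.normal(0.0, noise_std, size=arr.shape)
    return Image.fromarray(np.clip(arr, 0, 255).astype(np.uint8))

if __name__ == "__main__":
    img = degrade(render_line("Wanderungen durch die Mark Brandenburg",
                              "UnifrakturMaguntia.ttf"))   # placeholder font file
    img.save("synthetic_line.png")
```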

wanghaisheng commented 6 years ago

6.2 OCRoRACT: A Sequence Learning OCR System Trained on Isolated Characters

As mentioned previously, contemporary ML approaches require a lot of transcribed training data in order to obtain satisfactory results. Transcribing documents manually is a laborious and costly task, requiring many human hours and language-specific expertise. This section presents a generic iterative training framework, named OCRoRACT, to address this issue. The proposed framework is not only applicable to historical documents but is also suitable for present-day documents for which manually transcribed training data is unavailable. Starting with the minimal information available, the proposed approach iteratively corrects the training and generalization errors. Specifically, a segmentation-based OCR method is trained on individual symbols and used to OCR a subset of documents. These semi-corrected text-lines are then used as the GT data to train a segmentation-free OCR system, which learns to correct the errors by incorporating contextual information. The proposed framework is tested on a collection of 15th-century Latin documents with promising success. The iterative procedure using segmentation-free OCR is able to reduce the initial character error of about 23% (obtained from segmentation-based OCR) to less than 7% in a few iterations.

6.2.1 Introduction

There could be multiple approaches to OCR documents, whether historical or modern, for which training data is not available. The first intuitive idea to deal with the unavailability of data is to use a segmentation-based OCR approach that relies on character segmentation from a scanned image and does not require much training data. The only requirement in such approaches is to extract unique characters from the whole document and train a shape-based classifier [AHN+14, Whi13]. The next possible approach would be to manually transcribe a part of the data that needs to be recognized, to train a classifier based on a segmentation-free OCR approach (which is better for context-aware recognition), and then use this model to OCR the rest of the corpus. However, both of the above-mentioned solutions carry their own demerits. The first approach, in practice, does not generalize well to new documents, as shown by the performance of the Tesseract default model for Polytonic Greek on Polyton-DB in Section 6.1.4. The second approach requires a large amount of transcribed data for every type of document that is to be digitized. The use of artificial data is also becoming popular; however, generating such a database requires some already transcribed data.

Our hypothesis is that segmentation-based approaches can be used in tandem with segmentation-free approaches to OCR documents for which no or very limited training data is available. Specifically, the proposed approach is designed for 15th-century printed Latin documents, for which only a very small amount of training data is available. However, the proposed framework can equally be applied to other situations where no training data is available. The proposed framework successfully combines the benefits of both approaches. The first issue in using a segmentation-free approach is the need for manually transcribed data. This problem is solved by using a segmentation-based approach to generate semi-corrected GT data. The second issue, in using a segmentation-based approach, is the poor generalization on new data. LSTM networks aptly solve this problem, as they have demonstrated excellent results on unseen data [BUHAAS13]. The recognition error is reduced with each iteration of training using the proposed design: the CER is reduced from 23.64% to 6.569% in just three iterations. Section 6.2.2 describes the details of this framework.

6.2.2 Methodology

The novelty of the proposed idea is to use both segmentation-based and segmentation-free OCR approaches in tandem to design a high-performance OCR system for documents having no or very limited GT data. The complete pipeline of the OCRoRACT framework is shown in Figure 6.5. The idea is to use the segmentation-based OCR to start the training on individual symbols, as not much training data is usually required for such systems. The OCR models obtained can be used to get a semi-corrected text, which can subsequently be used to train a segmentation-free OCR system. The process starts with the extraction of individual symbols from scanned pages or text-lines. A language expert can provide a list of unique symbols in a given document along with their Unicode representation. Alternatively, a clustering process can follow the symbol extraction step to find the unique symbols. In the present work, the former path is taken due to the availability of some language experts. Tesseract [Smi07] is trained to recognize text based on the given symbols. The Tesseract OCR model is then used to generate the semi-corrected GT information, which is subsequently used to train OCRopus [OCR15], a segmentation-free open-source OCR system based on LSTM neural networks. The trained LSTM model is then used again to improve the GT information. This iterative process can be repeated for any number of iterations; however, in the current work, we chose to stop the training after seeing a small improvement (less than 1%) in the GT data over the previous iteration. The details of the LSTM-based OCR approach are presented in Chapter 5, and the details about Tesseract can be found in [Smi07].
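The control flow of this iterative training can be sketched as follows. The helper callables (`ocr_with_tesseract`, `train_lstm`, `cer`) are hypothetical stand-ins for the real Tesseract and OCRopus steps, and the stopping rule uses the change between successive GT versions as a proxy for the "less than 1% improvement" criterion; this is an illustration of the idea, not the thesis' implementation.

```python
# OCRoRACT-style bootstrapping loop (sketch). Only the control flow is shown;
# the callables wrap the actual segmentation-based and segmentation-free systems.
from typing import Callable, List, Tuple

def bootstrap_ocr(
    line_images: List[str],
    ocr_with_tesseract: Callable[[str], str],               # symbol-trained Tesseract
    train_lstm: Callable[[List[Tuple[str, str]]], Callable[[str], str]],
    cer: Callable[[str, str], float],                        # character error rate
    min_improvement: float = 0.01,
    max_iterations: int = 10,
) -> Callable[[str], str]:
    """Return an LSTM line recognizer trained on iteratively corrected GT."""
    # Iteration 0: semi-corrected GT comes from the segmentation-based system.
    ground_truth = [ocr_with_tesseract(img) for img in line_images]
    model = None
    for _ in range(max_iterations):
        model = train_lstm(list(zip(line_images, ground_truth)))
        new_gt = [model(img) for img in line_images]
        # Mean change between successive GT versions, used as a proxy for improvement.
        change = sum(cer(old, new) for old, new in zip(ground_truth, new_gt)) / len(new_gt)
        ground_truth = new_gt
        if change < min_improvement:                         # GT barely changed: stop
            break
    return model
```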
However, it is important to mention the procedure for training Tesseract, which requires a specific box-file containing the bounding-box information of individual connected components in the image along with their corresponding Unicode representation. It is also required to provide this information in the form of a page, so that Tesseract can learn the context during training. As mentioned before, transcribed data for the medieval documents is not available; therefore, meaningless text is generated from the unique symbols, that is, characters are placed in a text-line randomly. However, the characters are placed according to placement rules with respect to the baseline and x-height information. A thorough analysis has been carried out to gather statistics about the various characters and the blank spaces between individual characters and words.

The basic concept behind the proposed framework is to jointly use segmentation-based and segmentation-free approaches to OCR documents for which GT data is unavailable to train ML algorithms. The proposed methodology has been evaluated in three ways, in order to compare the results of the alternatives mentioned in Section 6.2.1. Firstly, Tesseract is used to OCR the database described in Section 6.2.3. Secondly, an LSTM network is trained directly on a subset of the dataset that has been transcribed manually. Thirdly, the proposed approach is used to iteratively improve the semi-corrected GT data to train LSTM networks.

6.2.3 The Database

In order to validate our research hypothesis, 15th-century Latin documents, available under the German government-funded project Kallimachos, are used. There are hundreds of novels written in Latin and Fraktur that are to be digitized. Firstly, it is important to remember that no GT data is available for any of these novels. Secondly, the pages in these novels not only contain degradations due to aging but also contain annotations, which make them more challenging for document analysis. A subset of 100 images from one novel in Latin script has been selected for training, as well as 8 images from another Latin novel for testing. For performance evaluation, GT for 100 scanned pages with 3,329 text-lines has been generated manually.

6.2.4 The System Parameters

The system parameters for Tesseract are kept at their default values. There are some tunable parameters for training LSTM-based OCR models. The first important parameter is the number of hidden layers, which is chosen to be one. The second important parameter is the number of LSTM cells in the hidden layer: a higher number of cells takes longer to converge due to the resulting size of the network. For the experimental results reported in this section, the number of LSTM cells in the hidden layer is fixed at 100. Normalizing text-line images to a fixed height is an important preprocessing step; in our experience, an image height of 48 pixels is a reasonable choice and has worked satisfactorily for a variety of scripts. The remaining parameters, the learning rate and momentum, are set to 1e-4 and 0.9 in the current work. The performance is estimated in terms of the Character Error Rate (CER), defined by Equation 5.1.
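Equation 5.1 is not reproduced in this excerpt; CER is conventionally computed as the edit distance between the recognized text and the GT divided by the number of GT characters. A minimal sketch under that assumption:

```python
# Character Error Rate as Levenshtein distance over ground-truth length
# (a common formulation; the exact form of Equation 5.1 is assumed here).
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return prev[-1]

def character_error_rate(recognized: str, ground_truth: str) -> float:
    return levenshtein(recognized, ground_truth) / max(len(ground_truth), 1)

print(character_error_rate("Fraktur 0CR", "Fraktur OCR"))  # one substitution -> ~0.09
```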
6.2.5 Results

This section reports the results of the three experiments performed to test the performance of the proposed framework. Two test datasets are used in the evaluation process. The first dataset consists of two randomly selected pages from the same book that is used for training the Tesseract and OCRopus OCR systems; however, these pages are not part of any training. These two pages contain a total of 104 text-lines and 2,877 characters. This dataset is termed 'T1'. The second dataset consists of 8 pages randomly selected from a book that has not been used in training at all. These pages contain a total of 270 text-lines and 7,203 characters. This dataset is termed 'T2'.

The intermediate results obtained by the OCRoRACT framework are presented first, before comparing the final results with the other two alternatives. During the first stage (iteration-0), a Tesseract OCR model is trained on the unique symbols extracted from the given training documents. Tesseract yields a CER of 23.64% on the 'T1' dataset and a CER of 14.4% on the data collection that is later used as the training database for the LSTM-based OCR. This means that the LSTM-based OCR has been trained with 14.4% erroneous GT data (iteration-1). The LSTM model thus trained yields a CER of 7.37% on the 'T1' dataset. The same model improves the GT data from an error of 14.4% to 7.154% (an improvement of 50.1%). During the second iteration (iteration-2), the LSTM models are trained with the improved GT obtained from the first iteration. The model thus trained gives a CER of 7.26% on the 'T1' dataset and improves the GT by a further 9.56% (to a CER of 6.47%). The improved GT is used again (iteration-3) to train another LSTM model in the third iteration, which results in a CER of 6.57% on the 'T1' data and improves the GT by 0.9%. This iterative process could go on further, but it was stopped at this stage as the GT improvement is now less than 1%. The LSTM models obtained after the third iteration are used to OCR the 'T2' dataset, along with Tesseract (which was trained at the beginning of this process) and the OCRopus model trained with correct GT information. The results of the three evaluations are listed in Table 6.2. A qualitative comparison is also performed, and some of the input images along with the OCR output are shown in Table 6.3.

6.2.6 Error Analysis and Discussions

The top confusions made by the three OCR systems on the 'T1' dataset are shown in Table 6.4. Tesseract has mostly made errors in distinguishing between small and capital letters. The OCRoRACT system, which has been trained on the erroneous GT data, corrects many of these errors; however, it confuses similarly shaped characters such as 'ā' and 'a' or 'ū' and 'u'. It should be noted that both of these systems yield many insertion and deletion errors related to the 'space' character. One way to reduce the errors made by the OCRoRACT system is to retrain the Tesseract system with a greater variety of the symbols that are causing confusions. The OCRopus system produces fewer errors; however, the top confusions are again 'space' insertions and deletions. Other notable errors are deletion errors, which can generally be reduced by better training. It must be noted that the training data is limited, yet the performance of the LSTM networks is still quite satisfactory. It is also important to look at the errors obtained when applying these three systems to the 'T2' dataset, which has not been used during training. The top confusions, comparing the outputs of the three systems on the 'T2' dataset, are listed in Table 6.5.
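Confusion tables such as Tables 6.4 and 6.5 can be approximated by aligning each recognized line with its GT line and counting substitutions, insertions and deletions. The sketch below uses Python's difflib for the alignment; the thesis' exact alignment procedure is not specified in this excerpt, so this is only an approximation.

```python
# Count character-level confusions between OCR output and ground truth using
# difflib alignment (sketch; the thesis' alignment procedure may differ).
from collections import Counter
from difflib import SequenceMatcher
from typing import Iterable, Tuple

def count_confusions(pairs: Iterable[Tuple[str, str]]) -> Counter:
    """pairs: (recognized_line, ground_truth_line). Returns a Counter of (gt, ocr) confusions."""
    confusions = Counter()
    for ocr, gt in pairs:
        for tag, i1, i2, j1, j2 in SequenceMatcher(None, ocr, gt).get_opcodes():
            if tag == "replace":
                for c_ocr, c_gt in zip(ocr[i1:i2], gt[j1:j2]):
                    confusions[(c_gt, c_ocr)] += 1       # substitution
            elif tag == "delete":                        # extra characters in OCR output
                for c_ocr in ocr[i1:i2]:
                    confusions[("", c_ocr)] += 1         # insertion error
            elif tag == "insert":                        # characters missed by OCR
                for c_gt in gt[j1:j2]:
                    confusions[(c_gt, "")] += 1          # deletion error
    return confusions

# Example: the ten most frequent confusions over a list of (ocr, gt) line pairs.
# top10 = count_confusions(line_pairs).most_common(10)
```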
6.2.7 Conclusion

This section presents a novel framework, OCRoRACT, that combines the benefits of segmentation-based and segmentation-free OCR approaches to OCR documents for which no training data is available. Moreover, the reliance on a human expert to generate transcribed data is greatly reduced, if not eliminated, by the proposed methodology. The performance of this system is excellent when a document with a similar script is evaluated. Furthermore, since the system does not require accurate GT information, it can be retrained as new output becomes available. This framework can be improved further by extending it to cursive scripts having large numbers of ligatures.

wanghaisheng commented 6 years ago

6.3 Chapter Summary

This chapter discusses the contribution of this thesis to the OCR of historical documents. The LSTM-based OCR methodology has been successfully applied to three historical European scripts: Fraktur, Polytonic Greek and 15th-century Medieval Latin. Creating manual GT data for LSTM-based approaches is tedious and costly in terms of manual effort and language expertise. The use of synthetic data for LSTM training has been very successful in achieving reasonable accuracy on Fraktur and Greek; however, to generate the artificial data, one must already have transcribed text. To overcome the challenge of unavailable GT data, a novel framework, OCRoRACT, has been introduced that combines the segmentation-based and segmentation-free OCR approaches. Segmentation-based approaches require only individual characters for training; however, they do not generalize well in practice on new data. This limitation is overcome in the proposed framework by training the LSTM networks on the semi-corrected GT data generated by the segmentation-based approach.