Adnan Ul-Hasan's PhD Thesis, Chapter 8: A Generalized OCR Architecture for Multi-script Documents

Multilingual documents are common in today's computer age. A plethora of these documents exists in the form of translations, books, operational manuals, etc. Their abundance in everyday life has two main causes. Firstly, globalization is carrying technological advancements to every corner of the world, and international customers increasingly need to access technology in their native language. This has a two-fold impact: 1) operational manuals of electronic gadgets are required in multiple languages, and 2) knowledge available in other languages has become very easy to access, so the number of bilingual books and dictionaries has increased. Secondly, English has become an international language, and the effect of this internationalization is evident in its impact on many languages. Several languages have adopted words from English, and various documents, for instance newspapers, magazines, and articles, use many English words on a daily basis. Therefore, the need to develop reliable Multilingual OCR (MOCR) systems to digitize these documents has grown manifold.
Despite the increase in the availability of multilingual documents, automatic recognition of multilingual text remains a challenge. Popat [Pop12] pointed out several challenges in the context of the Google Books project. Some of these unique challenges are:
• Multiple scripts/languages on a single page.
• Multiple languages in the same or similar scripts, like Arabic-Persian, English-German.
• The same language in multiple scripts, like Urdu in Nastaleeq and Naskh scripts.
• Archaic and reformed orthographies, for example, 18th Century English, Fraktur (historical German).
One solution to handle multilingual documents is to develop an OCR methodology that can recognize all characters of all scripts.
However, it is commonly believed that such a generic OCR framework would be very difficult to realize [PD14]. The alternative (shown in Figure 8.1) is to employ a script identification step before recognizing the text. This step separates the various scripts present in a document, so that a unilingual OCR model can be applied to recognize each script. This procedure, however, is unsatisfactory for many reasons, some of which are listed below:
• Script identification is itself quite a challenging feat. Traditionally, it involves finding suitable features of the given script(s). If the same script identification methodology is to be used for other scripts, one has to either fine-tune these hand-crafted features or look for other features.
• The process of script identification (see Chapter 7) is not perfect, so the scripts cannot be separated reliably. This directly affects the recognition accuracy of the OCR system employed.
• Moreover, humans do not process multilingual documents using a script identification step. A multilingual person reads a multilingual document in much the same manner as he/she would read a monolingual document.
Hence the ultimate aim in the OCR of multilingual documents is to develop a generalized OCR system that can recognize all scripts. An MOCR system must be able to handle various scripts and must also be robust against intraclass variations, that is, it should be able to recognize letters despite slight variations in their shapes and sizes. Although the idea of a generalized OCR system is not new, it has not been pursued greatly because of the lack of computational power and of suitable algorithms to recognize all characters of multiple scripts. However, recent advances in the fields of machine learning and pattern recognition have shown great promise on many tasks that were once considered very difficult.
Moreover, these learning strategies are claimed to mimic the neural networks employed in the human brain, so they should be able to replicate human capabilities better than other neural networks. The main contribution of this chapter is a generalized OCR framework that can be used to OCR multilingual and multiscript documents without the traditional script identification step. A sub-goal of this work is to highlight the discriminating power and sequence learning capability of LSTM networks for a large number of classes in OCR tasks. The trained LSTM networks can successfully discriminate hundreds of classes when they are trained for multiple scripts/languages simultaneously.
The rest of this chapter is organized as follows. Section 8.1 reports the work done by other researchers to develop generalized OCR systems for multilingual documents. Our quest for a generalized OCR system starts with the development of a single OCR model that can recognize multilingual text in which all languages belong to a single script. Section 8.2 discusses this cross-language performance of LSTM networks. The next step of our quest is to extend the idea of a “single OCR model” from multilingual documents to multiscript documents. A single OCR model that can recognize text in multiple scripts is the first step in realizing a generalized OCR system. Section 8.3 describes the design of the LSTM-based generalized OCR framework in detail. Section 8.4 concludes the chapter with a brief summary and outlines some directions in which the present work can be further extended.

8.1 Traditional Approaches for MOCR
The usual approach to address the MOCR problem is to somehow combine two or more separate classifiers [OHBA11]. This is because of the common belief that a reasonable OCR output for a single script cannot be obtained without sophisticated post-processing steps such as language modeling, use of a dictionary to correct OCR errors, font adaptation, etc. Natarajan et al. [NLS+01] proposed an HMM-based script-independent MOCR system. The feature extraction, training and recognition components of this system are all language independent; however, they used language-specific word lexicons and language models for recognition.
There have also been efforts to adapt existing OCR systems to other languages. The open-source OCR system Tesseract [SAL09] is one such example. The recognition of characters in Tesseract is based on hierarchical shape classification. The character set is reduced to a few basic characters, and at the last stage the test sample is matched against the representatives of the reduced set. Although Tesseract can be used for a variety of languages, it cannot be used as an all-in-one solution in situations where multiple scripts are present together in a single document. Similar to Tesseract, the BBN BYBLOS system [LBK+98] can be trained for multiple languages; however, this system is also not capable of recognizing multiple languages and scripts simultaneously.
To the best of our knowledge, no method has been proposed for MOCR that achieves very low error rates without using sophisticated post-processing techniques. However, experiments on many scripts using LSTM networks have demonstrated that low error rates can still be obtained without such techniques. The details of the LSTM-based language independent OCR framework are presented in the next section.
8.2 Language Independent OCR with LSTM Networks
Language models or recognition dictionaries are usually considered an essential step in OCR. However, using a language model complicates the training of OCR systems, and it also narrows the range of texts that an OCR system can be used with. Recent results have shown that LSTM-based OCR yields low error rates even without language modeling. This leads us to explore the question of to what extent LSTM models can be used for MOCR without language models. To this end, we measure the cross-language performance of LSTM models trained on different languages. These models have exhibited a great capability for language independent OCR: the recognition errors are very low (around 1%) without using any language model or dictionary correction.
Our hypothesis for language independent OCR is that if a single model can be obtained for a single script that is common to many languages, e.g. Latin, Arabic, Devanagari, we can then use this single model to recognize the text of that particular family. Doing so reduces the effort needed to combine multiple classifiers. The basic aim of this work is to benchmark the extent to which LSTM networks rely on language modeling to predict the correct labels, or whether they can do well without any language modeling or other post-processing step. Additionally, we want to see how well LSTM networks use contextual information to recognize a particular character.
8.2.1 Experiment Setup
To explore the cross-language performance of LSTM networks, a number of experiments have been performed. We have trained four separate LSTM networks for English, German, French and a Mixed-Data set of all these languages. For testing, there are a total of 16 train/test combinations. Each of the four aforementioned LSTM models is tested on its respective language and on the other three as well; for example, the LSTM network trained on German is tested on a separate German corpus and on French, English, and Mixed-Data.
These results are detailed in Section 8.2.5. As an error metric, the ratio of insertions, deletions and substitutions relative to the GT (CER) has been used, and accuracy is measured at the character level. The number of such edits is termed the ‘Levenshtein Distance’ in the literature, and the metric is given by Equation 5.1.
This section is further organized as follows. The next sub-section describes the binarization and the text-line normalization, which are the first steps in the LSTM-based approach. Details on preparing the dataset for training the LSTM models and the dataset for evaluation are given next. After the details on the database, the LSTM network parameters are given, while the results are presented at the end of this section.
8.2.2 Preprocessing
Binarization and text-line normalization form the preprocessing step in this experiment. Since synthetically generated text-lines are used in this work, binarization is carried out at the text-line generation step. However, text-line normalization is done separately. The scale and relative position of a character are important features in distinguishing characters in the Latin script (and some other scripts). Moreover, 1D-LSTM networks are not translation invariant in the vertical direction. Text-line normalization is therefore an essential step in applying such networks. In this work, we have used the normalization approach introduced in [BUHAAS13] (see Appendix B for details), namely text-line normalization based on a trainable, shape-based model. A token dictionary created from a collection of text lines contains information about the x-height, baseline, and shape of individual characters. These models are then used to normalize any text-line image.
8.2.3 Database
To evaluate the proposed methodology, a separate synthetic database for each language is developed using the approach described in Chapter 4.
Separate corpora of text-line images in German, English and French are generated with commonly used typefaces (including bold, italic, and bold-italic variations) from freely available online literature. These images are degraded using some of the degradation models described in [Bai92] to reflect scanning artifacts. Four degradation parameters, namely elastic elongation, jitter, sensitivity, and threshold, have been selected. Sample text-line images from our database are shown in Figure 4.6. Each database is further divided into training and test subsets. Statistics on the number of text-line images in the training and test corpora of each script are given in Table 4.2.
8.2.4 LSTM Architecture and Parameters
For the experiments carried out in this work, the 1D-BLSTM architecture has been utilized. We have found that this architecture performs better than more complex LSTM architectures for printed OCR tasks (please refer to Appendix A for further details about LSTM networks and their different variants). 1D-LSTM networks require text-line images of a fixed height, as they are not translation invariant in the vertical dimension. Therefore, “normalization” is employed to make sure that the sequence depth remains consistent for all the inputs. The text lines are normalized to a height of 32 in the preprocessing step. Both the left-to-right and right-to-left LSTM layers contain 100 LSTM memory blocks. The learning rate is set to 1e-4, and the momentum to 0.9. The training is carried out for one million steps (roughly corresponding to 10 epochs, given the size of the training set). Training errors are averaged after every 10,000 training steps. The network corresponding to the minimum training error is used for test set evaluation.
While most other approaches use language modeling, font adaptation and dictionary corrections as means to improve their results, LSTM networks have been shown to yield comparable results without employing these techniques.
Therefore, it should be recognized that the reported results are obtained without the aid of any of the above-mentioned post-processing steps. Moreover, it should also be noted that no handcrafted features are used for the training of LSTM networks to recognize multilingual text.
8.2.5 Results
The experimental results are listed in Table 8.1, and some sample outputs are presented in Table 8.2. Since there are no umlauts (German) or accented letters (French) in English, the words containing those special characters are omitted from the recognition results when testing the LSTM model trained for German on French, and the model trained for English on French and German. If such words were not removed, the resulting errors would also contain a proportion of errors due to erroneous recognition of characters that were not present in the training of the LSTM model for that language. By removing them, the true performance of the LSTM network trained [...]
The LSTM model trained for Mixed-Data is able to obtain similar recognition results (around 1% recognition error) when applied to English, German and French individually. Other results indicate a small language dependence, in that LSTM models trained for a single language yielded lower error rates when tested on the same language than when they were evaluated on other languages.
To gauge the magnitude of the effect of language modeling, we have compared our results with the Tesseract (Version 3.02) open-source OCR system [Smi07]. The available models for English, French and German have been evaluated on the same test data. The Tesseract system yields very high error rates (CER) as compared to the LSTM models. It seems that the Tesseract models are not trained on certain fonts, thereby resulting in more recognition errors on these fonts. The Tesseract OCR model for English yields 7.7%, 9.1% and 8.3% CER when applied to French, German and Mixed-Data respectively.
The OCR model for French returns 7.14%, 7.3% and 6.8% CER when applied to English, German and Mixed-Data respectively, while the OCR model for German returns 7.2%, 8.59% and 7.4% recognition error when applied to English, French and Mixed-Data respectively. These results show that the absence of language modeling, or the application of a different language model, adversely affects recognition in Tesseract. Since no model for Mixed-Data is available for Tesseract, the effect of evaluating such a model on individual languages could not be computed.
8.2.6 Error Analysis
The results reported in this work demonstrate that LSTM networks can be used for MOCR. LSTM networks do not learn a particular language model internally (nor do we need any such model as a post-processing step). Moreover, they show great promise in learning the various shapes of a certain character in different fonts and under degradation (as evident from our highly versatile data). The language dependence is observable, but the effects are small as compared to other contemporary OCR methodologies, where the absence of language models leads to very poor results. To gauge the language dependence more precisely, one can evaluate the performance of LSTM networks by training them on randomly generated data using n-gram statistics and testing those models on natural languages.
In the following text, we analyze the errors produced by the LSTM networks when applied to other languages. The top 5 confusions for each case are tabulated in Table 8.3. The case of applying an LSTM network to the same language for which it is trained is not discussed here, as it is not relevant to the discussion of the cross-language performance of LSTM networks.
Most of the errors caused by the LSTM network trained on Mixed-Data are due to its failure to recognize certain characters like ‘l,t,r,i’. These errors may be removed by increasing the amount of training data that contains these characters.
Looking at the first column of Table 8.3 (applying the LSTM network trained for English to the other 3 test sets), most of the errors are due to confusion between characters of similar shapes, like ‘I’ to ‘l’ (and vice versa), ‘Z’ to ‘2’ and ‘c’ to ‘e’. Two confusions, namely ‘Z’ with ‘A’ and ‘Z’ with ‘L’, are interesting as there is apparently no shape similarity between them. However, if the ‘Z’ gets noisy due to scanning artifacts, then it may look similar to an ‘L’. Another possible cause of this error is that ‘Z’ is the least frequent letter in English, and thus there may not be many ‘Zs’ in the training data.
For LSTM networks trained on German (second column in Table 8.3), most of the top errors are due to the inability of the LSTM to recognize a particular character. The top errors when applying the LSTM network trained for French to the other languages are shape confusions of w/W with v/V. An interesting observation, which could be a possible reason for such behaviour, is that the relative frequency of ‘w’ is higher than that of ‘v’ in German and English, while it is much lower in French. So, this is a language dependent issue, which is not observable in the case of Mixed-Data.
8.2.7 Conclusion
The application of LSTM networks to language independent OCR demonstrates that these networks are capable of learning many character shapes simultaneously. Therefore, they can be utilized to recognize multiple scripts simultaneously. The next section reports a generalized OCR framework in which LSTM networks have been used to recognize multiscript documents without the aid of a separate script identification module. LSTM networks have demonstrated great ability to serve as a universal OCR engine for recognizing text in multiple languages and scripts.
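The character error rate used as the metric in Section 8.2 can be illustrated with a small sketch. Equation 5.1 is not reproduced in this chapter, so the code below simply follows the verbal definition given above (insertions, deletions and substitutions relative to the ground truth); the function names `levenshtein` and `cer` are illustrative and not taken from the thesis.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of insertions, deletions and substitutions
    needed to turn the recognized text `hyp` into the reference `ref`."""
    prev = list(range(len(hyp) + 1))  # distances for the empty prefix of ref
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # character of ref missing in hyp
                            curr[j - 1] + 1,      # extra character in hyp
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def cer(ground_truth: list[str], predictions: list[str]) -> float:
    """Corpus-level character error rate: total edits / total GT characters."""
    edits = sum(levenshtein(g, p) for g, p in zip(ground_truth, predictions))
    chars = sum(len(g) for g in ground_truth)
    return edits / chars if chars else 0.0
```

Under this definition, a cross-language table such as Table 8.1 would be filled by calling `cer` once for each of the 16 train/test combinations, each time with the test-set ground truth and the corresponding model output.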

8.3 Generalized OCR with LSTM Networks
Generalized OCR is the term used for an OCR system that can recognize text in multiple scripts and languages simultaneously. Encouraged by the promising results obtained by LSTM networks on language independent OCR for the Latin script, this section reports the extension of the same idea to recognize text comprising multiple scripts, such that the traditionally employed script identification step can be avoided (see Figure 8.2). The proposed methodology for generalized OCR is essentially the same as that of using LSTM networks for a single script or unilingual OCR.
The sequence learning methodology for a single script or unilingual OCR system involves training LSTM networks on a large corpus of text-line images whose GT information is given. The GT information contains the character labels, or an equivalent encoding, of a single script. An LSTM network is trained to learn the sequence-to-sequence mapping between the given text-line image and the associated ground-truth sequence.
In the proposed technique to OCR multilingual documents, the GT data contains the class labels representing the character set of all scripts. LSTM networks have been used as a sequence learner on text-line images where the target labels are characters of multiple scripts. The LSTM-based line recognizer learns the sequence-to-sequence mapping between any given text-line image and the multiscript target sequence. Salient features of the proposed approach are as follows:
• No handcrafted features are used; instead the LSTM network learns the features from the raw pixel values of the text-line images.
• No post-processing has been done to correct the OCR errors using language modeling, dictionary correction or other such operations.
• Text is recognized at the text-line level, thereby requiring only text-lines to be extracted in the layout analysis step.
Our hypothesis in this work is that LSTM networks can be utilized to OCR multiple scripts using a single OCR model. This hypothesis is based on the results reported in the literature on the usage of LSTM networks for various sequence learning tasks. To test our hypothesis, a single LSTM-based line recognizer is trained with a corpus containing multiple scripts (in our case, this corpus contains Latin and Greek scripts). To gauge the accuracy, the standard metric of Levenshtein Distance is used (see Equation 5.1). The accuracy is measured at the character level and reported as an error rate. The experimental evaluation of the LSTM-based solution for generalized OCR is presented in the following sections.
8.3.1 Preprocessing
As mentioned earlier, normalization is an important preprocessing step in applying 1D-LSTM networks. The filter-based normalization method, as explained in Appendix B, is used for the experiments reported in this section. This method of text-line normalization is script independent and has been shown to work for both printed and handwritten text [YSBS15].
8.3.2 Database
One of the main issues in developing MOCR technology is the unavailability of standardized databases. A lot of work has been reported for script identification on multilingual documents; however, as mentioned previously, the datasets used therein are either private or no longer available. Therefore, to evaluate our hypothesis about generalized OCR, synthetically generated multilingual text-lines are used. Though real data cannot be fully replaced for training, LSTM networks have shown the capacity to tolerate small variations in character shapes.
In order to train the LSTM networks, we have used 90,000 synthetically generated text-line images in Serif and Sans-Serif fonts with normal, bold, italic and bold-italic styles (see Figure 8.3 for some example images). The process to generate artificial text-line images is explained in Chapter 4.
Since these text-lines are taken from natural documents, they contain the natural variation of scripts. Some text lines contain only one script and some contain a good distribution of words from multiple scripts.
8.3.3 LSTM Architecture and Parameters
The architecture of the LSTM-based line recognizer used for the generalized OCR methodology is shown in Figure 8.4. It is basically the same architecture that has been used throughout this thesis. The 1D-LSTM-based OCR system uses a small number of tunable parameters. Two important parameters are the number of hidden layers and the number of LSTM cells in each hidden layer. In this work, we used only one hidden layer with 100 LSTM memory cells in each of the right-to-left and left-to-right layers (corresponding to the bidirectional mode). The other parameters are the learning rate (set to 1e-4) and the momentum (set to 0.9) in the reported experiments. These parameters are the same as those used in [BUHAAS13], because the performance of an LSTM line recognizer is fairly unaffected by the use of other values. The network has been trained for 1 million iterations (please refer to Figure 8.5 to see how training errors appear at each iteration) and the intermediate models were [...]
8.3.4 Results
To test the generalization capability of the trained LSTM network, 9,900 synthetically generated text-line images (using the same methodology described in Section 8.3.2) are used. The LSTM-based generalized OCR model yields an error rate of 1.28% on this test data. A sample image that is correctly recognized by the best trained model is shown in Figure 8.6, and an image on which the same model fails to predict the correct labels is shown in Figure 8.7.
8.3.5 Error Analysis
The top confusions for OCR using the proposed generalized MOCR approach are tabulated in Table 8.4. It can be observed that many errors are due to the insertion or deletion of punctuation marks and to ‘space’ deletions. This is understandable because of the small size of punctuation marks.
Another source of errors is the confusion between similar characters in the two scripts. Letters such as o/ο, X/Χ, O/Ο (the first characters are Latin, while the latter ones are Greek) are indistinguishable even to human eyes. If one knows the context, however, it is easier to recognize any character. When these characters occur in conjunction with punctuation marks, recognition becomes difficult because in this case the context does not help the LSTM networks. To elaborate this point further, consider Table 8.5, where confusions are shown together with their neighboring context. There are many instances of similar characters accompanied by punctuation marks. These pairs make the contextual processing of the neural network difficult, resulting in substitution errors.
However, it must be noted that these errors are due to the similarity of some characters in both scripts. Both scripts (English and Greek) share the shapes of some characters (around 16), resulting in substitution errors. These errors will not be present in the case of scripts that are markedly different in terms of their grapheme structure, e.g., English and Devanagari or Latin and Arabic.
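A confusion table of the kind discussed above can be derived by aligning each ground-truth line with the recognizer output and counting the mismatched pairs. The sketch below is only an illustration of that idea and is not the procedure used to produce Tables 8.4 and 8.5: it relies on Python's `difflib.SequenceMatcher` for the alignment, which is not guaranteed to be the minimum-edit alignment, and the function name `confusions` is hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher


def confusions(ground_truth: str, prediction: str) -> Counter:
    """Count (ground-truth, predicted) pairs for one text line.

    An empty string on the predicted side marks a deletion, an empty
    string on the ground-truth side marks an insertion.
    """
    counts = Counter()
    matcher = SequenceMatcher(None, ground_truth, prediction, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        gt_part, pred_part = ground_truth[i1:i2], prediction[j1:j2]
        if tag == "replace" and len(gt_part) == len(pred_part):
            # Equal-length blocks are split into per-character substitutions.
            counts.update(zip(gt_part, pred_part))
        else:
            # Insertions, deletions and uneven blocks are kept as one entry.
            counts[(gt_part, pred_part)] += 1
    return counts


# Summing the counters over all test lines gives the most frequent confusions.
total = Counter()
for gt, pred in [("a, b", "a b"), ("\u03bfxi", "oxi")]:
    total += confusions(gt, pred)
top = total.most_common(5)
```

Keeping a character of context on either side of each mismatch (for example `ground_truth[i1 - 1:i2 + 1]`) would give a view closer to the context-based confusions described for Table 8.5.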

This chapter validates the capability of LSTM networks for the OCR of multilingual documents (with multiple languages and scripts). As a first step, a single LSTM model is trained with a mixture of three European languages of a single script, namely English, German and French. The OCR model thus obtained produces very low errors in recognizing the text of these languages without using any post-processing techniques, such as language modeling or dictionary correction. A language dependence is observed as a reduction in character recognition accuracy compared to single-language OCR; however, the effect is small in comparison to other OCR methods.
The idea behind language independent OCR is then extended to a generalized OCR framework that can OCR multilingual documents comprising multiple scripts. The presented methodology is, by design, very similar to that of a single-script OCR system and does not employ the traditional script identification module. A single LSTM-based OCR model is trained for Latin-Greek bilingual documents and yields a very low Character Error Rate (CER) on a dataset consisting of 9,900 text-lines.
The results of our experiments support our claim that multiscript OCR can be done without a separate script identification step, thereby saving the effort spent on script identification and separation. The proposed system can be retrained for any new script by just specifying the character set of that script during the training phase. The remaining OCR errors are mainly due to the similarity of the two scripts. The algorithm could be tested further on other multilingual documents containing other languages and scripts, such as Arabic and English, Devanagari and English, and many more.
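The statement that a new script only requires its character set to be specified can be made concrete with a small sketch of a joint label set built from the ground-truth transcriptions. This is an illustration under stated assumptions, not the thesis implementation: the helper names `build_codec`, `encode` and `script_of` are hypothetical, and reserving label 0 for a "blank" class is an assumption borrowed from CTC-style training rather than something stated in this chapter.

```python
import unicodedata


def build_codec(transcriptions):
    """Map every character seen in the ground truth to an integer class label.

    Labels start at 1; label 0 is left free for a 'blank' class (assumption).
    """
    charset = sorted(set("".join(transcriptions)))
    return {ch: idx for idx, ch in enumerate(charset, start=1)}


def encode(text, codec):
    """Turn a transcription into the label sequence used as a training target."""
    return [codec[ch] for ch in text]


def script_of(ch):
    """Rough script tag taken from the first word of the Unicode character name."""
    return unicodedata.name(ch, "UNKNOWN").split()[0]


# Ground truth from both scripts is pooled into a single label set.
lines = ["The Greek word \u03bb\u03cc\u03b3\u03bf\u03c2", "means word or reason."]
codec = build_codec(lines)
targets = [encode(line, codec) for line in lines]
```

Because Latin `o` and Greek `ο` are distinct code points, they receive distinct class labels in such a codec even though their printed shapes are nearly identical, which is consistent with the substitution errors discussed in Section 8.3.5. Adding a further script would only mean adding its transcriptions to `lines` before building the codec and retraining.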

Conclusions and Future Work
This thesis contributes to the field of Optical Character Recognition (OCR) for printed documents by extending the use of contemporary Recurrent Neural Networks (RNN) in this domain. There has been an increasing demand for reliable OCR systems for complex modern scripts like Devanagari and Arabic, whose combined user population exceeds 500 million people around the globe. Likewise, large-scale digitization efforts for historical documents require robust OCR systems to preserve the literary heritage of our world. Furthermore, the abundance of multilingual documents present today in various forms intensifies the need for usable OCR systems that can handle multiple scripts and languages.
This thesis contributes in two ways to solving some of these issues. Firstly, several datasets have been proposed to evaluate the performance of OCR systems for printed Devanagari and Polytonic Greek scripts. Databases for OCR tasks related to multilingual documents, such as script identification and the cross-language performance of an OCR system, have also been proposed. Additionally, a bilingual database to evaluate script independent OCR systems has been developed. Secondly, the Long Short-Term Memory (LSTM)-based OCR methodology has been assessed for some modern scripts including English, Devanagari, and Nastaleeq. This methodology has also been evaluated for some historical scripts including Fraktur, Polytonic Greek, and the medieval Latin script of the 15th century. A generalized OCR framework for documents containing multiple languages and scripts has also been put forward. This framework allows the use of a single OCR model for multiple scripts and languages.
Several conclusions can be drawn from the work done in this thesis.
• The first and foremost is that the use of artificial data, if generated carefully to closely reflect the degradations in the scanning process, can replace the need for a large collection of real ground-truthed datasets.
Experiments performed by training the LSTM-based line recognizer on synthetic data and testing it on real scanned documents justify this claim.
• The marriage of segmentation-based and segmentation-free approaches to OCR documents for which GT data is not available results in a framework that can self-correct the ground-truthed data in an iterative manner.
• The powerful context-aware processing makes the LSTM-based OCR network a suitable sequence learning machine that can perform well on monolingual scripts as well as on multiple scripts. 1D-LSTM networks require very few parameters to tune, and these networks outperform more complex MDLSTM networks if the input is normalized properly. Moreover, the performance is better when features are learned automatically than when handcrafted features are used.
There are multiple directions in which the work reported in this thesis can be further extended. Some of the key future directions are listed below:
• The LSTM-based OCR methodology can be directly extended to camera-captured documents. The challenge in camera-captured documents is the presence of curved text-lines. There are several techniques to correct this issue, including image dewarping or text-line extraction directly from the camera-captured documents. If these issues are taken care of, the application of LSTM-based OCR is straightforward and simple.
• The Hierarchical Subsampling LSTM (HSLSTM) networks have shown excellent results on Urdu Nastaleeq OCR. However, their performance could be tested more thoroughly on other scripts to better gauge their potential.
• The LSTM-based OCR reported for Urdu Nastaleeq can be extended further to other similar scripts, e.g., Persian, Pushto, Sindhi, and Kashmiri.
• The OCR of Devanagari can be improved by employing a mechanism that addresses the issue of vertically stacked characters.
MDLSTM networks may yield better results, or alternatively, improved preprocessing may benefit 1D-LSTM networks.
• The generalized OCR framework is presently evaluated for only two scripts (Greek and English). If more datasets containing multiple scripts emerge, they can be used to further establish the performance of this framework.
It is hoped that the work in this thesis helps to fill the gap that was present in the field of OCR for printed documents, and that it will serve as a stepping stone for future research endeavors.

