wanghaisheng / awesome-ocr

A curated list of promising OCR resources
http://wanghaisheng.github.io/ocr-arxiv-daily/
MIT License
1.66k stars 351 forks

Adnan Ul-Hasan的博士论文-第五章 印刷体的OCR #81

Closed wanghaisheng closed 5 years ago

wanghaisheng commented 6 years ago

In recent times, Machine Learning (ML) based algorithms have been able to achieve very promising results on many pattern recognition tasks, such as speech, handwriting, activity and gesture recognition. However, they have not been thoroughly evaluated for recognizing printed text. Printed OCR is similar to other sequence learning tasks like speech and handwriting recognition, and it can therefore also reap the benefits of high-performing ML algorithms. Various challenges that hamper the development of a robust OCR system have been discussed in Chapter 3. Looking at these challenges closely, one realizes that a human reader does not face many of these issues while reading a particular script. Human reading is powerful because of the ability to process the context of any text. Similarly, the internal feedback mechanism of Long Short-Term Memory (LSTM) networks enables them to process context effectively, thereby rendering them highly suitable for text recognition tasks. This chapter discusses the use of LSTM networks for OCR on three modern scripts. The first part of the chapter, Section 5.1, overviews the complete design of the LSTM-based OCR system. The second part, from Section 5.2 to Section 5.4, reports the experimental evaluations for modern English, Devanagari and Urdu Nastaleeq scripts.

5.1 Design of LSTM-Based OCR System

This section provides the necessary details about the LSTM-based OCR methodology used for the experiments reported for modern English (Section 5.2), Devanagari (Section 5.3) and Urdu Nastaleeq (Section 5.4). LSTM networks are described in detail in Appendix A. For the experiments reported in this chapter,

only 1D-LSTM networks have been utilized. The MDLSTM architecture produced worse results in our preliminary experiments with printed English and Fraktur [BUHAAS13] and is therefore not considered. Some preliminary experiments on the Urdu Nastaleeq script using Hierarchical Subsampling LSTM (HSLSTM) networks yielded promising results; however, these networks have not yet been tested on other scripts. The complete process of the LSTM-based OCR system is shown in Figure 5.1. To use 1D-LSTM networks, text-line image normalization is the only important preprocessing step. This is because these networks are not translation invariant in the vertical dimension, so this dimension has to be fixed before using them. The various normalization methods used in this thesis are described in Appendix B. A few free parameters need to be tuned in order to use 1D-LSTM networks; they are discussed in the following section. The features used for the LSTM networks are described in Section 5.1.2, while the performance metric is defined in Section 5.1.3.
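The height-normalization step can be sketched as follows. This is a minimal nearest-neighbour rescaling to a fixed height, not the more careful normalization methods described in Appendix B; the function name and default height are illustrative:

```python
import numpy as np

def normalize_line(img: np.ndarray, target_height: int = 32) -> np.ndarray:
    """Rescale a grayscale text-line image (H x W) to a fixed height,
    preserving the aspect ratio. Nearest-neighbour resampling only;
    serves as a stand-in for the thesis's normalization (Appendix B)."""
    h, w = img.shape
    scale = target_height / h
    target_width = max(1, int(round(w * scale)))
    # index arrays mapping each output pixel back to a source pixel
    rows = np.clip((np.arange(target_height) / scale).astype(int), 0, h - 1)
    cols = np.clip((np.arange(target_width) / scale).astype(int), 0, w - 1)
    return img[rows[:, None], cols[None, :]]
```

With the vertical dimension fixed this way, every column of the result can serve as one input frame for the 1D-LSTM.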

wanghaisheng commented 6 years ago

5.1.1 Network Parameters Selection

There are four parameters that require tuning when employing 1D-LSTM networks: the height of the text-line image, the hidden layer size, the learning rate, and the momentum. The input image height normalization depends upon the data; the specific values can be found in the relevant sections. The momentum is kept fixed at 0.9 for all the experiments. For the remaining two parameters (hidden layer size and learning rate), suitable values are found empirically. An analysis has been carried out to find the optimal hidden layer size and learning rate for the Urdu Nastaleeq script. In summary, the best values for the hidden layer size and the learning rate are found to be 100 and 0.0001 respectively. These values also match those reported by other researchers [Gra12, GLF+08], and they are found to work well in the other experiments reported in this thesis. Therefore, these values are kept fixed for all experiments in this chapter and for those reported in the rest of the thesis. To find the optimal hidden layer size, the learning rate and momentum are kept constant at 0.0001 and 0.9 respectively. LSTM networks have been trained with hidden layers comprising 20, 40, 60, 80, 100, 120, 140 and 160 LSTM cells. The recognition errors on the test set as a function of hidden layer size are shown in Figure 5.2, and the corresponding training times in Figure 5.3. From these figures we can deduce that, firstly, increasing the hidden layer size decreases the recognition error, but training a network with a large hidden layer requires more time. Secondly, the increase in training time is almost linear, while increasing the hidden layer size from 100 to 160 improves the recognition accuracy by no more than 5%.
It was therefore decided to select 100 as the hidden layer size for the present work. In the next step, fixing the hidden layer size at 100, the learning rate was varied among 0.001, 0.0001 and 0.00001. The recognition errors on the test set are compared in Figure 5.4. It is evident from the figure that a learning rate of 0.0001 is the most suitable choice.
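The two-stage empirical tuning described above (sweep the hidden layer size at a fixed learning rate, then sweep the learning rate at the chosen size) can be sketched as a skeleton; `train_and_eval` is a hypothetical stand-in for training a line recognizer and returning its test-set CER, not part of the thesis's toolchain:

```python
def sweep(train_and_eval):
    """Two-stage grid search as described in Section 5.1.1.
    `train_and_eval(hidden, lr, momentum)` is assumed to return the
    test-set character error rate (%) for one trained network."""
    momentum = 0.9  # kept fixed for all experiments
    # Stage 1: vary hidden layer size at a fixed learning rate of 1e-4.
    best = None
    for hidden in [20, 40, 60, 80, 100, 120, 140, 160]:
        err = train_and_eval(hidden=hidden, lr=1e-4, momentum=momentum)
        if best is None or err < best[0]:
            best = (err, hidden)
    _, hidden = best
    # Stage 2: vary the learning rate with the chosen hidden layer size.
    best_lr = min([1e-3, 1e-4, 1e-5],
                  key=lambda lr: train_and_eval(hidden=hidden, lr=lr,
                                                momentum=momentum))
    return hidden, best_lr
```

In the thesis's experiments this procedure settles on 100 hidden cells and a learning rate of 0.0001.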

wanghaisheng commented 6 years ago

5.1.2 Features

LSTM networks, like other modern machine learning methods, have been shown to yield very good results on raw pixel values. No hand-crafted features are thus required; the LSTM-based line recognizer learns the discriminating features from the input images themselves. Yousefi et al. [YSBS15] demonstrated that 1D-LSTM networks with automatic feature learning can outperform MDLSTM networks and 1D-LSTM networks with hand-crafted features. Apart from the implicit features of baseline and x-height of individual characters, no other features have been extracted for any work reported in this thesis.
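As a sketch of what "raw pixel values" means in practice (the function name is illustrative, not the OCRopus API): a height-normalized line image simply becomes a left-to-right sequence of column vectors fed to the 1D-LSTM:

```python
import numpy as np

def line_to_frames(norm_img: np.ndarray) -> np.ndarray:
    """Convert a height-normalized text-line image (32 x W, grayscale
    0-255) into the input sequence for a 1D-LSTM: one 32-dimensional
    raw-pixel frame per horizontal position, scaled to [0, 1].
    No hand-crafted features are computed."""
    return norm_img.T.astype(np.float32) / 255.0  # shape (W, 32)
```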

wanghaisheng commented 6 years ago

5.1.3 Performance Metric

The recognition accuracy is measured as the Character Error Rate (CER, %) in terms of the Levenshtein distance (more commonly known as the edit distance), which is the ratio between the insertion, deletion and substitution errors and the total number of characters:

    CER (%) = (I + D + S) / (Total Characters) × 100    (5.1)

where I denotes the insertion, D the deletion and S the substitution errors.
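Equation (5.1) is straightforward to compute; a minimal sketch (function names are ours, not from the thesis):

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Edit distance: minimum number of insertions + deletions +
    substitutions turning `ref` into `hyp` (dynamic programming)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (r != h)))    # substitution
        prev = curr
    return prev[-1]

def cer(ref: str, hyp: str) -> float:
    """Character Error Rate (%) as in Eq. (5.1)."""
    return 100.0 * levenshtein(ref, hyp) / len(ref)
```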

wanghaisheng commented 6 years ago

5.2 Printed English OCR

Printed English¹ has remained the main focus of OCR research over many decades, and modern OCR algorithms claim very low recognition errors. However, in practice, the generalization of such systems to new types of real-world documents is quite unsatisfactory. This happens mainly because of the nature of the OCR data: the new text to which an OCR system is applied often differs significantly from all the training samples that the OCR system has seen during training. There is currently no standard testing procedure or widely used dataset in the document analysis community to address this issue. The following sub-section summarizes the research work reported so far for printed English. Experimental evaluation is discussed after describing the related work. LSTM-based OCR achieves very low error rates on a standard database. An error analysis is presented at the end of this section.

¹This section is based on the research work presented in [BUHAAS13].

5.2.1 Related Work

The literature on OCR of printed English is very rich, and an overview of decades of research in this field is prohibitively laborious. Mori et al. [MSY92] presented a historical review of OCR research and development in 1992. They divided their review into two parts. In the first part, they described the history of OCR research, covering many template matching techniques and structural analysis methods. In the second part, they described various generations of commercial OCR products, starting from the early 1960s systems with very limited functionality up to the commercial systems available between 1980 and 1990, such as hand-held scanners, flat-bed scanners and page readers. The use of ANNs for text recognition began in the 1970s. Fukushima [Fuk80] presented a self-organizing neural network architecture to recognize simulated characters. This network consisted of two hidden layers and was trained in an unsupervised manner.
The structure of this network was very similar to that proposed by Hubel and Wiesel [HW65] for the visual nervous system. LeCun [LC89] introduced Convolutional Neural Networks (CNN) for isolated handwritten numeral recognition. In this type of neural network, a specialized layer, referred to as a feature map, scans the input image at different locations to extract features; multiple feature maps are used to extract different features from the image. Jackel et al. [JBB+95] used the above-mentioned CNN, "LeNet", for handwritten character recognition. This system was based on a segmentation-based philosophy, where individual character candidates were first extracted from a given string. LeNet was then applied to see whether it could recognize a candidate with high confidence. A simple threshold identified the candidates that were recognized with high probability; the others were discarded as incorrect segments. Marinai et al. [MGS05] presented a survey on the use of ANNs for various document analysis tasks, including preprocessing, layout analysis, character segmentation, word recognition and signature verification. For the OCR task, they divided the techniques into individual character recognition and word recognition. Hidden Markov Models (HMM) were proposed by Rabiner [Rab89] in 1989 and became extremely popular in the field of continuous speech recognition. Schwartz et al. [SLMZ96] adapted a speech recognition system into a language-independent OCR system called "BBN BYBLOS". On a multifont Arabic OCR corpus, they reported a character error rate of 1.9%. El-Mahallaway [EM08] proposed an omni-font OCR system for Arabic script using the HMM methodology. However, HMM-based systems are mainly employed for handwritten text recognition, similar to automatic speech recognition. Various attempts have been made to develop hybrid ANN/HMM methods to overcome the demerits of HMM-based methods.
The ANN part of such systems extracts discriminative features, and the HMM part is used as a segmentation-free recognizer. Rashid [Ras14] reported a system for printed OCR using this hybrid approach. In this method, a Multilayer Perceptron (MLP) was used to extract features from a given text-line. For this purpose, individual characters were extracted using the character segmentation method reported in [Bre01]. Features were extracted using a 30 × 20 window scanning over a character-segmented text-line, for both character and non-character classes. An AutoMLP [BS08] was used to learn them from 95 classes, including a non-character (garbage) class. HMMs were then trained on these features using the standard Baum-Welch algorithm. They reported an accuracy of 98.41% on a standard dataset of printed English. Most of the literature on these hybrid systems combines MLPs with HMMs; however, Graves [GS05] described the use of an RNN/HMM hybrid for phoneme recognition, reporting better results than those of ANN/HMM hybrids or simple HMM-based approaches.

5.2.2 Database

To evaluate the LSTM-based OCR method, we used the UW3 dataset [LRHP97], representing 1,600 pages of document images from scientific journals and other common sources. Text-line images and the corresponding GT text have been extracted using the transcriptions provided with this database. Text-lines containing mathematical equations are not used during either training or testing. Overall, we used a random subset of 95,338 text-lines in the training set and 1,020 text-lines in the test set. Some sample images are shown in Figure 3.1.

5.2.3 Experimental Evaluation and Results

In these experiments, the text-lines are normalized to a height of 32 in a preprocessing step. Both the left-to-right and right-to-left LSTM layers contain 100 LSTM memory blocks.
The learning rate is set to 1e−4, and the momentum to 0.9. The training is carried out for one million steps (roughly corresponding to 100 epochs, given the size of the training set). Test set errors are reported every 10,000 training steps and plotted. The configuration of the trained LSTM network is shown in Figure 5.5; the two hidden layers correspond to the bidirectional mode of the LSTM network. The process of LSTM training is illustrated in Figure 5.6. It is interesting to note from this figure that after seeing only 5,400 text-lines, the network is able to recognize almost all characters. This shows that LSTM networks generally do not suffer from the 'over-training' problem. The LSTM network achieves a 0.6% error on the test set (total number of characters N = 50,632). To compare the results with other contemporary OCR systems, the same test set has been utilized. The results have been compared with an old version of OCRopus [Bre08], Tesseract [Smi13] and ABBYY [ABB13]. Tesseract achieves a recognition error of 1.299% when run in line-wise mode with an English language model; ABBYY achieves 0.85% using the "English" setting; and OCRopus achieves 2.14%. Figure 5.7 presents the comparison of the three OCR systems on the UW3 data collection. It should be noted that all these systems employ language modelling techniques to post-process the raw output, and in some cases other sophisticated techniques like font recognition and adaptivity. The LSTM network, on the other hand, achieves its results without any language modelling, post-processing, adaptation, or use of a dictionary. The running time is under a second for a regular text-line on a modern desktop PC.

5.2.4 Error Analysis

There are a total of 313 errors. The top confusions are 'space' deletions (34 times), 'period' confused with 'comma' (25), 'space' insertions (16), 'period' deletions (10), 'comma' confused with 'period' (6), 'y' confused with 'v' (5), 'I' deletions (5) and 'i' deletions (4). Representative inputs and outputs are shown in Figure 5.8. The LSTM networks are able to recognize text in a variety of font sizes, styles and degradations (touching characters, partial characters). Errors appear when a significant part of a character is missing, or in the case of capital characters.

5.2.5 Conclusions

The results presented in this section show that the combination of text-line normalization and 1D-LSTM yields excellent results for English OCR. Improvements of 0.3%–0.6% error may seem small, but they are enormously significant both in terms of differences between OCR systems and in practical applications, greatly reducing the need for manual post-processing. These results suggest that error rates for LSTM-based OCR without any language model are considerably lower than those achieved by segmentation-based approaches, HMM-based approaches, or commercial systems, even with language models. Treating the input text-line like a sequence of "frames" over time is related to HMM-based approaches [LSN+99] and Document Image Decoding [KC94], but the LSTM approach has only three parameters: the input size, the number of memory units, and the learning rate.

wanghaisheng commented 6 years ago

5.3 Printed Devanagari OCR

Hindi, the national language of India, is based on the Devanagari script. It is the fourth most widely spoken language in the world, with 400 million speakers. A great wealth of ancient classical literature and scientific and religious books is available in Hindi/Sanskrit, so there is a great demand to convert them into machine-readable documents. Moreover, a well-developed OCR technology for Hindi can assist in various fields, as demonstrated by OCR for the Latin script. However, despite the envisaged applications, OCR research for Devanagari still lags behind that for Latin scripts. This can be attributed to various challenges; Section 3.5 discusses some of the challenges pertinent to Devanagari OCR in detail. This section² describes the LSTM-based OCR solution for the printed Devanagari script. A short literature review is given in the following section before presenting the details of the experimental design and the results obtained.

²This section is based on the work reported in [KUHB15].

5.3.1 Related Work

The first efforts towards the recognition of Devanagari characters in printed documents started in the 1970s. Researchers at the Indian Institute of Technology, Kanpur, India developed a syntactic pattern analysis system for the Devanagari script [SM73]. In the 1990s, Chaudhuri et al. [CP95] and Pal et al. [PC97] developed the first complete end-to-end OCR system for Devanagari. Although in the 1990s OCR for Devanagari was restricted to the research level, in the early 2000s it took a major leap when the Centre for Development of Advanced Computing (C-DAC), India released the first commercial Hindi OCR, called "Chitrankan" [Pal04]. Shaw et al. [SPS08] and Bhattacharya et al. [BPSB06] used HMMs for handwritten Devanagari character recognition. When using HMMs for OCR, statistical features play an important role. HMMs also require a large training set for estimating the parameters for reliable recognition [Pal04]. Jawahar et al.
[JKK03] used Support Vector Machines (SVM) for the recognition of printed Devanagari characters in a multilingual OCR engine. Principal Component Analysis (PCA) was used to reduce the dimensionality of the feature space. Even though SVMs are a good choice where training data is limited, the main hurdle in using the SVM method is the selection of a proper kernel [Pal04]. Bhattacharya et al. [BC09] used MLPs for the classification of Devanagari numerals; each image was subjected to three MLP classifiers, each corresponding to a particular resolution. Singh et al. [SYVY10] used ANNs with features such as mean distance and histograms of projection based on pixel location and value. This ANN consisted of two hidden layers and was trained using the conventional backpropagation algorithm. ANNs are easy to implement and, in addition to classification, also provide a confidence value for the classification [JDM00]. However, a major disadvantage of feedforward ANNs is that they cannot remember context. Of late, LSTMs have also appeared in Devanagari OCR research. Sankaran et al. [SJ12] used bidirectional LSTMs for word classification in printed Devanagari documents with good results (Character Error Rate (CER) = 5.65%).

5.3.2 Database

A new database, Deva-DB, has been proposed in this thesis to advance OCR research for printed Devanagari. Further details about this database can be found in Section 4.4.1. The test set is divided into two groups. The first consists of 1,000 synthetic text-line images generated from a different text corpus (other than the one used for training). The second consists of 621 real scanned text-line images (as described in Section 4.4.1) covering different fonts and different levels of degradation. Figure 5.9 shows a few sample images from the second set, containing the real scanned data.
5.3.3 Experimental Evaluation and Results

Three experiments, differing in the number of fonts in the training and test data, have been performed: (i) Single Font–Single Font (train–test), (ii) Multi Font–Single Font, and (iii) Multi Font–Multi Font. These experiments differ significantly from the experiments by Sankaran et al. [SJ12], since we use whole unsegmented line images (with a space character included as a label in the output layer) for training the network. Moreover, no hand-crafted features are extracted from the text-line images. The line recognizer from the OCRopus OCR system has been adapted to work with the Devanagari script. The LSTM network has been trained at two levels: first with images from a single font only (Lohit Hindi), and second with a training set consisting of images from multiple fonts (Lohit Hindi, Mangal, Samanata, Kalimati, NAGARI SHREE, Sarai, DV ME Shree). The results of the experiments are summarised in Table 5.1.

5.3.4 Error Analysis

Sample images of the scanned text-lines, along with the output of the LSTM-based line recognizer, are shown in Figure 5.9. Errors made by the network are highlighted in red. Apart from errors due to confusions between similar shapes, one main reason for errors in these images is ink spread on the scanned text-lines, which makes it harder for LSTM networks to correctly recognize similar characters. Although the confusion matrix in Table 5.2 shows most of the confusions to be between similar shapes, the top confusions (◌ं, ◌्, र) happen to be deletions (the network missing a character) or insertions (the network erroneously inserting a character). This appears very strange at first look, but a deeper analysis of the problem leads us to the conjunct characters in Devanagari. The character '◌ं' is a vowel sign which, when combined with a consonant, appears as a dot on top of the consonant.
For instance, when combined with the consonant 'त' (tha), the conjunct character appears as 'तं' (than). A network trained with distorted images can treat '◌ं' (with high probability) as a distortion in the image and fail to detect it in some cases. The first image in Figure 5.9 shows an example of such a deletion on the character ज. The character '◌्' indicates that the preceding character should fuse with the succeeding character. If 'प' (pa) is to fuse with the character 'र' (ra), the code sequence would be 'प' + '◌्' + 'र'. The compound character has the shape 'प्र' (pra), and it can be seen that the shape of the compound character represents 'प' (pa) more than 'र' (ra). This is the case with all consonants that fuse with 'र', i.e. they take the shape of the first consonant. It is therefore highly probable that the network will predict it as 'प' (pa). When this happens, we have a deletion of two characters, '◌्' and 'र'. This explains why the top three confusions are deletions. In the fourth image in Figure 5.9, a conjunct character is replaced by the consonant 'ख' because of a similar shape; this is one such example of deletion of '◌्'. The top deletion errors can be reduced by taking the pixel variation in the vertical direction into consideration as well, whereas other substitution errors can be removed by training on data that has more samples of the substituted shapes. To compare the performance of the LSTM networks, we evaluated the well-known OCR engine Tesseract [Tes14] on our real test set. Tesseract showed an error of 11.96% (with the default Hindi model) on the same test set on which the LSTM networks produce an error of 9%. The confusion matrix from Tesseract also shows the top confusions to be deletions, and the character '◌ं' again appears as the top deleted character.
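The kind of confusion tally discussed above (per-character deletions, insertions and substitutions) can be extracted from aligned GT/prediction pairs; a sketch using Python's difflib for the alignment, rather than the exact edit-distance backtrace a full evaluation tool would use:

```python
from collections import Counter
from difflib import SequenceMatcher

def confusions(pairs):
    """Tally (ground_truth, prediction) character confusions over a list
    of (ref, hyp) text-line pairs. An empty string on either side marks
    a deletion (network missed a character) or an insertion."""
    counts = Counter()
    for ref, hyp in pairs:
        sm = SequenceMatcher(None, ref, hyp, autojunk=False)
        for op, i1, i2, j1, j2 in sm.get_opcodes():
            if op == "replace":
                for r, h in zip(ref[i1:i2], hyp[j1:j2]):
                    counts[(r, h)] += 1       # substitution
            elif op == "delete":
                for r in ref[i1:i2]:
                    counts[(r, "")] += 1      # deletion
            elif op == "insert":
                for h in hyp[j1:j2]:
                    counts[("", h)] += 1      # insertion
    return counts
```

Sorting `counts.most_common()` then gives the top confusions of the kind shown in Table 5.2.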
5.3.5 Conclusions

The complex nature of the Devanagari script (involving fused/conjunct characters) makes OCR research a challenging task. Since words in Devanagari have a connected form, we proposed LSTM as a suitable classifier. A new database, Deva-DB, comprising GT text-line images from various scanned pages and synthetically generated text-lines, has been proposed. The OCRopus line recognizer has been adapted and trained on this database. This LSTM-based system yields a character error rate of 1.2% when the test fonts match those of the training data, but the error rate increases to 9% when tested on scanned data (containing a different set of fonts). The important issue that the network faced while classifying characters is that of conjunct characters and cases where characters are vertically stacked; the shape and position of these vertically stacked glyphs vary widely across fonts.