Application scenarios of photo OCR
Photo Optical Character Recognition (photo OCR), which aims to read scene text in natural images, is an essential step for a wide variety of computer vision tasks, and has enjoyed significant success in several commercial applications. These include street-sign reading for automatic navigation systems, assistive technologies for the blind (such as product-label reading), real-time text recognition and translation on mobile phones, and search/indexing of the vast corpus of image and video on the web.
The field of photo OCR has primarily focused on constrained scenarios with hand-engineered image features. (Here, "constrained" means that there is a fixed lexicon or dictionary and that words have known length at inference time.) Examples of constrained text recognition methods include region-based binarization or grouping [5, 24, 33], pictorial structures with HOG features [47, 46], integer programming with SIFT descriptors [41], Conditional Random Fields (CRFs) with HOG features [32, 31, 39], and Markov models with binary and connected-component features [49]. Some early attempts [26, 53, 10] try to learn local mid-level representations on top of handcrafted features, and the methods in [48, 19, 16] incorporate deep convolutional neural networks (CNNs) [25, 13] for better image feature extraction. These methods work very well when candidate ground-truth word strings are known at test time, but they do not generalize at all to words that are not present in the lexicon.
- Uses two CNNs, one to model character sequences and one for an N-gram language model, and then combines the two with a CRF graphical model
A recent advance in the state of the art that moves beyond this constrained setting was presented by Jaderberg et al. in [17]. The authors report results in the unconstrained setting by constructing two sets of CNNs – one for modeling character sequences and one for N-gram language statistics – followed by a CRF graphical model to combine their activations. This method achieved great success and set a new standard in the photo OCR field. However, despite these successes, the system in [17] does have some drawbacks. For instance, the use of two different CNNs incurs a relatively large memory and computation cost. Furthermore, the manually defined N-gram CNN model has a large number of output nodes (10k output units for N = 4), which increases training complexity – requiring an incremental training procedure and heuristic gradient rescaling based on N-gram frequencies.
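To make the two-CNN structure concrete, here is a minimal PyTorch-style sketch of the kind of system described above (not the authors' implementation from [17]). All layer sizes, the 23-position character head, and the 37-class alphabet are illustrative assumptions; only the roughly 10k N-gram outputs for N = 4 come from the text.

```python
# Illustrative sketch only: two separate CNN heads, later combined by a CRF.
import torch
import torch.nn as nn

class CharSequenceCNN(nn.Module):
    """CNN head that predicts a fixed-length character sequence (sizes assumed)."""
    def __init__(self, max_len=23, num_classes=37):  # 26 letters + 10 digits + blank (assumption)
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
        )
        # one independent classifier per character position
        self.heads = nn.ModuleList(nn.Linear(128, num_classes) for _ in range(max_len))

    def forward(self, x):
        f = self.features(x).flatten(1)                      # (B, 128)
        return torch.stack([h(f) for h in self.heads], 1)    # (B, max_len, num_classes)

class NGramCNN(nn.Module):
    """Second CNN head with ~10k sigmoid outputs, one per dictionary N-gram (N <= 4)."""
    def __init__(self, num_ngrams=10000):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(64, num_ngrams)          # large output layer

    def forward(self, x):
        return torch.sigmoid(self.classifier(self.features(x).flatten(1)))

# The activations of the two heads would then be combined by a CRF graphical
# model (not sketched here); keeping two full CNNs and a ~10k-way N-gram
# output is what drives the memory and training-complexity issues noted above.
```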
- The new method proposed in this paper
Inspired by [17], we continue to focus on the unconstrained scene text recognition task, and we develop a recursive recurrent neural network with attention modeling (R2AM) system that performs image-to-sequence (word string) learning directly, delivering improvements over their work. The three main contributions of this paper are: (1) recursive CNNs with weight sharing, for more effective image feature extraction than a "vanilla" CNN of the same parametric capacity; (2) recurrent neural networks (RNNs) on top of the image features extracted by the recursive CNNs, which implicitly learn a character-level language model – RNNs automatically learn the sequential dynamics of characters naturally present in the training word strings, without the need to manually define N-grams from a dictionary; (3) a sequential attention-based modeling mechanism that performs "soft", deterministic image feature selection as the character sequence is read, and that can be trained end-to-end with standard backpropagation. We pursue extensive experimental validation on challenging benchmark datasets: Street View Text, IIIT5k, ICDAR and Synth90k. We also provide a detailed ablation study examining the effectiveness of each proposed component. Our network architecture achieves new state-of-the-art results and significantly outperforms the previous best reported results for unconstrained text recognition [17]; i.e., we observe absolute accuracy improvements of 9% on Street View Text and 8.2% on ICDAR 2013.
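For concreteness, below is a hedged PyTorch-style sketch of the three R2AM ingredients listed above (recursive weight-shared convolutions, an RNN character decoder, and soft attention). It is an illustration under assumed layer sizes and module names, not the paper's actual architecture or hyperparameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RecursiveConvBlock(nn.Module):
    """(1) Recursive CNN: the same convolution is applied several times
    (weight sharing), deepening the computation without adding parameters."""
    def __init__(self, channels=128, steps=3):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)
        self.steps = steps

    def forward(self, x):
        for _ in range(self.steps):          # identical weights at every iteration
            x = F.relu(self.conv(x))
        return x

class AttentionDecoder(nn.Module):
    """(2) + (3) RNN character decoder with soft attention over CNN feature
    columns; the RNN implicitly learns a character-level language model."""
    def __init__(self, feat_dim=128, hidden=256, num_classes=37, max_len=25):
        super().__init__()
        self.rnn = nn.GRUCell(feat_dim + num_classes, hidden)
        self.attn = nn.Linear(hidden + feat_dim, 1)
        self.out = nn.Linear(hidden, num_classes)
        self.num_classes, self.max_len, self.hidden = num_classes, max_len, hidden

    def forward(self, feats):                # feats: (B, T, feat_dim) feature columns
        B, T, _ = feats.shape
        h = feats.new_zeros(B, self.hidden)
        prev = feats.new_zeros(B, self.num_classes)   # previous character distribution
        logits = []
        for _ in range(self.max_len):
            # "soft" deterministic attention: weight every feature column
            scores = self.attn(torch.cat([h.unsqueeze(1).expand(B, T, -1), feats], -1)).squeeze(-1)
            alpha = F.softmax(scores, dim=1)                    # (B, T)
            context = (alpha.unsqueeze(-1) * feats).sum(1)      # (B, feat_dim)
            h = self.rnn(torch.cat([context, prev], -1), h)
            step_logits = self.out(h)
            prev = F.softmax(step_logits, dim=-1)
            logits.append(step_logits)
        return torch.stack(logits, dim=1)                       # (B, max_len, num_classes)

# The whole pipeline (recursive CNN -> feature columns -> attention RNN) is
# differentiable, so it can be trained end-to-end with standard backpropagation.
```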
Hi, do you have the code for this paper? Thank you very much.
Paper link