wanghaisheng / awesome-ocr

A curated list of promising OCR resources
http://wanghaisheng.github.io/ocr-arxiv-daily/
MIT License

Introduction to Page Layout Analysis (版面分析/版式分析入门) #86

Open wanghaisheng opened 6 years ago

wanghaisheng commented 6 years ago

https://github.com/tmbdev/teaching-dca — a course taught by Thomas Breuel. Plan: 1. convert the materials to PDF; 2. convert the PDF to HTML; 3. translate.

wanghaisheng commented 6 years ago

University of Kaiserslautern

http://www.iupr.com/ This is the home page of the Image Understanding and Pattern Recognition group at the University of Kaiserslautern. The group was headed from 2004-2014 by Prof. Dr. Thomas Breuel.

Prof. Breuel started working at Google in 2014, but is still supervising several students in the department.

Publications of the research group from 2004-2014 can be found on the Publications Page.

Since the summer semester of 2014, Vertr.-Prof. Dr. Marcus Eichenberger-Liwicki has been heading the group as a substitute.

Adnan Ul-Hasan's thesis committee:

- Dean of the Department: Prof. Dr. Klaus Schneider
- Chairperson of the PhD Committee: Prof. Dr. Paul Lukowicz
- Thesis Reviewers: Prof. Dr. Andreas Dengel (DFKI Kaiserslautern); Associate Prof. Dr. Faisal Shafait (SEECS, NUST Pakistan); apl. Prof. Dr. Marcus Liwicki (University of Kaiserslautern)

wanghaisheng commented 6 years ago

Thomas Breuel CV (on www.9x9.com) - Google Docs.pdf http://www.9x9.com/

110-text-recognition.pptx

wanghaisheng commented 6 years ago

http://coen.boisestate.edu/EBarneySmith/sp_lab/past_projects/document-imaging-defect-analysis/

DESCRIPTION: Dr. Barney Smith's PhD dissertation focused on modeling the imaging process of a desktop document scanner and evaluating how that process produced degradations in bilevel document images. Much of her early work expanded on this topic. To improve the performance of DIA, four major themes were investigated:

- Model the nonlinear systems of printing, scanning, photocopying and FAXing, and multiple combinations of these, that produce degraded images, and develop methods to calibrate these models. From a calibrated model one can predict how a document will look after being subjected to these processes; this can be used to develop products that degrade text images less.
- Statistically validate these models. This will give other researchers the confidence to use these models to create large training sets of synthetic characters, with which they can conduct controlled DIA and OCR experiments.
- Estimate the parameters of these models from a short character string, to allow continuous calibration that accounts for spatially-variant systems.
- OCR training: determine how these models and parameters can best be used to improve OCR accuracy by partitioning the training set based on modeled degradations and matching the appropriate partition to the test data at hand.
- Filter: improve the image quality by selecting a filter based on the degradations that are present and the process that caused that degradation.

wanghaisheng commented 6 years ago

APPLICATIONS

Document Image Analysis can be applied to many applications beyond the desktop OCR package that comes with most commercial scanners or PDF readers. Some applications include:

- Reading books and documents for the visually impaired
- Conversion of books to digital libraries
- Signature verification
- Reading license plate or cargo container numbers
- Reading road signs for autonomous or semi-autonomous vehicles
- PDA or tablet PC technology
- Sorting of large document datasets (legal, historical, security)
- Search engines on the Web

MOTIVATION

Document Image Analysis aims to develop algorithms and processes through which machines (computers) can automatically read and develop some basic understanding of documents. Documents include:

- Machine-printed documents – memos, letters, technical reports, books.
- Handwritten documents – personal letters, addresses on postal mail, notes in the margins of documents.
- On-line handwritten documents – writing on PDAs or tablet PCs.
- Video documents – annotating videos based on text in the video clips.
- Music scores – turning sheet music into MIDI or other electronic music formats.

The growth of the World Wide Web has made it easier to make information publicly available, but to make that information useful it must be in computer-readable form so it can be searched and the items of interest retrieved. Documents are converted to computer-readable form through the process of Document Image Analysis (DIA), which encompasses the process of Optical Character Recognition (OCR). An image of a document is made, and the text content must be recognized in order to be searchable. An automated OCR system can reduce the time needed to convert a document to computer-readable form to 25% of the time a human needs to hand-enter the same data. Although much effort has been dedicated to developing methods of automatically converting paper documents into electronic form, and OCR products are commercially available, often for free, many documents that are easy for humans to read still reach only 92% recognition accuracy. This error rate is too high to remove the human from the process, which increases the time and cost of document conversion. Thus there is a need for further research in this field.

Low accuracy rates are most common in documents with image degradations caused by printing, scanning, photocopying and/or FAXing. These four operations all share the processes of spatial and intensity quantization, which are the primary sources of change in the appearance of bilevel images such as characters and line drawings. Camera-based acquisition (such as with a cell phone) adds out-of-focus degradations and perspective distortions. To date, the most common method of overcoming these degradations is to provide the classifier with a large enough variety of samples that it can recognize the degraded characters. However, by understanding the degradation and being able to estimate the degradation characteristics for each document, a more effective method of preprocessing or recognizing the characters can be developed.
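As a toy illustration of the kind of degradation model described above: blur a bilevel glyph with a Gaussian point-spread function, add sensor noise, and re-threshold. All parameter names and values here are illustrative assumptions, not Dr. Barney Smith's calibrated models:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def degrade(glyph, blur_sigma=1.2, noise_std=0.05, threshold=0.5, seed=0):
    """Toy scan-degradation: blur a bilevel glyph with a Gaussian PSF,
    add sensor noise, and re-threshold to bilevel. All parameters are
    illustrative, not a calibrated scanner model."""
    rng = np.random.default_rng(seed)
    gray = gaussian_filter(glyph.astype(float), sigma=blur_sigma)  # optical blur
    gray += rng.normal(0.0, noise_std, gray.shape)                 # sensor noise
    return gray > threshold                                        # re-binarize

glyph = np.zeros((32, 32), dtype=bool)
glyph[8:24, 14:18] = True                    # a crude vertical stroke
print(glyph.sum(), degrade(glyph).sum())     # ink area changes with the parameters
```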

wanghaisheng commented 6 years ago

http://cvit.iiit.ac.in/SSDA/program.html — Talk Details

Attached slides and materials:

- GROUNDTRUTH GENERATION AND DOCUMENT IMAGE DEGRADATION .pdf
- Language Model- Theory and Applications -IAPRlab-master_Utkarsh_Demo.zip
- Historical Document Analysis -2017-summer-school-MarcusLiwicki.pdf
- Detection and cleaning of strike-out texts in offline handwritten documents -Jaipur Strike_out_BBChaudhuri.pdf
- Word Spotting- From Bag-of-Features to Deep Learning -SSDA17-Tutorial-Fink.pdf
- Document Page Layout Analysis Analysis-SSDA_Jaipur_BhabatoshChanda.pdf
- Developing Multilingual OCR and Handwriting Recognition at Google.pdf

Cognitive Natural Language Processing

Monday, January 23

Speaker: Prof. Pushpak Bhattacharya

Abstract: We present in this talk the use of eye tracking for Natural Language Processing, which we call Cognitive Natural Language Processing. NLP is machine-learning dependent these days, and clues from eye tracking provide valuable features in ML for NLP. We study Machine Translation, Sentiment Analysis, Readability, Sarcasm and similar problems to show that cognition-based features augment the efficacy of ML-based NLP manifold. An additional attraction of cognitive NLP is the possible rationalization of compensation for annotation effort on text. The presentation is derived from multiple publications in ACL, EMNLP, NAACL etc., based on work done by PhD and Masters students.

Bio: Prof. Pushpak Bhattacharyya is the current President of the ACL (2016-17). He is the Director of IIT Patna and the Vijay and Sita Vashee Chair Professor in the Computer Science and Engineering Department at IIT Bombay. He was educated at IIT Kharagpur (B.Tech), IIT Kanpur (M.Tech) and IIT Bombay (PhD). He has been a visiting scholar and faculty member at MIT, Stanford, UT Houston and Université Joseph Fourier (France). Prof. Bhattacharyya's research areas are Natural Language Processing, Machine Learning and AI. He has guided more than 250 students (PhD, masters and bachelors), published more than 250 research papers, and led government and industry projects of international and national importance. A significant contribution of his is Multilingual Lexical Knowledge Bases and Projection. Author of the textbook "Machine Translation", Prof. Bhattacharyya is loved by his students for his inspiring teaching and mentorship. He is a Fellow of the National Academy of Engineering and a recipient of the Patwardhan Award of IIT Bombay and the VNMM Award of IIT Roorkee, both for technology development, as well as faculty grants from IBM, Microsoft, Yahoo and the United Nations.

Developing Multilingual OCR and Handwriting Recognition at Google

Monday, January 23

Speaker: Dr. Ashok Popat

Lecture Slides

Abstract: In this talk I will reflect on our team's experiences in developing multilingual OCR and handwriting recognition systems at Google: enabling factors, effective practices, and challenges. I'll tell you what I think I've learned along the way, drawing on some experiences with other projects inside and outside Google.

Bio: Dr. Ashok C. Popat received the SB and SM degrees in Electrical Engineering from the Massachusetts Institute of Technology in 1986 and 1990, and the PhD from the MIT Media Lab in 1997. He is a Staff Research Scientist and manager at Google in Mountain View, California. Prior to joining Google in 2005 he worked at Xerox PARC for 8 years, as a researcher and later as a research area manager. Between 2002 and 2005 he was also a consulting assistant professor of Electrical Engineering at Stanford, where he taught a course "Electronic documents: paper to digital." He has also worked at Motorola, Hewlett-Packard, PictureTel, and EPFL in Switzerland. His areas of interest include signal processing, data compression, machine translation, and pattern recognition. Personal interests: skiing, sailing, hiking, traveling, learning languages.

Word Spotting: From Bag-of-Features to Deep Learning

Tuesday, January 24

Speaker: Prof. Gernot Fink

Abstract: Research in building automatic reading systems has made considerable progress since its inception in the 1960s. Today, quite mature techniques are available for the automatic recognition of machine-printed text. However, the automatic reading of handwriting is a considerably more challenging task, especially when it comes to historical manuscripts. When current methods for handwriting recognition reach their limits, approaches for so-called word spotting come into play. These can be considered specialized versions of image retrieval techniques. The most successful methods rely on machine learning in order to derive powerful models for representing queries for handwriting retrieval.

This lecture will first give a brief introduction to the problem of word spotting and the methodological developments in the field. In the first part of the lecture, classical approaches for learning word spotting models will be described that build on Bag-of-Features (BoF) representations, which were developed in the field of computer vision for learning characteristic representations of image content in an unsupervised manner. It will be shown how word spotting models can be built by applying the BoF principle, and how basic BoF models can be extended by learning common sub-space representations between different modalities.

In the second part of the lecture, advanced models for word spotting will be presented that apply deep learning techniques and currently define the state of the art in the field. After a discussion of the pros and cons of the classical approaches, the foundations of neural networks in general and deep architectures in particular will be laid. By combining the idea of common sub-space representations with a unified framework that can be learned in an end-to-end fashion, unprecedented performance on a number of challenging word spotting tasks can be achieved, as has been demonstrated by the PHOCNet.
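As a rough illustration of the attribute representation behind PHOCNet, here is a simplified PHOC (Pyramidal Histogram of Characters) encoder: a binary vector marking which characters occur in which region of a spatial pyramid over the word. The level choice and overlap rule below are simplified assumptions; the real PHOC configuration also includes bigram levels:

```python
import numpy as np

ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789"

def phoc(word, levels=(2, 3, 4)):
    """Simplified PHOC: character i of an n-character word occupies the
    normalized interval [i/n, (i+1)/n); it is assigned to a pyramid region
    when at least half of that interval overlaps the region."""
    word = word.lower()
    n = len(word)
    vec = np.zeros(len(ALPHABET) * sum(levels), dtype=np.float32)
    offset = 0
    for level in levels:
        for region in range(level):
            r_lo, r_hi = region / level, (region + 1) / level
            for i, ch in enumerate(word):
                if ch not in ALPHABET:
                    continue
                c_lo, c_hi = i / n, (i + 1) / n
                overlap = min(c_hi, r_hi) - max(c_lo, r_lo)
                if overlap >= (c_hi - c_lo) / 2:
                    vec[offset + region * len(ALPHABET) + ALPHABET.index(ch)] = 1.0
        offset += level * len(ALPHABET)
    return vec

print(phoc("test").shape)  # (324,) = 36 characters x (2 + 3 + 4) regions
```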

Bio: Prof. Gernot A. Fink received his diploma in computer science from the University of Erlangen-Nuremberg, Germany, in 1991. From 1991 to 2005, he was with the Applied Computer Science Group at Bielefeld University, Germany, where he received his Ph.D. degree (Dr.-Ing.) in 1995 and his venia legendi (Habilitation) in 2002. Since 2005, he has been a professor at the Technical University of Dortmund, Germany, where he heads the Pattern Recognition in Embedded Systems Group. His research interests are machine perception, statistical pattern recognition, and document analysis. He has published more than 150 papers and a textbook on Markov models for pattern recognition.

Lab-Session: In the accompanying lab-session, participants of the summer school will be able to experiment themselves with different word spotting models and thus obtain hands-on experience with the techniques presented in the lecture.

Lab related material: http://patrec.cs.tu-dortmund.de/cms/en/home/Resources/index.html

Lecture Slides: http://patrec.cs.tu-dortmund.de/pubs/papers/SSDA17-Tutorial-Fink.pdf

Detection and cleaning of strike-out texts in offline handwritten documents

Tuesday, January 24

Speaker: Prof. B. B. Chaudhuri

Lecture Slides

Abstract: The talk starts with a brief study on OCR of offline unconstrained handwritten text, including our BLSTM-based work on Bangla script. It is noted that the published papers on the topic consider ideal inputs, i.e. documents containing no writing errors. However, a free-form creative handwritten page may contain a misspelled or inappropriate word that is struck out by the writer, with the adequate word written next to it. The strike-out may also be longer, e.g. covering several consecutive words or even several lines, after which the writer pens his/her revised statement at the next free space. If a document image with such errors is fed to a handwriting OCR, unpredictable erroneous strings will be generated for the struck-out text. The present talk mainly deals with this strike-out problem in English and Bangla script. A pattern classifier followed by a graph-based method is employed to detect struck-out text and locate the strike-out strokes. For detection, we fed hand-crafted as well as Recurrent Neural Net generated features into an SVM classifier to detect the struck-out words. Then, to locate the strike-out stroke, the skeleton of the text component is computed; the skeleton is treated as a graph, and a shortest-path algorithm that satisfies certain properties of strike-out strokes is employed. To locate zig-zag, wavy, slanted or crossed strike-outs, appropriate modifications of the path-detection algorithm are made. Multi-word/multi-line strike-outs are also tackled in a suitable manner.

Sometimes the user may be interested in deleting the detected strike-out stroke. When this is done, the cleaned text may be better visible for manual analysis, or may be fed to an OCR system for transcript generation of a manuscript (of, say, a famous person). We have employed an inpainting method for such cleaning. Tested on 250 English and 250 Bangla document pages, fairly good results on the above tasks have been obtained.
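A very rough sketch of the skeleton-as-graph idea from the talk, assuming scikit-image and networkx are available and the word image is a clean binary array: skeletonize the ink, connect neighboring skeleton pixels into a graph, and take the shortest path from the leftmost to the rightmost skeleton pixel as a strike-out candidate. The actual method adds stroke-property constraints and handles zig-zag/wavy/slanted strokes:

```python
import networkx as nx
import numpy as np
from skimage.morphology import skeletonize

def strike_out_candidate(word_img):
    """Skeletonize a binary word image (True = ink), link 8-neighboring
    skeleton pixels into a graph, and return the shortest path from the
    leftmost to the rightmost skeleton pixel as a strike-out candidate.
    Assumes the two endpoints are connected through the skeleton."""
    skel = skeletonize(word_img)
    pixels = set(zip(*np.nonzero(skel)))
    g = nx.Graph()
    for y, x in pixels:
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                if (dy or dx) and (y + dy, x + dx) in pixels:
                    # diagonal steps cost sqrt(2) to keep the path geometric
                    g.add_edge((y, x), (y + dy, x + dx),
                               weight=(dy * dy + dx * dx) ** 0.5)
    left = min(pixels, key=lambda p: p[1])
    right = max(pixels, key=lambda p: p[1])
    return nx.shortest_path(g, left, right, weight="weight")
```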

Bio: Prof. Bidyut B. Chaudhuri received his Ph.D. degree from the Indian Institute of Technology, Kanpur, in 1980 and worked as a Leverhulme postdoctoral fellow at Queen's University, UK, in 1981-1982. He joined the Indian Statistical Institute in 1978, where he is currently INAE Distinguished Professor and J. C. Bose Fellow at the Computer Vision and Pattern Recognition Unit. His research interests include pattern recognition, image processing, computer vision, NLP, information retrieval, digital document processing and OCR. He pioneered the first Indian-language Bharati Braille system for the blind, a successful Bangla speech synthesis system, as well as the first workable OCR for the Bangla, Devanagari, Assamese and Oriya scripts. In NLP, a robust Indian-language spell-checker, morphological processor, multi-word expression detector and statistical analyser were pioneered by him.

Some of his technologies have been transferred to industry for commercialization. He has published about 400 research papers in reputed international journals, conference proceedings, and edited books. He has authored/co-authored 8 technical books and holds four international patents. He is a Fellow of Indian national academies including INSA, NASc and INAE. Among international academies, he is a Fellow of IAPR and TWAS, and a Life Fellow of IEEE. He serves as an Associate Editor of IJPRAI, IJDAR and JIETE, and has served as guest editor for special issues of several journals.

Reading behavior analysis for reading-life log and its fundamental technologies

Wednesday, January 25

Speaker: Koichi Kise

Lecture Slides

Abstract: In our daily life, we spend hours reading documents, because reading is our primary means of acquiring information. "Reading-life log" is a field of research that extracts fruitful information for enriching our life by mutual analysis of reading activity and the documents read. We can estimate many things from the results of this analysis, e.g., how much we read (wordometer, reading detection) and how well we understand (the level of understanding and proficiency), both by analyzing eye gaze obtained with eye-trackers. The fundamental technologies supporting reading-life log are sensing of human reading behavior and retrieval of documents captured as images. In my talk, I introduce these fundamental technologies and their application to the implementation of various types of reading-life log.

Bio: Prof. Koichi Kise received B.E., M.E. and Ph.D. degrees in communication engineering from Osaka University, Osaka, Japan in 1986, 1988 and 1991, respectively. From 2000 to 2001, he was a visiting professor at the German Research Center for Artificial Intelligence (DFKI), Germany. He is now a Professor in the Department of Computer Science and Intelligent Systems, and the director of the Institute of Document Analysis and Knowledge Science (IDAKS), Osaka Prefecture University, Japan. He has received awards including the best paper award of IEICE in 2008, the IAPR/ICDAR best paper awards in 2007 and 2013, the IAPR Nakano award in 2010, the ICFHR best paper award in 2010 and the ACPR best paper award in 2011. He serves as the chair of IAPR technical committee 11 (reading systems), a member of the IAPR conferences and meetings committee, and editor-in-chief of the International Journal of Document Analysis and Recognition. His major research activities are in the analysis, recognition and retrieval of documents, images and activities. He is a member of IEEE, ACM, IPSJ, IEEJ, ANLP and HIS.

Demo: I will demonstrate fundamental technologies and implementations of reading-life log using several sensors. Document image retrieval by LLAH (Locally Likely Arrangement Hashing) is one fundamental technology to be demonstrated. I will also show several sensing technologies such as eye-tracking and EOG (electrooculography).

Students will be able to try the sensors to learn more about their functions. In addition, students will have an opportunity to implement simple activity recognition using an eye-tracker.

Document page layout analysis

Wednesday, January 25

Speaker: Prof. Bhabatosh Chanda

Lecture Slides

Abstract: 'Document page layout analysis' usually refers to the decomposition of a page image into textual and various non-textual components, understanding their geometrical and logical structure, and thereafter linking them together for efficient presentation and abstraction. With the growing necessity of automatically transforming complex paper documents into electronic versions, geometrical and logical structure analysis has remained an active research area for decades. Such analysis helps OCR produce its best possible result. It also helps in extracting various logical components such as images and line drawings. In this presentation our objective is to make a quick journey starting from elementary approaches suitable for strictly structured layouts to more sophisticated methods that can handle complicated designer layouts. We also discuss evaluation methodology for layout analysis algorithms and mention various benchmark datasets available for performance evaluation.

Bio: Prof. Bhabatosh Chanda received a B.E. in Electronics and Telecommunication Engineering and a PhD in Electrical Engineering from the University of Calcutta in 1979 and 1988, respectively. His research interests include image and video processing, pattern recognition, computer vision and mathematical morphology. He has published more than 100 technical articles in refereed journals and conferences, authored one book and edited five books. He received the 'Young Scientist Medal' of the Indian National Science Academy in 1989, the 'Computer Engineering Division Medal' of the Institution of Engineers (India) in 1998, the 'Vikram Sarabhai Research Award' in 2002, and the IETE-Ram Lal Wadhwa Gold Medal in 2007. He is also a recipient of a UN fellowship, a UNESCO-INRIA fellowship and the Diamond Jubilee fellowship of the National Academy of Sciences, India. He is a fellow of the Institute of Electronics and Telecommunication Engineers (FIETE), of the National Academy of Sciences, India (FNASc.), of the Indian National Academy of Engineering (FNAE) and of the International Association for Pattern Recognition (FIAPR). He is a Professor at the Indian Statistical Institute, Kolkata, India.

Historical Document Analysis

Friday, January 27

Speaker: Prof. Marcus Liwicki

Lecture Slides

Abstract: I will give an overview of the challenges of historical documents and the current research highlights for various document image analysis (DIA) problems. Historical documents pose very tough challenges to automatic DIA algorithms: typically, exotic scripts and layouts were used, and the documents have degraded over time. I will give an overview of typical processing algorithms and furthermore report on recent trends towards interoperability.

In the first part of the presentation, I will describe methods for line segmentation, binarization, and layout analysis. Very recent deep learning trends in particular have led to remarkable improvements over conventional processing systems. On top of that, if enough data is available, those methods are also much easier to apply, since they perform end-to-end recognition and make several processing steps obsolete. On the basis of examples, I will show that separating the analysis into several independent steps can even lead to problems and worse performance of the later methods. The reasons for that are twofold: first, it is not clear how to define the ground truth (i.e., the expected perfect outcome) of some individual steps; second, early recognition errors can make processing much more difficult for the later stages. The only remaining problem for deep learning is the need for large amounts of training data. I will demonstrate methods to automatically extend existing ground-truthed datasets to generate more training data.

In the second part, I will sketch recent approaches of the Document, Image, and Voice Analysis (DIVA) group towards enabling libraries and researchers in the humanities to more easily use state-of-the-art DIA methods. Common structures, adaptable methods, public datasets, and open services (e.g., DIVAServices, which will be presented in more depth by Marcel Würsch in the next presentation) lead to easier re-use, access, and integration into the tools used at libraries, archives, and research environments.

Lab: It will involve hands-on practices on DIVAServices, web services for Document Image Analysis. The participants will be able to try out state-of-the-art Document Image Processing methods and learn how to easily integrate their own methods into DIVAServices.

Bio: Marcus Liwicki received his M.S. degree in Computer Science from the Free University of Berlin, Germany, in 2004, his PhD degree from the University of Bern, Switzerland, in 2007, and his habilitation degree from the Technical University of Kaiserslautern, Germany, in 2011. Currently he is an apl. professor at the University of Kaiserslautern and a senior assistant at the University of Fribourg. His research interests include machine learning, pattern recognition, artificial intelligence, human-computer interaction, digital humanities, knowledge management, ubiquitous intuitive input devices, document analysis, and graph matching. From October 2009 to March 2010 he visited Kyushu University (Fukuoka, Japan) as a research fellow (visiting professor), supported by the Japanese Society for the Promotion of Science. In 2015, at the age of 32, he received the ICDAR Young Investigator Award, a bi-annual award acknowledging outstanding achievements in pattern recognition by researchers up to the age of 40. Marcus Liwicki has given a number of invited talks at international workshops, universities, and companies, as well as several tutorials at IAPR conferences. He is a co-author of the book "Recognition of Whiteboard Notes – Online, Offline, and Combination", published by World Scientific in October 2008. He has more than 150 publications, including more than 20 journal papers, excluding more than 20 publications that are currently under review or soon to be published.

Analyzing text documents - separating the wheat from chaff

Friday, January 27

Speaker: Dr. Lipika Dey

Lecture Slides

Abstract: The rapid rise of digital text document collections is exciting for decision makers across different sectors, be it academia or industry. While academia is interested in gathering insights about scientific and technical progress in different areas of research, industry is interested in knowing more about its potential consumers and competitors. All this and much more is available today almost free of cost on the open web. However, text data can be extremely noisy and deceptive. Noise creeps in from various sources, some intended and some unintended. While some of this noise can be treated at the pre-processing level, some must be dealt with during the analysis process itself. In this talk we shall take a look at the various pitfalls that need to be carefully avoided or taken care of in order to come up with meaningful insights from text documents.

Demo: Texcape. Given the volume and velocity at which research publications are growing, keeping up with the advances in various fields is a challenging task. However, decision makers, including academics, program managers, venture capital investors, industry leaders and funding agencies, not only need to be abreast of the latest developments but must also be able to assess the future impact of research on industry, academics or society. Automated extraction of key information and insights from these text documents is necessary to help in this endeavor. Texcape is a technology landscaping tool built on top of scientific publications and patents that attempts to help in this task. This demo will show how Texcape performs automated topical analysis of large volumes of text and analyzes evolutions, commercializations and trends to help in collaborative decision making.

Bio: Dr. Lipika Dey is a Senior Consultant and Principal Scientist at Tata Consultancy Services, India, with over 20 years of experience in academic and industrial R&D. She heads the Web Intelligence and Text Mining research group at Innovation Labs. Lipika's research interests are in the areas of content analytics from social media and news, social network analytics, predictive modeling, sentiment analysis and opinion mining, and semantic search of enterprise content. Her focus is on the seamless integration of social intelligence and business intelligence. She is keenly interested in developing analytical frameworks for integrated analysis of unstructured and structured data. Lipika publishes her work in various international conferences and journals, and has also presented her earlier work at the Sentiment Analysis Symposium and the Text Mining Summit. Lipika was awarded the Distinguished Scientist award by TCS in 2012. Prior to joining industry in 2007, Lipika was a faculty member in the Department of Mathematics at the Indian Institute of Technology, Delhi, from 1995 to 2006. She has several publications in international journals and refereed conference proceedings. Lipika has a Ph.D. in Computer Science and Engineering, an M.Tech in Computer Science and Data Processing, and a 5-year Integrated M.Sc in Mathematics from IIT Kharagpur.

Language Model: Theory and Applications

Saturday, January 28

Speaker: Dr. Utkarsh Porwal

Lab related material

Abstract: A language model lets us compute the probability of a sequence of terms, such as words, given a corpus. It is widely used in applications like spell correction, POS tagging, information retrieval, speech recognition and handwriting recognition. In this talk, we will cover the theory of language models from n-gram based models to recent RNN based models, parameter estimation, evaluation, etc. We will also cover a wide range of applications where language modeling is used.
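For instance, a minimal add-k smoothed bigram model captures the estimation step the abstract refers to (a toy sketch, not the lab's code):

```python
from collections import Counter

def train_bigram_lm(corpus, k=1.0):
    """Train an add-k smoothed bigram language model from tokenized
    sentences. Returns a function prob(word, prev) = P(word | prev)."""
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus:
        tokens = ["<s>"] + sent + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    vocab_size = len(unigrams)

    def prob(word, prev):
        # maximum-likelihood count ratio with add-k smoothing
        return (bigrams[(prev, word)] + k) / (unigrams[prev] + k * vocab_size)

    return prob

p = train_bigram_lm([["the", "cat", "sat"], ["the", "dog", "sat"]])
print(p("cat", "the"))  # higher than the probability of an unseen continuation
```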

Lab: In this lab session, participants will learn to train and evaluate different types of language models, such as n-gram-based and RNN-based models, and will be able to compare them in terms of performance, data efficiency, storage, etc.

Bio: Dr. Utkarsh Porwal is an applied researcher at eBay. He works on automatic query rewrites, entity recognition and structured data. Before joining search science, he was part of the trust science group, where he worked on detecting abusive buyers and on feature selection. His research interests lie broadly in the areas of information retrieval, pattern recognition and applied machine learning. He received his Ph.D. from the State University of New York at Buffalo in 2014.

Extreme Classification for Tagging on Wikipedia, Query Ranking on Bing and Product Recommendation on Amazon

Saturday, January 28

Speaker: Prof. Manik Varma

Abstract: The objective in extreme classification is to develop classifiers that can automatically annotate each data point with the most relevant subset of labels from an extremely large label set. In this talk, we will develop a new paradigm for tagging, ranking and recommendation based on extreme classification. In particular, we design extreme multi-label loss functions which are tailored for tagging, ranking and recommendation and show that these loss functions are more suitable for performance evaluation as compared to traditional metrics. Furthermore, we develop novel algorithms for optimizing the proposed loss functions and demonstrate that these can lead to a significant improvement over the state-of-the-art on various real world applications ranging from tagging on Wikipedia to sponsored search advertising on Bing to product recommendation on Amazon. More details including publications, videos, datasets and source code can be found on http://www.manikvarma.org/.

Brief Bio: Prof. Manik Varma is a researcher at Microsoft Research India and an adjunct professor of computer science at IIT Delhi. His research interests span machine learning, computational advertising and computer vision. He has served as an area chair for CVPR, ICCV, ICML, ICVGIP, IJCAI and NIPS. Classifiers that he has developed are running live on millions of devices around the world, protecting them from viruses and malware. Manik has been awarded the Microsoft Gold Star award and the Microsoft Achievement award, won the PASCAL VOC Object Detection Challenge, and stood first in chicken chess tournaments and Pepsi drinking competitions. He is a failed physicist (BSc St. Stephen's College, David Raja Ram Prize), theoretician (BA Oxford, Rhodes Scholar), engineer (DPhil Oxford, University Scholar) and mathematician (MSRI Berkeley, Post-doctoral Fellow).

System Demo: Demo of Indian Language OCRs

Mr. Tushar Patnayak, CDAC, Noida

Abstract: e-Aksharayan, an Indian language OCR, facilitates converting hardcopy printed documents into electronic form using a new approach, leading to, for the first time, a technology for recognizing characters and words in scanned images of documents in a large set of Indian scripts/languages. Optical Character Recognition (OCR) for Indian scripts opens up the possibility of delivering traditional Indian language content, which today is confined to printed books, to readers across the world through electronic means. OCR makes the content searchable as well as readable via a variety of devices such as mobile phones, tablets and e-readers. Further, the same content can now be transformed electronically to meet the needs of the visually challenged through generation of Braille and/or audio books, among other possibilities. Use of OCR on printed Indian language circulars and notifications can make embedded information widely accessible, facilitating effective e-governance; the circulars can now be easily edited, if required, for adaptation to different needs. The OCR process involves first converting printed matter into an electronic image using a scanner or a digital camera, followed by image processing to generate Unicode text, which can be opened in any word-processing application for editing. e-Aksharayan has a user-friendly design and allows intuitive editing of the scanned image and the generated text.

Features of e-Aksharayan are:

- It enables users to harness the power of computers to access printed documents in Indian languages/scripts.
- A number of pre-processing routines are available, such as skew detection and correction, noise removal, and thresholding to convert an input gray-scale document image into a clean binary image for successful recognition. Other pre-processing steps include color image processing; dithering, color highlight, color stamp, underline, annotation and marginal noise removal; and text/non-text separation.
- The present version of e-Aksharayan supports major Indian languages/scripts: Assamese, Bangla, Gurmukhi, Hindi, Kannada, Malayalam, Tamil, Telugu, Urdu, Gujarati, Oriya, Manipuri and Marathi.
- It converts printed document images to editable text with up to 90-95% recognition accuracy at the character level and 85-90% at the word level.
- The current version of e-Aksharayan takes 45 to 60 seconds to process an A4-size page on a standard desktop.
- The digitized text can be converted to Braille for the visually impaired.
- Other applications that can be built around the OCR technology include text-to-speech conversion for the visually impaired, proof-reading for authors, search engines and content analysis, a multilingual tool for mitigating code-of-conduct cases at the Election Commission, interactive learning games/toys for children to understand letter/word formation, and an Android app for an OCR-based dictionary and translator that recognizes multilingual scene text captured from sign-boards, hoardings, etc.

Demo of Indian Language OHWRs

Swapnil Belhe, CDAC Pune

Abstract: With the recent advancement in Indian language Optical Character Recognition (OCR) and Online Handwritten Character Recognition (OHWR) engines, a wide variety of applications has been developed around these engines to cater to various needs. The engines make use of the latest developments in document and handwriting analysis, making them robust to font and writing-style variations. Most of the OCR and OHWR engines use huge collections of data during training, which makes them robust.

The demonstrations will focus on desktop- and mobile-based OCRs for Indian languages and their complexities. At the same time, demonstrations of OHWRs will show the effectiveness of handwriting recognition on handheld devices. An effective way of multi-modal input for form processing in Indian languages using handwriting recognition will be showcased, and various learning games developed using the OCRs and OHWRs will be demonstrated. These demos will also provide a glimpse of future challenges.

wanghaisheng commented 6 years ago

To create a general page segmentation method that does not use any prior knowledge of the layout structure of the documents, we consider page segmentation as a pixel labeling problem and propose a CNN for the pixel labeling task. https://arxiv.org/pdf/1704.01474.pdf
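A minimal sketch of that pixel-labeling formulation in PyTorch (an illustrative toy network, not the architecture from the paper): a fully convolutional stack maps a page image to per-pixel class logits, trained with ordinary cross-entropy against ground-truth label maps:

```python
import torch
import torch.nn as nn

class PixelLabeler(nn.Module):
    """Toy fully convolutional net that labels every pixel of a page image
    with one of n_classes (e.g. text, figure, table, background)."""
    def __init__(self, n_classes=4):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, n_classes, 1),   # 1x1 conv -> per-pixel class scores
        )

    def forward(self, x):                  # x: (B, 1, H, W) grayscale page
        return self.body(x)                # (B, n_classes, H, W) logits

model = PixelLabeler()
logits = model(torch.rand(1, 1, 256, 256))
labels = logits.argmax(dim=1)              # per-pixel class map, (B, H, W)
loss = nn.CrossEntropyLoss()(logits, labels)  # trained against ground-truth maps
```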

wanghaisheng commented 6 years ago

- [10] J. Pastor-Pellicer, M. Z. Afzal, M. Liwicki, and M. J. Castro-Bleda, "Complete system for text line extraction using convolutional neural networks and watershed transform," in Document Analysis Systems (DAS), 2016 12th IAPR Workshop on. IEEE, 2016, pp. 30–35.
- [11] M. Seuret, M. Alberti, R. Ingold, and M. Liwicki, "Pca-initialized deep neural networks applied

Fast CNN-based document layout analysis http://openaccess.thecvf.com/content_ICCV_2017_workshops/papers/w18/Oliveira_Fast_CNN-Based_Document_ICCV_2017_paper.pdf

Automatic document layout analysis is a crucial step in cognitive computing and in processes that extract information from document images, such as specific-domain knowledge base creation, graph and image understanding, extraction of structured data from tables, and others. Even with the progress observed in this field in recent years, challenges remain open, ranging from accurately detecting content boxes to classifying them into semantically meaningful classes. With the popularization of mobile devices and cloud-based services, the need for approaches that are both fast and economical in data usage is a reality. In this paper we propose a fast one-dimensional approach for automatic document layout analysis, considering text, figures and tables, based on convolutional neural networks (CNN). We take advantage of the inherently one-dimensional pattern observed in text and table blocks to reduce the analysis from bi-dimensional document images to 1D signatures, significantly improving overall performance: we obtain considerably faster execution times and more compact data usage with no loss in overall accuracy compared with a classical bi-dimensional CNN approach.
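A sketch of the one-dimensional reduction idea, assuming a binarized page: collapse the page into a per-row ink signature and classify rows with a tiny 1D CNN. Sizes, classes and thresholds below are illustrative assumptions, not the paper's pipeline:

```python
import torch
import torch.nn as nn

def row_signature(page):
    """Collapse an (H, W) binarized page into a 1-D signature: the fraction
    of ink per row. Text and table blocks produce near-periodic patterns."""
    return page.float().mean(dim=1)             # (H,)

classifier = nn.Sequential(                      # tiny 1-D CNN over the signature
    nn.Conv1d(1, 16, kernel_size=9, padding=4), nn.ReLU(),
    nn.Conv1d(16, 3, kernel_size=9, padding=4),  # text / table / figure per row
)

page = torch.rand(512, 512) > 0.9                # stand-in binarized page
sig = row_signature(page).view(1, 1, -1)         # (B=1, C=1, H)
row_logits = classifier(sig)                     # (1, 3, H): per-row block class
```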

wanghaisheng commented 6 years ago

Table Detection Using Deep Learning

https://www.researchgate.net/publication/320243569_Table_Detection_Using_Deep_Learning

Table detection is a crucial step in many document analysis applications, as tables are used to present essential information to the reader in a structured manner. It is a hard problem due to varying layouts and encodings of the tables. Researchers have proposed numerous techniques for table detection based on layout analysis of documents. Most of these techniques fail to generalize because they rely on hand-engineered features which are not robust to layout variations. In this paper, we present a deep learning based method for table detection. In the proposed method, document images are first pre-processed and then fed to a Region Proposal Network followed by a fully connected neural network for table detection. The proposed method works with high precision on document images with varying layouts, including documents, research papers, and magazines. We have evaluated it on the publicly available UNLV dataset, where it beats Tesseract's state-of-the-art table detection system by a significant margin.

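The region-proposal approach the paper describes can be sketched with an off-the-shelf detector: take a pretrained Faster R-CNN (a Region Proposal Network plus a box classifier) and retarget its head to a background/table problem. This uses torchvision's stock model, not the authors' released weights:

```python
import torch
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

# Start from a COCO-pretrained detector and retarget its head to two
# classes: background and table. A sketch of the RPN-based approach only.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes=2)

model.eval()
page = torch.rand(3, 800, 600)                 # stand-in document image
with torch.no_grad():
    detections = model([page])[0]              # dict of boxes, labels, scores
tables = detections["boxes"][detections["scores"] > 0.8]  # confident tables
```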

wanghaisheng commented 6 years ago

A Two-Stage Method for Text Line Detection in Historical Documents https://arxiv.org/abs/1802.03345

This work presents a two-stage text line detection method for historical documents. In the first stage, a deep neural network called ARU-Net labels pixels as belonging to one of three classes: baseline, separator or other. The separator class marks the beginning and end of each text line. The ARU-Net is trainable from scratch with manageably few manually annotated example images (fewer than 50), achieved by utilizing data augmentation strategies. The network predictions are used as input for the second stage, which performs bottom-up clustering to build baselines. The developed method is capable of handling complex layouts as well as curved and arbitrarily oriented text lines, and it substantially outperforms current state-of-the-art approaches. For example, on the complex track of the cBAD: ICDAR2017 Competition on Baseline Detection, the F-value is increased from 0.859 to 0.922. The framework to train and run the ARU-Net is open source.
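The second, bottom-up stage can be illustrated with a much simpler stand-in than the paper's clustering: threshold the network's baseline map and turn each connected component into a left-to-right polyline. The thresholds and minimum sizes below are assumptions, and the real method also exploits the separator class:

```python
import numpy as np
from scipy import ndimage

def baselines_from_map(baseline_prob, thresh=0.5, min_pixels=50):
    """Turn a per-pixel baseline probability map (as an ARU-Net-style
    network would emit) into polylines via connected components."""
    mask = baseline_prob > thresh
    labeled, n = ndimage.label(mask, structure=np.ones((3, 3)))  # 8-connectivity
    polylines = []
    for cc in range(1, n + 1):
        ys, xs = np.nonzero(labeled == cc)
        if len(xs) < min_pixels:
            continue                          # drop spurious responses
        order = np.argsort(xs)                # sort pixels left to right
        polylines.append(np.stack([xs[order], ys[order]], axis=1))
    return polylines
```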

wanghaisheng commented 6 years ago

Fully Convolutional Neural Networks for Page Segmentation of Historical Document Images https://arxiv.org/abs/1711.07695

We propose a high-performance fully convolutional neural network (FCN) for historical document segmentation that is designed to process a single page in one step. The advantage of this model, besides its speed, is its ability to learn directly from raw pixels instead of using preprocessing steps such as feature computation or superpixel generation. We show that this network yields better results than existing methods on different public datasets. For the evaluation of this model we introduce a novel metric that is independent of ambiguous ground truth, called Foreground Pixel Accuracy (FgPA). This pixel-based measure only counts foreground pixels in the binarized page; any background pixel is omitted. The major advantage of this metric is that it enables researchers to compare different segmentation methods on their ability to successfully segment text or pictures, and not on their ability to learn and possibly overfit the peculiarities of an ambiguous hand-made ground-truth segmentation.
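The FgPA metric as described is straightforward to compute; a sketch under the assumption of integer label maps and a boolean ink mask:

```python
import numpy as np

def fgpa(prediction, ground_truth, binarized_page):
    """Foreground Pixel Accuracy: classification accuracy computed only over
    the foreground (ink) pixels of the binarized page; background pixels
    are ignored. All arrays are (H, W); binarized_page is True where ink is."""
    fg = binarized_page.astype(bool)
    correct = (prediction == ground_truth) & fg
    return correct.sum() / fg.sum()

pred = np.array([[1, 1], [0, 2]])
gt   = np.array([[1, 2], [0, 2]])
ink  = np.array([[True, True], [False, True]])
print(fgpa(pred, gt, ink))  # 2 of 3 foreground pixels correct -> 0.666...
```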

wanghaisheng commented 6 years ago

Document Analysis: Research Status and Trends (文档分析技术研究现状与趋势) — Cheng-Lin Liu (刘成林), PhD, Professor, Deputy Director of the Institute of Automation, Chinese Academy of Sciences, and Director of the National Laboratory of Pattern Recognition.

http://www.nlpr.ia.ac.cn/liucl/DA%E7%A0%94%E7%A9%B6%E7%8E%B0%E7%8A%B6%E4%B8%8E%E8%B6%8B%E5%8A%BF.pdf

wanghaisheng commented 6 years ago

Document Layout Analysis, by Garrett Hoch: https://pdfs.semanticscholar.org/presentation/0907/0b09d860a639577a9b5219d065bc47fa28de.pdf (attachment: 0b09d860a639577a9b5219d065bc47fa28de.pdf)

wanghaisheng commented 6 years ago

Open Evaluation Tool for Layout Analysis of Document Images https://arxiv.org/abs/1712.01656

Evaluation tool source code: https://github.com/DIVA-DIA/DIVA_Layout_Analysis_Evaluator

This paper presents an open tool for standardizing the evaluation process of the layout analysis task on document images at the pixel level. We introduce a new evaluation tool that is available both as a standalone Java application and as a RESTful web service. The tool is free and open-source so that it can be a common tool that anyone can use and contribute to. It aims at providing as many metrics as possible for investigating layout analysis predictions, and also provides an easy way of visualizing the results. It evaluates document segmentation at the pixel level and supports multi-labeled pixel ground truth. The tool has been successfully used for the ICDAR2017 competition on Layout Analysis for Challenging Medieval Manuscripts.
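One of the core measurements such a pixel-level evaluator reports is per-class intersection-over-union; a toy sketch (the DIVA tool itself also handles multi-labeled pixels and reports more metrics):

```python
import numpy as np

def per_class_iou(prediction, ground_truth, n_classes):
    """Pixel-level intersection-over-union for each layout class.
    prediction and ground_truth are (H, W) integer label maps."""
    ious = []
    for c in range(n_classes):
        pred_c, gt_c = prediction == c, ground_truth == c
        union = (pred_c | gt_c).sum()
        # NaN for classes absent from both maps
        ious.append((pred_c & gt_c).sum() / union if union else float("nan"))
    return ious
```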

wanghaisheng commented 6 years ago

Text and non-text separation in offline document images: a survey

https://link.springer.com/article/10.1007%2Fs10032-018-0296-z

Separation of text and non-text is an essential processing step for any document analysis system, so it is important to have a clear understanding of the state of the art of text/non-text separation in order to facilitate the development of efficient document processing systems. This paper first summarizes the technical challenges of performing text/non-text separation. It then categorizes offline document images into different classes according to the nature of the challenges one faces, in an attempt to provide insight into the various techniques presented in the literature. The pros and cons of the various techniques are explained wherever possible. Along with evaluation protocols and benchmark databases, the paper also presents a performance comparison of different methods. Finally, it highlights future research challenges and directions in this domain.

wanghaisheng commented 6 years ago

Learning to detect tables in document images using line and text information

wanghaisheng commented 6 years ago

vanBeusekom--DA--Document-Layout-Analysis.pdf — a doctoral dissertation on layout analysis

wanghaisheng commented 6 years ago

http://ccis2k.org/iajit/PDF/July%202018,%20No.%204/10223.pdf A Hybrid Technique for Annotating Book Tables

Table extraction is usually complemented by table annotation to find the hidden semantics in a particular document or book. These hidden semantics are determined by identifying a type for each column, finding the relationships between columns, if any, and the entities in each cell. Though used for small documents and web pages, these approaches have not been extended to table extraction and annotation in books. This paper focuses on detecting, locating and annotating entities in book tables. More specifically, it contributes algorithms for identifying and locating tables in books and annotating table entities using the online knowledge source DBpedia Spotlight. Entities missing from DBpedia Spotlight are then annotated using Google Snippets. It was found that the combined results give higher accuracy and superior performance over the use of DBpedia alone. The approach is complementary to existing table annotation approaches, as it enables us to discover and annotate entities that are not present in the catalogue. We have tested our scheme on computer science books and obtained promising results in terms of accuracy and performance. A Hybrid Technique for Annotating Book Tables.pdf

wanghaisheng commented 6 years ago

Architecture Design and Key Technologies of a New Fixed-Layout Document Format (一种新型版式文档格式的架构设计与关键技术研究)


As carriers of information, documents have played an important role throughout human history and social progress. With the development of electronic technology in recent years, electronic documents have become increasingly widespread. Meanwhile, the rapid development of network technology and ever cheaper, ever more powerful handheld mobile devices have brought the online publishing of electronic documents into a new stage of development. With this progress, however, the diversity of reading terminals poses new challenges for online publishing. We therefore carried out a series of studies on the related problems in the context of online publishing.

First, the diversity of reading terminals creates an urgent need to merge traditional fixed-layout documents with reflowable documents. To address this, building on an analysis of existing document models, starting from what fixed-layout and reflowable documents have in common locally in online publishing, and following how humans analyze the layout when reading book documents, this thesis proposes a new document model for the online publishing of electronic documents: CEBX (Common e-Document of Blending XML). The model blends the necessary fixed-layout data with reflowable information so that a document is produced once and reused on multiple platforms. Unlike existing document models, this thesis constructs the document model from layout blocks, takes the layout block as the basic unit for blending fixed-layout and reflowable information, derives the reflowable information from the fixed-layout data, and endows it with the necessary interactive properties required by electronic publishing. Based on the proposed model, we have implemented the corresponding document production and publishing workflow and obtained good results in practical use.

Second, we use XML to describe the designed CEBX document format. XML, as a tool for processing structured document information, is widely used in databases, networking, document technology and other fields. However, XML data itself suffers from information redundancy and poor local-access efficiency. To suit online publishing, reduce storage overhead and shorten data transfer time, this thesis proposes a new queryable XML compression method, XTrim, which balances compression ratio and query efficiency. Compared with existing methods, XTrim has clear advantages when compressing XML documents. It optimizes the XML Schema information and uses the optimized information to minimize the structural information of the XML document; it partitions text data into blocks by path, and then processes the data with language-specific dictionaries according to the language the data belongs to, achieving better compression. In particular, for non-Latin-script data in XML documents, the thesis gives a compression method based on word composition. To further improve XTrim's compression ratio, the thesis also proposes an optimization for micro data blocks. Thanks to these methods, XTrim achieves better compression than XMill. At the same time, to allow fast local access to the compressed XML data, XTrim also provides query support: a <path, pre> scheme builds an index over the compressed XML document. The <path, pre> scheme uses the information in the XML Schema to effectively bound the size of the index while still answering various queries quickly. Experimental results show that XTrim compresses all kinds of XML documents well while effectively supporting queries.

Finally, to enable existing resources to be quickly produced as CEBX documents, the thesis also analyzes and studies the layout structure of Chinese book documents. The starting point is the same as that of the proposed document model: referring to people's reading habits for book documents, existing methods are combined to recognize the logical structure of book documents, so as to satisfy everyday needs for reflowing electronic documents and extracting their structure. The thesis first studies the recognition of basic page information, then partitions the page layout based on whitespace cover, organizes the page text into text lines according to the partition, and finally aggregates the text lines into the paragraphs of the page, completing an intelligent layout-structure analysis method. Experimental results show that the proposed method works well for recognizing the layout structure of Chinese book documents. The method finds as much "whitespace" on the page as possible and uses it to cut the page, fully exploiting the page layout information. Overall, the approach combines the strengths of top-down and bottom-up methods and achieves good recognition results.

The main contributions of this thesis are:

1. A document model constructed from layout blocks, breaking the barrier between traditional fixed-layout and reflowable documents.
2. An XML data compression method based on XML Schema and language characteristics, with optimization of micro data blocks, achieving good compression.
3. For the compressed data, a <path, pre> scheme that builds a compact index structure, together with an XPath query engine that queries the compressed data efficiently.
4. A whitespace-based intelligent layout-structure recognition method that combines the strengths of top-down and bottom-up approaches to recognize the layout structure of Chinese book documents with improved accuracy.

CEBX has become a standard of the fixed-layout technology industry application alliance, is supported by several publishing platforms and readers, and is under review to become a national standard. The proposed XML compression and layout-structure recognition methods have been put to practical use in related software products, significantly reducing document size and raising the automation level of document production.

Author: Qiu Ruiheng. Discipline: Computer Applied Technology. Degree: PhD, Peking University. Advisor: Tang Zhi. Year: 2010.
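The whitespace-driven page cutting described in the thesis's third part is in the family of the classic recursive XY-cut; a sketch of that general idea, assuming a binarized page, rather than the thesis's exact algorithm:

```python
import numpy as np

def longest_gap(profile, min_gap):
    """Longest run of blank rows/columns in a boolean ink profile, as
    (start, end), or None if no run reaches min_gap."""
    best, run_start = None, None
    for i, has_ink in enumerate(list(profile) + [True]):   # sentinel closes runs
        if not has_ink and run_start is None:
            run_start = i
        elif has_ink and run_start is not None:
            if i - run_start >= min_gap and (best is None or
                                             i - run_start > best[1] - best[0]):
                best = (run_start, i)
            run_start = None
    return best

def xy_cut(page, y0=0, y1=None, x0=0, x1=None, min_gap=15):
    """Recursive XY-cut on a binarized page (True = ink): split the region
    at its widest horizontal, then vertical, whitespace gap and recurse.
    Leaves are layout blocks (y0, y1, x0, x1)."""
    y1 = page.shape[0] if y1 is None else y1
    x1 = page.shape[1] if x1 is None else x1
    region = page[y0:y1, x0:x1]
    if not region.any():
        return []                              # blank region: no block here
    for axis in (0, 1):                        # 0: cut between rows, 1: columns
        profile = region.any(axis=1 - axis)    # ink per row / per column
        gap = longest_gap(profile, min_gap)
        if gap:
            a, b = gap
            if axis == 0:
                return (xy_cut(page, y0, y0 + a, x0, x1, min_gap) +
                        xy_cut(page, y0 + b, y1, x0, x1, min_gap))
            return (xy_cut(page, y0, y1, x0, x0 + a, min_gap) +
                    xy_cut(page, y0, y1, x0 + b, x1, min_gap))
    return [(y0, y1, x0, x1)]                  # no cuttable gap: one block
```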

wanghaisheng commented 6 years ago

https://app.dimensions.ai/details/publication/pub.1034782548 — DeepDeSRT: Deep Learning for Detection and Structure Recognition of Tables in Document Images. Sebastian Schreiber, Stefan Agne, Ivo Wolf, Andreas Dengel, Sheraz Ahmed. 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR) Proceedings.

Attached papers and related reading:

- DeepDeSRT_ Deep Learning for Detection and Structure Recognition of Tables in Document Images.pdf
- Multi-Scale Multi-Task FCN for Semantic Page Segmentation and Table Detection
- Understanding Tables on the Web
- A Table Detection Method for PDF Documents Based on Convolutional Neural Networks
- Generating Schema Labels through Dataset Content Analysis (.pdf)
- Rule-based spreadsheet data transformation from arbitrary to relational tables (.pdf) — code: https://github.com/cellsrg/tabbyxl
- A Saliency-based Convolutional Neural Network for Table and Chart Detection in Digitized Documents (.pdf)
- A Data Driven Approach for Compound Figure Separation Using Convolutional Neural Networks (.pdf)
- Effective and efficient Semantic Table Interpretation using TableMiner+ (.pdf)
- Scatteract: Automated Extraction of Data from Scatter Plots (.pdf)
- Extracting Scientific Figures with Distantly Supervised Neural Networks (.pdf)
- Table Detection Using Deep Learning (PhD_web.pdf, 2017_Deep_Table_ICDAR.pdf)
- 回顾与展望：人工智能在图书馆的应用.pdf (Review and Outlook: Applications of Artificial Intelligence in Libraries)
- Dataset, ground-truth and performance metrics for table detection evaluation

wanghaisheng commented 6 years ago

- Dataset, ground-truth and performance metrics for table detection evaluation.pdf
- Ground-Truth and Performance Evaluation for Page Layout Analysis of Born-Digital Documents.pdf

wanghaisheng commented 6 years ago

Document image classification: two solution families — image-based ("visual similarity") vs. content/OCR-based (domain-specific models built on the extracted text).

Image-based methods fall into three broad categories:

1. Layout/structural similarity of the document images — layout analysis belongs here.
2. Hand-crafted local and/or global image descriptors — manual feature engineering, global vs. local.
3. CNNs that automatically learn and extract features from the document images — automatic feature extraction, essentially a black box.

Classification of Document Page Images.pdf

- Phd_Thesis_Document Image Classification Combining Textual and Visual Features
- Document Image Classification on the Basis of Layout Information.pdf

- [3] A. Dengel, R. Bleisinger, F. Fein, R. Hoch, F. Hones, and M. Malburg. Officemaid — a system for office mail analysis, interpretation and delivery. In International Workshop on Document Analysis Systems, pages 253–276, 1994.
- [4] D. Doermann, C. Shin, A. Rosenfeld, H. Kauniskangas, J. Sauvola, and M. Pietikainen. The development of a general framework for intelligent document image retrieval. In International Workshop on Document Analysis Systems, pages 605–632, 1996.
- [5] X. Hao, J.T.L. Wang, M.P. Bieber, and P.A. Ng. Heuristic classification of office documents. International Journal on Artificial Intelligence Tools, 7:233–265, 1995.
- [6] G. Maderlechner, P. Suda, and T. Bruckner. Classification of documents by form and content. Pattern Recognition Letters, 18(11-13), pages 1225–1231.
- [7] S. Murty, S. Kasif, and S. Salzberg. A system for induction of oblique decision trees. Journal of Artificial Intelligence Research, 2:1–32, 1994.
- [8] S.L. Taylor, M. Lipshutz, and R.W. Nilson. Classification and functional decomposition of business documents. In Proceedings of the International Conference on Document Analysis and Recognition, pages 563–566, 1995.

wanghaisheng commented 6 years ago

Classification of Document Page Images http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.9.6696&rep=rep1&type=pdf

Searching in a large heterogeneous collection of scanned document images often produces uncertain results, in part because of the size of the collection and the lack of an ability to focus queries appropriately. Searching for documents by their type is a natural way to enhance the effectiveness of document retrieval in the workplace [2], and such a system is proposed in [4]. The goal of our work is to build classifiers that can determine the type or genre of a document image. We primarily use layout features, since the layout of a document contains a significant amount of information that can be used to identify its type. Layout analysis is necessary since our input image has no structural definition that is immediately perceivable by a computer. Classification is thus based on "visual similarity" of the structure, without reference to models of particular kinds of pages. There has been some classification work reported, but most of it requires either domain-specific models [3, 5, 6, 8] or is based on text obtained by optical character recognition (OCR) [3, 6, 8].

We propose a method for using the layout structure of documents (i.e., their visual appearance) to facilitate the search and retrieval of documents stored in a multi-genre database by building a supervised classifier. Ideally, we need tools to automatically generate layout features that are relevant for the specific classification task at hand. Class labels for training samples can be obtained manually or by clustering examples. Once the image features and their types are obtained from a set of training images, classifiers can be built. In our experiment, we used 64 image features derived from the University of Washington Image Database I (UW-I) ground truth [1], including the percentages of text and non-text (graphics, image, table, and ruling) zones, the presence of bold font style, font size, and the density of the content area measured by dividing the total content area by the page area.
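A sketch of such a layout-feature genre classifier with scikit-learn; the feature names and numbers below are illustrative stand-ins, not the 64 UW-I ground-truth features:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Each page becomes a vector of zone statistics; a decision tree then
# predicts the genre. All feature names and values are hypothetical.
FEATURES = ["pct_text_area", "pct_graphics_area", "pct_table_area",
            "has_bold", "mean_font_size", "content_density"]

X_train = np.array([
    [0.85, 0.02, 0.00, 1, 10.0, 0.55],   # e.g. a letter-like page
    [0.40, 0.05, 0.45, 0, 8.0, 0.70],    # e.g. a form with tables
])
y_train = ["letter", "form"]

clf = DecisionTreeClassifier(max_depth=4).fit(X_train, y_train)
print(clf.predict([[0.82, 0.03, 0.01, 1, 11.0, 0.50]]))  # -> ['letter']
```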

wanghaisheng commented 6 years ago

Real-Time Document Image Classification using Deep CNN and Extreme Learning Machines https://arxiv.org/pdf/1711.05862.pdf

Abstract—This paper presents an approach for real-time training and testing for document image classification. In production environments, it is crucial to perform accurate and (time-)efficient training. Existing deep learning approaches for classifying documents do not meet these requirements, as they require much time for training and fine-tuning the deep architectures. Motivated by Computer Vision, we propose a two-stage approach. The first stage trains a deep network that works as a feature extractor, and in the second stage, Extreme Learning Machines (ELMs) are used for classification. The proposed approach outperforms all previously reported structural and deep learning based methods with a final accuracy of 83.24% on the Tobacco3482 dataset, leading to a relative error reduction of 25% compared to a previous Convolutional Neural Network (CNN) based approach (DeepDocClassifier). More importantly, the training time of the ELM is only 1.176 seconds and the overall prediction time for 2,482 images is 3.066 seconds. As such, this novel approach makes deep learning-based document classification suitable for large-scale real-time applications. Index Terms—Document Image Classification, Deep CNN, Convolutional Neural Network, Transfer Learning
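The ELM stage is simple enough to sketch in full: a fixed random hidden layer followed by one closed-form least-squares solve for the output weights, which is why training takes seconds. In the paper, X would be deep CNN features; everything below is a generic sketch, not the authors' code:

```python
import numpy as np

class ELM:
    """Extreme Learning Machine: a random, fixed hidden layer plus a
    closed-form least-squares output layer."""
    def __init__(self, n_hidden=1024, seed=0):
        self.n_hidden, self.rng = n_hidden, np.random.default_rng(seed)

    def fit(self, X, y, n_classes):
        self.W = self.rng.normal(size=(X.shape[1], self.n_hidden))
        self.b = self.rng.normal(size=self.n_hidden)
        H = np.tanh(X @ self.W + self.b)       # random nonlinear projection
        T = np.eye(n_classes)[y]               # one-hot targets
        self.beta = np.linalg.pinv(H) @ T      # single least-squares solve
        return self

    def predict(self, X):
        return (np.tanh(X @ self.W + self.b) @ self.beta).argmax(axis=1)

X = np.random.rand(100, 64)                    # stand-in CNN features
y = np.random.randint(0, 4, 100)
print((ELM(256).fit(X, y, 4).predict(X) == y).mean())  # training accuracy
```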

wanghaisheng commented 6 years ago

Document image classification dataset: the Tobacco3482 dataset

http://www.cs.cmu.edu/%7Eaharley/rvl-cdip/ The RVL-CDIP Dataset

| File | Size | md5sum |
| --- | --- | --- |
| rvl-cdip.tar.gz | 38762320458 B (37 GB) | d641dd4866145316a1ed628b420d8b6c |
| labels_only.tar.gz | 6359157 B (6.1 MB) | 9d22cb1eea526a806de8f492baaa2a57 |

Details

The label files list the images and their categories in the following format: `path/to/the/image.tif category`

where the categories are numbered 0 to 15, in the following order:

letter, form, email, handwritten, advertisement, scientific report, scientific publication, specification, file folder, news article, budget, invoice, presentation, questionnaire, resume, memo
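A label file in this format can be read with a few lines (a hypothetical helper, assuming the whitespace-separated layout shown above):

```python
from pathlib import Path

def load_rvl_cdip_labels(label_file):
    """Parse an RVL-CDIP label file: one 'path/to/image.tif category' pair
    per line, where category is an integer 0-15."""
    paths, labels = [], []
    for line in Path(label_file).read_text().splitlines():
        if not line.strip():
            continue                           # skip blank lines
        path, label = line.rsplit(" ", 1)      # path may itself contain no spaces
        paths.append(path)
        labels.append(int(label))
    return paths, labels
```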

wanghaisheng commented 6 years ago

Evaluation of Deep Convolutional Nets for Document Image Classification and Retrieval (2015)

This paper presents a new state-of-the-art for document image classification and retrieval, using features learned by deep convolutional neural networks (CNNs). In object and scene analysis, deep neural nets are capable of learning a hierarchical chain of abstraction from pixel inputs to concise and descriptive representations. The current work explores this capacity in the realm of document analysis, and confirms that this representation strategy is superior to a variety of popular hand-crafted alternatives. Experiments also show that (i) features extracted from CNNs are robust to compression, (ii) CNNs trained on non-document images transfer well to document analysis tasks, and (iii) enforcing region-specific feature-learning is unnecessary given sufficient training data. This work also makes available a new labelled subset of the IIT-CDIP collection, containing 400,000 document images across 16 categories, useful for training new CNNs for document analysis.

Evaluation of Deep Convolutional Nets for Document Image Classification and Retrieval.pdf
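Finding (ii) above — that CNNs trained on non-document images transfer well — suggests using a pretrained network as a fixed feature extractor for retrieval. A sketch with a stock torchvision backbone (the paper itself used AlexNet-style networks, not ResNet):

```python
import torch
from torchvision.models import resnet18   # stand-in backbone, an assumption

# Use a CNN trained on non-document images as a fixed feature extractor
# for document retrieval.
net = resnet18(weights="DEFAULT")
net.fc = torch.nn.Identity()               # drop the classifier: 512-d features
net.eval()

with torch.no_grad():
    db = net(torch.rand(100, 3, 224, 224))       # stand-in document images
    query = net(torch.rand(1, 3, 224, 224))
sims = torch.nn.functional.cosine_similarity(query, db)  # similarity to database
print(sims.topk(5).indices)                               # five nearest documents
```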

wanghaisheng commented 6 years ago

Analysis of Convolutional Neural Networks for Document Image Classification .pdf

wanghaisheng commented 6 years ago

2018: Document Image Classification with Intra-Domain Transfer Learning and Stacked Generalization of Deep Convolutional Neural Networks
Paper: https://arxiv.org/pdf/1801.09321.pdf
Code: https://github.com/rishiabhishek/document-image-classification

Abstract—In this work, a region-based Deep Convolutional Neural Network framework is proposed for document structure learning. The contribution of this work involves efficient training of region based classifiers and effective ensembling for document image classification. A primary level of ‘inter-domain’ transfer learning is used by exporting weights from a pre-trained VGG16 architecture on the ImageNet dataset to train a document classifier on whole document images. Exploiting the nature of region based influence modelling, a secondary level of ‘intra-domain’ transfer learning is used for rapid training of deep learning models for image segments. Finally, stacked generalization based ensembling is utilized for combining the predictions of the base deep neural network models. The proposed method achieves state-of-the-art accuracy of 92.2% on the popular RVL-CDIP document image dataset, exceeding benchmarks set by existing algorithms.
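The stacking step itself is simple to sketch: each region-based base model produces class probabilities on a held-out set, and a meta-learner is fit on their concatenation. The logistic-regression meta-learner below is a common stand-in, not necessarily the one used in the paper:

```python
# Stacked generalization over region-based base models: each base CNN emits
# class probabilities for its region (header, footer, whole page, ...), and a
# meta-learner is trained on the concatenated probabilities. Shapes invented.
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_stacker(probs_holdout, y_holdout):
    """probs_holdout: list of (n_samples, n_classes) arrays, one per base model."""
    meta_X = np.hstack(probs_holdout)          # concatenate base-model outputs
    meta = LogisticRegression(max_iter=1000)
    meta.fit(meta_X, y_holdout)
    return meta

def stack_predict(meta, probs_test):
    return meta.predict(np.hstack(probs_test))
```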


An explanation of stacked generalization (in Chinese): https://www.jianshu.com/p/46ccf40222d6

wanghaisheng commented 6 years ago

https://github.com/PiSchool/enterprise-document-classification https://github.com/jcbgamboa/autodoc

wanghaisheng commented 6 years ago

Page Stream Segmentation with Convolutional Neural Nets Combining Textual and Visual Features https://arxiv.org/abs/1710.03006

In recent years, (retro-)digitizing paper-based files became a major undertaking for private and public archives, as well as an important task in electronic mailroom applications. As a first step, the workflow involves scanning and Optical Character Recognition (OCR) of documents. Preservation of the document contexts of single page scans is a major requirement in this context. To facilitate workflows involving very large amounts of paper scans, page stream segmentation (PSS) is the task of automatically separating a stream of scanned images into multi-page documents. In a digitization project together with a German federal archive, we developed a novel approach based on convolutional neural networks (CNN) combining image and text features to achieve optimal document separation results. Evaluation shows that our PSS architecture achieves an accuracy of up to 93%, which can be regarded as a new state-of-the-art for this task.
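The fusion idea can be sketched as a binary "does a new document start here?" classifier over concatenated visual and textual page embeddings; the PyTorch module below uses invented dimensions and is not the paper's architecture:

```python
# Sketch of page stream segmentation as per-page binary classification over
# fused image and text embeddings. Layer sizes are illustrative.
import torch
import torch.nn as nn

class PageBreakClassifier(nn.Module):
    def __init__(self, img_dim=512, txt_dim=300):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(img_dim + txt_dim, 256), nn.ReLU(),
            nn.Linear(256, 2),  # 0 = continuation page, 1 = first page of a new document
        )

    def forward(self, img_feat, txt_feat):
        return self.head(torch.cat([img_feat, txt_feat], dim=1))
```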

wanghaisheng commented 6 years ago

April 11, 2017
Cutting the Error by Half: Investigation of Very Deep CNN and Advanced Training Strategies for Document Image Classification https://arxiv.org/pdf/1704.03557.pdf

We present an exhaustive investigation of recent Deep Learning architectures, algorithms, and strategies for the task of document image classification to finally reduce the error by more than half. Existing approaches, such as the DeepDocClassifier, apply standard Convolutional Network architectures with transfer learning from the object recognition domain. The contribution of the paper is threefold: First, it investigates recently introduced very deep neural network architectures (GoogLeNet, VGG, ResNet) using transfer learning (from real images). Second, it proposes transfer learning from a huge set of document images, i.e. 400,000 documents. Third, it analyzes the impact of the amount of training data (document images) and other parameters to the classification abilities. We use two datasets, the Tobacco-3482 and the large-scale RVL-CDIP dataset. We achieve an accuracy of 91.13% for the Tobacco-3482 dataset while earlier approaches reach only 77.6%. Thus, a relative error reduction of more than 60% is achieved. For the large dataset RVL-CDIP, an accuracy of 90.97% is achieved, corresponding to a relative error reduction of 11.5%.
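The transfer-learning recipe these papers share is straightforward to sketch in Keras: load ImageNet weights, swap the classifier head, and fine-tune on document images. The hyperparameters below are placeholders, not the paper's settings:

```python
# Fine-tuning an ImageNet-pretrained VGG16 on document images (sketch).
from tensorflow.keras.applications import VGG16
from tensorflow.keras import layers, models, optimizers

base = VGG16(weights="imagenet", include_top=False, pooling="avg",
             input_shape=(224, 224, 3))
out = layers.Dense(10, activation="softmax")(base.output)  # 10 Tobacco-3482 classes
model = models.Model(base.input, out)
model.compile(optimizer=optimizers.Adam(1e-4),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
# model.fit(train_images, train_labels, epochs=..., validation_data=...)
```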

wanghaisheng commented 6 years ago

N. Chen and D. Blostein, “A survey of document image classification: problem statement, classifier architecture and performance evaluation,” Int. Journal of Document Analysis and Recognition, vol. 10, no. 1, pp. 1–16, 2007.

wanghaisheng commented 6 years ago

0 stars: not much of value here. Text Detection and Recognition in images: A survey https://arxiv.org/abs/1803.07278 Text Detection and Recognition in images- A survey .pdf

wanghaisheng commented 6 years ago

https://www.jianshu.com/p/710799b985ef Deep-learning-based object detection in images (in Chinese)

wanghaisheng commented 6 years ago

2016 PhD thesis: Document Image Classification Combining Textual and Visual Features


The pipeline of the building phase of the proposed approach is shown in Figure 4.2; its three main phases can be summarized as follows:

- Text extraction and analysis: OCR is employed to extract textual information from original-sized document images. Texts are analyzed and a dictionary is built for each class.
- Embedding phase: exploiting word position coordinates and the dictionaries, relevant words are emphasized within each sub-sampled document image.
- Training phase: a CNN is trained using the sub-sampled document images from the previous phase.

4.2.1 Text extraction and analysis

Textual information is extracted from each document image through OCR; details about the adopted OCR system and the OCR engine evaluation are provided in Section 5.4. Optical Character Recognition is a difficult task for noisy or low-resolution documents [75], and thus, to discard reading errors, we preprocessed all the automatically extracted text using natural-language dictionaries and stop-word lists. To emphasize class-relevant textual content within each document image, a dictionary containing representative words is generated for each class. This is done by collecting all the words extracted by the OCR engine for all the images belonging to a specific class. To build the final dictionary, we adopt the weighting formula of Peñas et al. [21].
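A sketch of the dictionary-building step; a plain frequency-ratio weight stands in for the Peñas et al. formula, and all names are illustrative:

```python
# Build per-class keyword dictionaries from OCR output, filtered by a language
# wordlist and stop words, keeping the most class-specific terms.
from collections import Counter

def build_dictionaries(docs_by_class, vocabulary, stopwords, top_k=100):
    """docs_by_class: {class_name: [list of OCR word lists, one per image]}."""
    class_counts = {
        c: Counter(w for doc in docs for w in doc
                   if w in vocabulary and w not in stopwords)
        for c, docs in docs_by_class.items()
    }
    total = Counter()
    for counts in class_counts.values():
        total.update(counts)
    dictionaries = {}
    for c, counts in class_counts.items():
        # weight = how exclusive the word is to this class (stand-in for Peñas et al.)
        weighted = {w: n / total[w] for w, n in counts.items()}
        dictionaries[c] = sorted(weighted, key=weighted.get, reverse=True)[:top_k]
    return dictionaries
```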

4.2.2 Embedding phase

Starting from the dictionaries of relevant words per class, the aim of this phase is to embed the textual information obtained from OCR within the document images. We perform this to let relevant keyword information remain recognizable even at low resolutions where the text is unreadable. We create a specific visual color feature for each class keyword contained in the processed image and in at least one of the dictionaries built in Sec. 4.2.1. The added visual feature consists of a rectangle of the class color drawn across each class keyword found in the document image. In more detail, given an image, OCR is performed. The OCR engine output is composed of both the sequence of recognized words and their positions within the image. Once the words are extracted, for each word the system checks whether it belongs to one or more of the class keyword dictionaries. If the word belongs to only one dictionary, a rectangle of the associated class color is drawn across it using the obtained position coordinates; otherwise, if it belongs to more than one dictionary, the rectangle is divided by the number of corresponding dictionaries and each part is colored using the associated class colors. In Figure 4.3 the same documents shown in Figure 4.1 are shown after the embedding phase. Rectangles of the respective classes' colors are drawn; it can easily be noted that, for documents belonging to the same class, the associated class color is the one mainly used: in the first row red is the most used color, in the second green, while in the third the majority of the keyword rectangles are blue. Experiments reported in Chapter 5 show the effectiveness of the embedding phase for these specific three classes of documents. During CNN training, document images are sub-sampled to a fixed dimension and the text therefore becomes unreadable; however, the marked keyword rectangles remain visible and allow the model to infer textual content. Not only is class information added, but the keywords' positions are also highlighted, giving the model extra characteristics that are exploited during the classification phase.
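The embedding phase is easy to picture in code: draw a class-colored rectangle over each detected keyword so the cue survives sub-sampling. Here is a sketch with Pillow, assuming the OCR engine supplies word bounding boxes; the class names and colors are illustrative:

```python
# Overlay class-colored rectangles on OCR keywords; words found in several
# class dictionaries get their rectangle split among the classes.
from PIL import Image, ImageDraw

CLASS_COLORS = {"invoice": "red", "letter": "green", "report": "blue"}  # illustrative

def embed_keywords(image_path, ocr_words, dictionaries):
    """ocr_words: iterable of (word, x0, y0, x1, y1) from the OCR engine."""
    img = Image.open(image_path).convert("RGB")
    draw = ImageDraw.Draw(img)
    for word, x0, y0, x1, y1 in ocr_words:
        hits = [c for c, words in dictionaries.items() if word in words]
        if not hits:
            continue
        step = (x1 - x0) / len(hits)  # split the box among matching classes
        for i, c in enumerate(hits):
            draw.rectangle([x0 + i * step, y0, x0 + (i + 1) * step, y1],
                           fill=CLASS_COLORS[c])
    return img
```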

4.2.3 Training phase

A deep Convolutional Neural Network is employed as the classification model. Our proposal consists of using the images from the previous steps, where textual content information is transformed into visual features and stored in the document images, to train the network. A common practice with CNNs is to exploit transfer learning [76]. This technique consists of pre-training a network on a large dataset and then exploiting it either as a fixed feature extractor or by fine-tuning the adopted CNN. In the first scenario, given a CNN, a training phase on a different dataset is performed; after that, the last fully-connected layer is removed and the remaining convolutional network is treated as a fixed feature extractor for the new dataset. On the other hand, the second strategy consists of fine-tuning the weights of the pre-trained network by continuing the back-propagation. A popular pre-training dataset is ImageNet. The ImageNet dataset [77] is composed of over 15 million labeled high-resolution images in over 22000 categories; a subset of 1.2 million images divided into 1000 categories is used as the training set in the ILSVRC ImageNet challenges, and networks trained on it are often used for the transfer learning methodology. Supported by the state-of-the-art results obtained by Harley et al. [42], we also implement transfer learning using the ImageNet dataset and the CNN model of Krizhevsky et al. [49]. Implementation details are given in Section 5.1. Multiple experiments demonstrating the effectiveness of our method are reported in the next section of this manuscript.

wanghaisheng commented 6 years ago

Open Evaluation Tool for Layout Analysis of Document Images https://arxiv.org/abs/1712.01656

This paper presents an open tool for standardizing the evaluation process of the layout analysis task of document images at pixel level. We introduce a new evaluation tool that is available both as a standalone Java application and as a RESTful web service. This evaluation tool is free and open-source in order to be a common tool that anyone can use and contribute to. It aims at providing as many metrics as possible to investigate layout analysis predictions, and also at providing an easy way of visualizing the results. This tool evaluates document segmentation at pixel level and supports multi-labeled pixel ground truth. Finally, this tool has been successfully used for the ICDAR2017 competition on Layout Analysis for Challenging Medieval Manuscripts.
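For intuition, pixel-level layout evaluation typically boils down to per-class metrics such as intersection-over-union between a predicted label image and the ground truth; the sketch below illustrates the metric, not the tool's own code:

```python
# Per-class IoU between a predicted label image and ground truth.
import numpy as np

def per_class_iou(pred, gt, n_classes):
    """pred, gt: integer label images of identical shape."""
    ious = {}
    for c in range(n_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        ious[c] = inter / union if union else float("nan")
    return ious
```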

https://www.digitisation.eu/tools-resources/demonstrator-platform/
https://transkribus.eu/wiki/index.php/REST_Interface
https://github.com/DIVA-DIA/LayoutAnalysisEvaluator

LayoutEvaluation_1.8.129.zip Layout Evaluation User Guide.pdf http://www.primaresearch.org/tools/PerformanceEvaluation

wanghaisheng commented 6 years ago

Code: https://github.com/dhlab-epfl/dhSegment Paper: https://arxiv.org/pdf/1804.10371.pdf

dhSegment: A generic deep-learning approach for document segmentation

For historical documents.

wanghaisheng commented 6 years ago

A probabilistic framework for handwritten text line segmentation (a 47-page paper)

Paper: https://arxiv.org/abs/1805.02536 Code:

We successfully combine the Expectation-Maximization algorithm and variational approaches for parameter learning and computing inference on Markov random fields. This is a general method that can be applied to many computer vision tasks. In this paper, we apply it to handwritten text line segmentation. We conduct several experiments that demonstrate that our method deals with common issues of this task, such as complex document layouts or non-Latin scripts. The obtained results prove that our method achieves state-of-the-art performance on different benchmark datasets without any particular fine-tuning step.

ICDAR 2009 and 2013 handwriting segmentation contest datasets. These datasets contain regular text documents where the text is the main part of the page. In general the documents are free of graphical or non-text elements, although some of them may contain small amounts of noise. The ICDAR 2009 dataset is composed of 200 test images with 4043 text lines. The documents contain the same extract of text written by several writers in several languages (English, German, Greek and French). The ICDAR 2013 dataset is an update of the previous one. It contains a set of 150 test images with 2649 text lines, also written by different writers and in several languages. New features comprise the addition of new, more complex languages such as Indian Bangla, and new layouts such as multi-paragraph and complex skewed and cramped documents. Figure 5 shows some examples of documents from this dataset.

On the other hand, we evaluate on the documents of the George Washington database [53]. This database is composed of 20 gray-scale images from the George Washington Papers at the Library of Congress, dated from the 18th century. The documents are written in English in a longhand script. This database adds a set of different challenges with respect to the previous ones due to the old script style, overlapping lines and a more complex layout. Also, documents may contain non-text elements such as stamps or line separators. We show several examples in Figure 6. We use the same ground truth introduced for this task in [35], since there is no public ground truth for the task of line segmentation. For this reason, it is not possible to compare with any other methods apart from previous works and [35].

Last, we test our method on a collection of administrative documents with handwritten annotations. This is a more heterogeneous and complex dataset, since it contains documents with multiple text regions, each of them with different characteristics such as orientation and writing style. The collection includes letter-type documents, annotations in machine-printed documents, information from bank checks and other documents with complex layouts. The set of documents in the dataset is the result of the application of a previous machine-printed text separation [54], in order to remove all possible non-handwritten components. We apply the line segmentation algorithm on the handwritten layer without any particular filtering process. The dataset is written in English and French and is composed of 433 document images. We show some examples of documents in Figure 7.

wanghaisheng commented 6 years ago

Script Identification in Natural Scene Image and Video Frame using Attention based Convolutional-LSTM Network

Paper: https://arxiv.org/abs/1801.00470 Code:

Script identification plays a significant role in analysing documents and videos. In this paper, we focus on the problem of script identification in scene text images and video scripts. Because of low image quality, complex backgrounds and the similar layout of characters shared by some scripts like Greek, Latin, etc., text recognition in those cases becomes challenging. Most of the recent approaches generally use a patch-based CNN network with summation of the obtained features, or only a CNN-LSTM network, to get the identification result. Some use a discriminative CNN to jointly optimize mid-level representations and deep features. In this paper, we propose a novel method that involves extraction of local and global features using a CNN-LSTM framework and weighting them dynamically for script identification. First, we convert the images into patches and feed them into a CNN-LSTM framework. Attention-based patch weights are calculated by applying a softmax layer after the LSTM. Then we do patch-wise multiplication of these weights with the corresponding CNN features to yield local features. Global features are also extracted from the last cell state of the LSTM. We employ a fusion technique which dynamically weights the local and global features for an individual patch. Experiments have been done on two public script identification datasets, SIW-13 and CVSI2015. The proposed framework achieves superior results in comparison to conventional methods.
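The attention mechanism can be sketched as a softmax over per-patch scores derived from LSTM outputs, with the last cell state serving as the global feature; the dimensions below are illustrative (13 classes as in SIW-13), not the paper's exact design:

```python
# Attention-weighted fusion of per-patch CNN features with an LSTM global feature.
import torch
import torch.nn as nn

class PatchAttentionFusion(nn.Module):
    def __init__(self, feat_dim=256, n_scripts=13):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, feat_dim, batch_first=True)
        self.score = nn.Linear(feat_dim, 1)
        self.cls = nn.Linear(2 * feat_dim, n_scripts)

    def forward(self, patch_feats):                    # (batch, n_patches, feat_dim)
        out, (h, c) = self.lstm(patch_feats)
        attn = torch.softmax(self.score(out), dim=1)   # per-patch weights
        local = (attn * patch_feats).sum(dim=1)        # attention-weighted local feature
        global_feat = c[-1]                            # last cell state as global feature
        return self.cls(torch.cat([local, global_feat], dim=1))
```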

wanghaisheng commented 6 years ago

https://github.com/PRImA-Research-Lab/PAGE-XML

wanghaisheng commented 6 years ago

http://www.music.mcgill.ca/~ich/classes/mumt611_07/Evaluation/liang97performance.pdf Performance evaluation of document layout analysis algorithms on the UW data set

wanghaisheng commented 6 years ago

http://www.mdpi.com/2313-433X/3/4/62/htm DocCreator: A New Software for Creating Synthetic Ground-Truthed Document Images

https://github.com/DocCreator/DocCreator Abstract: Most digital libraries that provide user-friendly interfaces, enabling quick and intuitive access to their resources, are based on Document Image Analysis and Recognition (DIAR) methods. Such DIAR methods need ground-truthed document images to be evaluated/compared and, in some cases, trained. Especially with the advent of deep learning-based approaches, the required size of annotated document datasets seems to be ever-growing. Manually annotating real documents has many drawbacks, which often leads to small reliably annotated datasets. In order to circumvent those drawbacks and enable the generation of massive ground-truthed data with high variability, we present DocCreator, a multi-platform and open-source software able to create many synthetic image documents with controlled ground truth. DocCreator has been used in various experiments, showing the interest of using such synthetic images to enrich the training stage of DIAR tools.

wanghaisheng commented 6 years ago

http://www.succeed-project.eu/sites/default/files/deliverables/Succeed_600555_WP3_D3.1_Annex1.pdf Succeed List of Tools Succeed_600555_WP3_D3.1_Annex1.pdf

wanghaisheng commented 6 years ago

A free cloud service for OCR (2016) https://gupea.ub.gu.se/bitstream/2077/42228/1/gupea_2077_42228_1.pdf

wanghaisheng commented 6 years ago

http://www.primaresearch.org/

http://www.europeana-newspapers.eu/public-materials/deliverables/

https://github.com/KBNLresearch National Library of the Netherlands / Research

wanghaisheng commented 6 years ago

http://ceng.anadolu.edu.tr/CV/EDLines/demo.aspx

Related detectors: LSD and the Hough transform. EDLines: Edge Drawing (ED)-Based Real-Time Line Segment Detection https://github.com/frotms/line_detector

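For a quick experiment with classical line segment detection from Python, OpenCV's contrib module ships a fast line detector in the same family of methods (EDLines itself is C++ in the linked repo); this assumes opencv-contrib-python is installed, and the input path is hypothetical:

```python
# Detect and draw line segments with OpenCV's FastLineDetector (ximgproc).
import cv2

img = cv2.imread("page.png", cv2.IMREAD_GRAYSCALE)   # hypothetical input image
fld = cv2.ximgproc.createFastLineDetector()
lines = fld.detect(img)                              # each line is [[x0, y0, x1, y1]]
vis = fld.drawSegments(cv2.cvtColor(img, cv2.COLOR_GRAY2BGR), lines)
cv2.imwrite("lines.png", vis)
```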

wanghaisheng commented 6 years ago

Improving Document Clustering by Eliminating Unnatural Language https://arxiv.org/pdf/1703.05706.pdf

Technical documents contain a fair amount of unnatural language, such as tables, formulas, pseudo-code, etc. Unnatural language can be an important factor of confusing existing NLP tools. This paper presents an effective method of distinguishing unnatural language from natural language, and evaluates the impact of unnatural language detection on NLP tasks such as document clustering. We view this problem as an information extraction task and build a multiclass classification model identifying unnatural language components into four categories. First, we create a new annotated corpus by collecting slides and papers in various formats, PPT, PDF, and HTML, where unnatural language components are annotated into four categories. We then explore features available from plain text to build a statistical model that can handle any format as long as it is converted into plain text. Our experiments show that removing unnatural language components gives an absolute improvement in document clustering of up to 15%. Our corpus and tool are publicly available.
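The paper's recipe, reduced to a sketch: classify spans as natural vs. unnatural, drop the unnatural ones, then cluster what remains. The TF-IDF features, logistic-regression classifier, and k-means below are stand-ins, not the authors' model:

```python
# Filter unnatural-language spans with a trained classifier, then cluster.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

def cluster_without_unnatural(train_spans, train_labels, docs, n_clusters=10):
    """train_labels: span categories including the label 'natural'.
    docs: each document is a list of text spans."""
    vec = TfidfVectorizer()
    clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(train_spans),
                                                train_labels)
    cleaned = []
    for doc in docs:
        keep = [s for s in doc if clf.predict(vec.transform([s]))[0] == "natural"]
        cleaned.append(" ".join(keep))
    X = TfidfVectorizer().fit_transform(cleaned)   # re-vectorize the cleaned documents
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X)
```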

wanghaisheng commented 6 years ago

https://github.com/RaymondMcGuire/BOOK-CONTENT-SEGMENTATION-AND-DEWARPING

Uses an FCN to segment the book's content and background, then dewarps the pages.

wanghaisheng commented 6 years ago

Scribble Based Interactive Page Layout Segmentation Using Gabor Filter. This repository presents the code of the paper of the same title published at ICFHR2016; this version uses the GrabCut implementation from OpenCV.

https://github.com/majeek/scribbleSegmentation

wanghaisheng commented 6 years ago

2017 LAREX – A semi-automatic open-source Tool for Layout Analysis and Region Extraction on Early Printed Books

wanghaisheng commented 6 years ago

https://github.com/fascarzacs/historicalDocumentsPageSegmentation