Awesome OCR
This list contains links to great software tools, libraries, and literature related to Optical Character Recognition (OCR).
Contributions are welcome, as is feedback.
Software
OCR engines
- tesseract - The definitive Open Source OCR engine
Apache 2.0
- EasyOCR - OCR engine built on PyTorch by JaidedAI,
Apache 2.0
- ocropus - OCR engine based on LSTM,
Apache 2.0
- ocropus 0.4 - Older v0.4 state of Ocropus, with tesseract 2.04 and iulib, C++
- kraken - Ocropus fork with sane defaults
- gocr - OCR engine under the GNU Public License led by Joerg Schulenburg.
- Ocrad - The GNU OCR.
GPL
- ocular - Machine-learning OCR for historic documents
- SwiftOCR - fast and simple OCR library written in Swift
- attention-ocr - OCR engine using visual attention mechanisms
- RWTH-OCR - The RWTH Aachen University Optical Character Recognition System
- simple-ocr-opencv and its fork - A simple pythonic OCR engine using opencv and numpy
- Calamari - OCR Engine based on OCRopy and Kraken
- doctr - A seamless & high-performing OCR library powered by Deep Learning
Older and possibly abandoned OCR engines
- Clara OCR - Open source OCR in C
GPL
- Cuneiform - CuneiForm OCR was developed by Cognitive Technologies
- Eye - an experimental Java OCR (image-to-text) application
- kognition - An omnifont OCR software for KDE
- OCRchie - Modular Optical Character Recognition Software
- ocre - o.c.r. easy
- xplab - A GTK 2 tool for pattern matching
- hebOCR - Hebrew character recognition library (previously named hocr, see Wikipedia article)
GPL
OCR file formats
hOCR
- hocr-tools - Tools for doing various useful things with hOCR files,
Apache 2.0
- hocr-spec - hOCR 1.2 specification
- ocr-transform - CLI tool to convert between hOCR and ALTO,
MIT
- hocr-parser - hOCR Specification Python Parser
- hOCRTools - hOCR to ALTO conversion XSLT
ALTO XML
TEI
- TEI-OCR - TEI customization for OCR generated layout and content information
- TEI SIG on Libraries - Best Practices for TEI in Libraries
- GDZ - METS/TEI-based GDZ document format
PAGE XML
OCR CLI
- OCRmyPDF - OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched
- Pdf2PdfOCR - A tool to OCR a PDF (or supported images) and add a text "layer" (a "pdf sandwich") in the original file making it a searchable PDF. GUI included. Tesseract and cuneiform supported.
- Ocrocis - Project manager interface for Ocropy, see also external project homepage
- tesseract-recognize - Tesseract-based tool that outputs results in PAGE XML format (docker image).
OCR GUI
- moz-hocr-editor - Firefox Addon for editing hOCR files (discontinued)
- qt-box-editor - QT4 editor of tesseract-ocr box files.
- ocr-gt-tools - Client-Server application for editing OCR ground truth.
- Paperwork - Using scanners and OCR to grep paper documents the easy way.
- Paperless - Scan, index, and archive all of your paper documents.
- gImageReader - gImageReader is a simple Gtk/Qt front-end to tesseract-ocr.
- VietOCR - A Java/.NET GUI frontend for the Tesseract OCR engine, including jTessBoxEditor, a graphical Tesseract box data editor
- PoCoTo - Fast interactive batch corrections of complete OCR error series in OCR'ed historical documents.
- OCRFeeder - GTK graphical user interface that allows the users to correct characters or bounding boxes, ODT export and more.
- PRImA PAGE Viewer - Java based viewer for PAGE XML files (layout + text content). Also supports ALTO XML, FineReader XML, and hOCR.
- LAREX - A semi-automatic open-source tool for Layout Analysis and Region EXtraction on early printed books.
- archiscribe - Web application for transcribing OCR ground truth from Archive.org. Deployed instance available at https://archiscribe.jbaiter.de/, results are available in @jbaiter/archiscribe-corpus.
- nw-page-editor - Simple app for visual editing of Page XML files. Provides desktop and server docker-based versions.
OCR Preprocessing
OCR as a Service
OCR evaluation
OCR libraries by programming language
Crystal
Elixir
- tesseract_ocr - Elixir library wrapping the tesseract executable.
Go
- gosseract - Golang OCR library, wrapping Tesseract-ocr.
Java
- Tess4J - Java Native Access bindings to Tesseract.
- tess-two - Tools for compiling Tesseract on Android and Java API.
.NET
Object Pascal
PHP
Python
- pytesseract - A Python wrapper for Google Tesseract.
- pyocr - A Python wrapper for Tesseract and Cuneiform.
- ocrodjvu - A library and standalone tool for doing OCR on DjVu documents, wrapping Cuneiform, gocr, ocrad, ocropus and tesseract
- tesserocr - A Python wrapper for the tesseract-ocr API
JavaScript
- ocracy - Pure JavaScript LSTM RNN implementation based on ocropus
- gocr.js - JavaScript port (emscripten) of gocr
- ocrad.js - JavaScript port (emscripten) of ocrad
- tesseract.js - JavaScript port (emscripten) of Tesseract
- node-tesseract-ocr - A simple wrapper for the Tesseract OCR package.
- node-tesseract-native - C++ module for node providing OCR with tesseract and leptonica.
Ruby
- rtesseract - Ruby library wrapping the tesseract and imagemagick executables.
- ruby-tesseract - Native Tesseract bindings for Ruby MRI and JRuby
- ocr_space - API wrapper for free ocr service ocr.space. Includes CLI
Rust
- tesseract.rs - Rust bindings for tesseract OCR.
- leptess - Productive and safe Rust bindings/wrappers for tesseract and leptonica.
R
Swift
- Tesseract OCR iOS - Swift and Objective-C wrapper for Tesseract OCR.
- SwiftOCR - Fast and simple OCR library written in Swift. Optimized for recognizing short, one line long alphanumeric codes.
OCR training tools
- glyph-miner - A system for extracting glyphs from early typeset prints
- ocrodeg - Document image degradation for OCR data augmentation
Datasets
Ground Truth
- archiscribe-corpus - >4,200 lines transcribed from 19th Century German prints via archiscribe
CC-BY 4.0
- CIS OCR Test Set - 2 example documents each in German/Latin/Greek with ground truth for PoCoTo
- Rescribe - Transcriptions of Caroline Minuscule Manuscripts
PDM 1.0
- CLTK - Corpora from Classical Language Toolkit
PDM 1.0
- DIVA-HisDB - 150 pages (PAGE-XML) of three medieval manuscripts
CC-BY-NC 3.0
- EarlyPrintedBooks - ~8,800 lines from several early printed books
CC-BY-NC-SA 4.0
- EEBO-TCP - 25,363 EEBO documents transcribed by TCP
PDM 1.0
- ECCO-TCP - 2,188 ECCO documents transcribed by TCP
PDM 1.0
- eMOP-TCP - 2,188 ECCO-TCP documents, cleaned up by eMOP
PDM 1.0
- Evans-TCP - 4,977 Evans documents transcribed by TCP
- FDHN - Finnish Digitised Historical Newspapers, Paper, (free) registration required, Terms of Use
- FROC-MSS - 4 Old French Medieval Manuscripts
CC-BY 4.0
- GERMANA - 764 Spanish manuscript pages, (free) registration required
non-commercial use only
- GT4HistOCR - Ground Truth for German Fraktur and Early Modern Latin
CC-BY 4.0
- imagessan - Sanskrit images & ground truth (Devanagari script)
- IMPACT-BHL - 2,418 pages (PAGE-XML) from the Biodiversity Heritage Library, XML@GitHub
CC-BY 3.0
- IMPACT-BL - 294 pages (PAGE-XML) from the British Library, (free) registration required
PDM 1.0
- IMPACT-BNE - 215 pages (PAGE-XML) from the National Library of Spain, (free) registration required, XML@GitHub
CC-BY-NC-SA 4.0
- IMPACT-BNF - 151 pages (PAGE-XML) from the National Library of France, (free) registration required
CC-BY-NC-SA 4.0
- IMPACT-KB - 142 pages (PAGE-XML) from the National Library of the Netherlands
CC-BY 4.0
- IMPACT-NKC - 187 pages (PAGE-XML) from the Czech National Library, (free) registration required
CC-BY-NC-SA 4.0
- IMPACT-NLB - 19 pages (PAGE-XML) from the National Library of Bulgaria, (free) registration required
CC-BY-NC-ND 4.0
- IMPACT-NUK - 209 pages (PAGE-XML) from the National Library of Slovenia, (free) registration required
CC-BY-NC-SA 4.0
- IMPACT-PSNC - 478 pages (PAGE-XML) from four Polish digital libraries, XML@GitHub
CC-BY 3.0
- LascivaRoma/lexical - Transcription of 19th century lexical resources for Latin learning
- MJSynth - 9 million synthetic images covering 90k English words
- OCR19thSAC - 19,000 pages Swiss Alpine Club yearbooks transcribed via Text+Berg digital
CC-BY 4.0
- OCR-D - 180 pages (PAGE-XML) of German historical prints from OCR-D
CC-BY-SA 4.0
- OCR_GS_Data - Double-checked Arabic Gold Standard from OpenITI
- old-books - 322 old books from Project Gutenberg
GPL 3.0
- PRImA-ENP - 528 pages (PAGE-XML) of historic newspapers from Europeana Newspapers, (free) registration required
PDM 1.0
- RODRIGO - 853 Spanish manuscript pages, (free) registration required
non-commercial use only
- Toebler-OCR - (Kraken) Ground Truth transcription of a few pages of the Tobler-Lommatzsch: Altfranzösisches Wörterbuch
Literature
OCR-related publication and link lists
Blog Posts and Tutorials
OCR Showcases
- abbyy-finereader-ocr-senate - Using OCR to parse scanned Senate Financial Disclosure forms.
- cvOCR - An OCR system for recognizing resume or cv text, implemented in Python and C and based on tesseract
- MathOCR - A printed scientific document recognition system, pre-alpha
Academic articles
2011 and before
2012
2013
2014
2015
- TypeWright: An Experiment in Participatory Curation (2015) Bilansky
- On crowd-sourcing OCR postcorrection
- Benchmarking of LSTM Networks (2015) Breuel
- Recognition of Historical Greek Polytonic Scripts Using LSTM (2015) Simistira, Ul-Hasan, Papavassiliou, Gatos, Katsouros, Liwicki
- A Segmentation-Free Approach for Printed Devanagari Script Recognition (2015) Karayil, Ul-Hasan, Breuel
- A Sequence Learning Approach for Multiple Script Identification (2015) Ul-Hasan, Afzal, Shafait, Liwicki, Breuel
2016
2017
2018