.. image:: https://github.com/mittagessen/kraken/actions/workflows/test.yml/badge.svg
    :target: https://github.com/mittagessen/kraken/actions/workflows/test.yml

kraken is a turn-key OCR system optimized for historical and non-Latin script
material.

kraken's main features are:

- `Right-to-Left <https://en.wikipedia.org/wiki/Right-to-left>`_,
  `BiDi <https://en.wikipedia.org/wiki/Bi-directional_text>`_, and
  Top-to-Bottom script support
- `ALTO <https://www.loc.gov/standards/alto/>`_, PageXML, abbyyXML, and hOCR
  output
- `Public repository <https://zenodo.org/communities/ocr_models>`_ of model
  files

kraken only runs on Linux or Mac OS X. Windows is not supported.

The latest stable releases can be installed from `PyPI <https://pypi.org>`_:

::

  $ pip install kraken

If you want direct PDF and multi-image TIFF/JPEG2000 support, it is necessary
to install the ``pdf`` extras package from PyPI:

::

  $ pip install kraken[pdf]

or install ``pyvips`` manually with pip:

::

  $ pip install pyvips

Conda environment files are provided for the seamless installation of the main
branch as well:

::

  $ git clone https://github.com/mittagessen/kraken.git
  $ cd kraken
  $ conda env create -f environment.yml

or:

::

  $ git clone https://github.com/mittagessen/kraken.git
  $ cd kraken
  $ conda env create -f environment_cuda.yml

for CUDA acceleration with the appropriate hardware.

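
The choice between the two environment files can be scripted. The following is
a minimal sketch and not part of kraken itself: it assumes that finding the
``nvidia-smi`` tool on the ``PATH`` is a reasonable proxy for CUDA-capable
hardware, and the helper name ``pick_env_file`` is made up for illustration.

.. code-block:: shell

    # Hypothetical helper: print the conda environment file to use.
    # Assumption: nvidia-smi on PATH indicates a usable NVIDIA GPU.
    pick_env_file() {
        if command -v nvidia-smi >/dev/null 2>&1; then
            echo environment_cuda.yml
        else
            echo environment.yml
        fi
    }

Inside a clone of the repository, ``conda env create -f "$(pick_env_file)"``
would then create the matching environment.
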
Finally, you'll have to scrounge up a model to do the actual recognition of
characters. To download the default model for printed French text and place it
in the kraken directory for the current user:

::

  $ kraken get 10.5281/zenodo.10592716

A list of libre models available in the central repository can be retrieved by
running:

::

  $ kraken list

To recognize text on an image using the default parameters, including the
prerequisite steps of binarization and page segmentation:

::

  $ kraken -i image.tif image.txt binarize segment ocr

To binarize a single image using the nlbin algorithm:

::

  $ kraken -i image.tif bw.png binarize

To segment an image (binarized or not) with the new baseline segmenter:

::

  $ kraken -i image.tif lines.json segment -bl

To segment and OCR an image using the default model(s):

::

  $ kraken -i image.tif image.txt segment -bl ocr -m catmus-print-fondue-large.mlmodel

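
To process a whole directory, the single-image command above can be wrapped in
a loop. This is a sketch rather than a kraken feature: the function name
``ocr_dir``, the ``KRAKEN`` override variable, and the ``scans`` directory in
the usage note are illustrative assumptions, and the model file is expected to
be available as in the previous command.

.. code-block:: shell

    # Hypothetical helper: OCR every .tif file in a directory, writing the
    # recognized text of each page next to it as <name>.txt.
    # KRAKEN may be set to override the executable (e.g. for a dry run).
    ocr_dir() {
        dir=$1
        model=${2:-catmus-print-fondue-large.mlmodel}
        for img in "$dir"/*.tif; do
            # Skip the unexpanded pattern when the directory has no .tif files.
            [ -e "$img" ] || continue
            "${KRAKEN:-kraken}" -i "$img" "${img%.tif}.txt" segment -bl ocr -m "$model"
        done
    }

``ocr_dir scans`` runs the recognition on each image in turn, while
``KRAKEN=echo ocr_dir scans`` only prints the arguments that would be passed to
kraken.
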
All subcommands and options are documented. Use the ``help`` option to get more
information.

Have a look at the `docs <https://kraken.re>`_.

These days kraken is quite closely linked to the `eScriptorium
<https://gitlab.com/scripta/escriptorium/>`_ project developed in the same
eScripta research group. eScriptorium provides a user-friendly interface for
annotating data, training models, and inference (but also much more). There is
a `gitter channel <https://gitter.im/escripta/escriptorium>`_ that is mostly
intended for coordinating technical development but is also a spot to find
people with experience in applying kraken to a wide variety of material.

kraken is developed at the `École Pratique des Hautes Études
<https://www.ephe.psl.eu>`_, `Université PSL <https://www.psl.eu>`_.

.. container:: twocol

   .. container::

        .. image:: https://raw.githubusercontent.com/mittagessen/kraken/main/docs/_static/normal-reproduction-low-resolution.jpg
          :width: 100
          :alt: Co-financed by the European Union

   .. container::

        This project was partially funded through the RESILIENCE project, funded from
        the European Union’s Horizon 2020 Framework Programme for Research and
        Innovation.

.. container:: twocol

   .. container::

        .. image:: https://projet.biblissima.fr/sites/default/files/2021-11/biblissima-baseline-sombre-ia.png
          :width: 400
          :alt: Received funding from the Programme d’investissements d’Avenir

   .. container::

        This work benefited from State aid managed by the Agence Nationale de la
        Recherche under the Programme d’Investissements d’Avenir, reference
        ANR-21-ESRE-0005 (Biblissima+).