tuub / kitodo-mediaserver

Kitodo Mediaserver
GNU General Public License v3.0

Convert ABBYY OCR file format to ALTO #220

Closed: ponchofiesta closed this issue 5 years ago

ponchofiesta commented 5 years ago

Kitodo.Presentation doesn't support the old ABBYY FineReader OCR file format, so we need an action that converts these files to ALTO.

ponchofiesta commented 5 years ago

This adds the IOcrConverter interface and an AbbyyToAltoOcrConverter implementation that uses the converter package from https://github.com/Mewel/abbyy-to-alto. The Maven repo described there didn't work for me, so I forked the project into our namespace and used JitPack.io to build a package from the Git repo.

The converter detects the source file format with a simple string check: "<alto " means ALTO format, "<document " means FineReader format. It skips files that are already in ALTO format and drops files in unsupported formats.
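The check described above can be sketched roughly as follows. This is only an illustration, not the code from this change: the class name OcrFormatSniffer, the OcrFormat enum and the 4096-character read limit are assumptions made for the example.

```java
import java.io.IOException;
import java.io.Reader;

/** Hypothetical helper for illustration; not the project's AbbyyToAltoOcrConverter. */
final class OcrFormatSniffer {

    enum OcrFormat { ALTO, ABBYY_FINEREADER, UNSUPPORTED }

    // Assumption: the root element starts within the first few KB of the file.
    private static final int HEAD_CHARS = 4096;

    private OcrFormatSniffer() {
    }

    /** Reads at most HEAD_CHARS characters from the source and classifies them. */
    static OcrFormat detect(Reader source) throws IOException {
        char[] buf = new char[HEAD_CHARS];
        int total = 0;
        int n;
        while (total < buf.length
                && (n = source.read(buf, total, buf.length - total)) != -1) {
            total += n;
        }
        return detect(new String(buf, 0, total));
    }

    /** Plain substring check on the start of the file, as described above. */
    static OcrFormat detect(String head) {
        if (head.contains("<alto ")) {
            return OcrFormat.ALTO;               // already ALTO: skip
        }
        if (head.contains("<document ")) {
            return OcrFormat.ABBYY_FINEREADER;   // needs conversion
        }
        return OcrFormat.UNSUPPORTED;            // neither marker found: drop
    }
}
```

An action built on this would leave ALTO files untouched, pass ABBYY_FINEREADER files to the converter, and drop UNSUPPORTED ones. A substring test is cheap but not a real XML parse, so a file that happens to contain one of the markers in a comment or text node would be misclassified.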

You can add this action to the importer like this:

importer:
  actionsBeforeIndexing:
    - abbyyToAltoOcrConvertAction
    # Optionally followed by actions that use the converted ALTO files.
    - addFullPdfToMetsAction
    - preproduceFullPdfFileConvertAction