Closed ponchofiesta closed 5 years ago
This adds the IOcrConverter interface and an AbbyyToAltoOcrConverter implementation that uses the converter package from https://github.com/Mewel/abbyy-to-alto. The Maven repo described there didn't work for me, so I forked the project to our namespace and used JitPack to build a package from the Git repo.
The converter determines the source file format with a simple string check ("<alto " = ALTO format, "<document " = ABBYY Finereader format). It skips files that are already in ALTO format and drops files in unsupported formats.
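The format check described above could look roughly like the following. This is only a minimal sketch, not the code from this PR: the class name `OcrFormatSniffer`, the `Format` enum and the `detect` methods are hypothetical names chosen for illustration, and it assumes the OCR files are UTF-8 text small enough to read into memory.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of this PR: classifies an OCR file by
// looking for the root element markers mentioned in the description.
final class OcrFormatSniffer {

    enum Format { ALTO, ABBYY_FINEREADER, UNSUPPORTED }

    // Classify already-loaded file content.
    static Format detect(String content) {
        if (content.contains("<alto ")) {
            return Format.ALTO;              // already in the target format
        }
        if (content.contains("<document ")) {
            return Format.ABBYY_FINEREADER;  // needs conversion
        }
        return Format.UNSUPPORTED;           // neither marker found
    }

    // Convenience overload: read the file as UTF-8 (an assumption) and classify it.
    static Format detect(Path file) throws IOException {
        String content = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
        return detect(content);
    }
}
```

With a check like this, the converter would convert only files classified as `ABBYY_FINEREADER`, leave `ALTO` files untouched, and drop `UNSUPPORTED` ones.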
You can add this action to the importer like this:
importer:
  actionsBeforeIndexing:
    - abbyyToAltoOcrConvertAction
    # Maybe followed by actions using these ALTO files.
    - addFullPdfToMetsAction
    - preproduceFullPdfFileConvertAction
As Kitodo.Presentation doesn't support the old ABBYY Finereader OCR file format, we need an action to convert these files.