Wrong characters / difference between extraction and display

kermitt2 / pdfalto

PDF to XML ALTO file converter

GNU General Public License v2.0

Hi @keto33 !

The goal of pdfalto is to extract and normalize typescript documents, more precisely their text layer and layout information. It does not perform OCR. So if the text in the PDF's text layer is awkward (due to bad OCR), that is the text pdfalto will extract.

If the PDF contains only images or a badly OCRed text layer, the idea is to run OCR (or re-OCR the document) before applying pdfalto, e.g. in a user pipeline that selects the appropriate OCR.
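As a rough illustration of such a user pipeline, here is a minimal sketch. It assumes OCRmyPDF as the OCR step (my choice for the example, any OCR tool that writes a new text layer would do) and that both `ocrmypdf` and `pdfalto` are on the `PATH`. The function names `build_commands` and `run_pipeline` and the output file naming are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path


def build_commands(pdf, workdir, ocr_tool="ocrmypdf", pdfalto_bin="pdfalto"):
    """Return the two command lines: re-OCR first, then pdfalto."""
    pdf = Path(pdf)
    workdir = Path(workdir)
    ocr_pdf = workdir / f"{pdf.stem}.ocr.pdf"   # hypothetical naming
    alto_xml = workdir / f"{pdf.stem}.xml"      # hypothetical naming
    # Assumption: OCRmyPDF's --force-ocr rasterizes the pages and replaces
    # any existing (bad) text layer with a fresh one.
    ocr_cmd = [ocr_tool, "--force-ocr", str(pdf), str(ocr_pdf)]
    # pdfalto is called as: pdfalto <PDF-file> <xml-file>
    alto_cmd = [pdfalto_bin, str(ocr_pdf), str(alto_xml)]
    return ocr_cmd, alto_cmd


def run_pipeline(pdf, workdir):
    """Run both steps and return the path of the ALTO XML file."""
    ocr_cmd, alto_cmd = build_commands(pdf, workdir)
    for cmd in (ocr_cmd, alto_cmd):
        if shutil.which(cmd[0]) is None:
            raise FileNotFoundError(f"{cmd[0]} not found on PATH")
    Path(workdir).mkdir(parents=True, exist_ok=True)
    for cmd in (ocr_cmd, alto_cmd):
        subprocess.run(cmd, check=True)  # raises if a step fails
    return Path(alto_cmd[-1])
```

Calling `run_pipeline("scan.pdf", "out")` would then extract from the freshly OCRed copy rather than from the original text layer. Keeping the OCR step outside pdfalto means the OCR engine and its language settings can be swapped without touching the extraction step.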

The only case where I am considering OCR in pdfalto is resolving UTF codes for loaded fonts and for special characters where we only have the glyphs (bitmaps) of the characters, so a very restricted and targeted use of a custom OCR (no progress on this for a few years, however :D ).

Wrong characters / difference between extraction and display #160