nguyenq / tess4j

Java JNA wrapper for Tesseract OCR API
Apache License 2.0

Add feature to merge an hOCR file into an existing PDF file #255

Closed maneau closed 9 months ago

maneau commented 9 months ago

The main objective is to keep the PDF small. Tesseract regenerates the PDF from 300 DPI page images, so the output tends to be much larger than the source. In addition, the original PDF is discarded once it has been OCRed, so any original PDF properties or formatting are lost too. To avoid both problems, we parse the hOCR file and go through the pages of the original PDF, writing the recognized words as invisible text.

The method mergeHocrIntoAPdf is added to PdfBoxUtilities.

PdfBoxUtilities.mergeHocrIntoAPdf(outputbase1 + ".hocr", pdfFilename, outputbase2, false);
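To illustrate the parsing half of the merge described above, here is a minimal sketch using only the JDK. It is not the code from this PR: the class `HocrWordBoxes`, its `Word` type, and the `toPdfRect` helper are hypothetical names, and the sketch assumes Tesseract-style hOCR where each word is a `span` with class `ocrx_word` and a `title` starting with `bbox x0 y0 x1 y1`. It extracts the word boxes and converts them from image pixels (top-left origin, 300 DPI as mentioned above) to PDF points (bottom-left origin, 72 per inch).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative helper (hypothetical, not part of tess4j). */
final class HocrWordBoxes {

    /** One recognized word with its hOCR bounding box in image pixels (origin top-left). */
    static final class Word {
        final String text;
        final int x0, y0, x1, y1;

        Word(String text, int x0, int y0, int x1, int y1) {
            this.text = text;
            this.x0 = x0; this.y0 = y0; this.x1 = x1; this.y1 = y1;
        }
    }

    // Assumed layout: <span class='ocrx_word' ... title='bbox x0 y0 x1 y1; ...'>text</span>
    private static final Pattern WORD = Pattern.compile(
        "<span[^>]*class=['\"]ocrx_word['\"][^>]*title=['\"]bbox (\\d+) (\\d+) (\\d+) (\\d+)[^>]*>(.*?)</span>",
        Pattern.DOTALL);

    /** Extracts all word boxes from an hOCR string; HTML entities are left undecoded. */
    static List<Word> parse(String hocr) {
        List<Word> words = new ArrayList<>();
        Matcher m = WORD.matcher(hocr);
        while (m.find()) {
            // Drop inline markup such as <strong> or <em> inside the word span.
            String text = m.group(5).replaceAll("<[^>]+>", "").trim();
            words.add(new Word(text,
                Integer.parseInt(m.group(1)), Integer.parseInt(m.group(2)),
                Integer.parseInt(m.group(3)), Integer.parseInt(m.group(4))));
        }
        return words;
    }

    /** Converts a pixel box to PDF user space: {x, y, width, height} in points, origin bottom-left. */
    static double[] toPdfRect(Word w, double dpi, double pageHeightPt) {
        double scale = 72.0 / dpi;               // pixels to points
        double x = w.x0 * scale;
        double y = pageHeightPt - w.y1 * scale;  // flip the y axis; bottom edge of the box
        return new double[] { x, y, (w.x1 - w.x0) * scale, (w.y1 - w.y0) * scale };
    }

    public static void main(String[] args) {
        String hocr = "<span class='ocrx_word' id='word_1_1' "
            + "title='bbox 300 300 600 360; x_wconf 96'>Hello</span>";
        for (Word w : parse(hocr)) {
            double[] r = toPdfRect(w, 300.0, 792.0); // 792 pt = US Letter height, an assumption
            System.out.printf("%s at x=%.1f y=%.1f w=%.1f h=%.1f%n", w.text, r[0], r[1], r[2], r[3]);
        }
    }
}
```

With rectangles like these, the writing half could then place each word on the matching page of the original PDF. In PDFBox, one way to make text invisible is the `RenderingMode.NEITHER` text rendering mode on `PDPageContentStream`; whether mergeHocrIntoAPdf does exactly this is not stated in this thread.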

nguyenq commented 9 months ago

@maneau

I see some warnings when running the unit tests. Are they a cause for concern?

Dec 03, 2023 8:22:43 AM org.apache.fontbox.ttf.GlyphSubstitutionTable readLookupTable
SEVERE: The expected SubstFormat for ExtensionSubstFormat1 subtable is 6 but should be 1

maneau commented 9 months ago

It seems to be a PDFBox regression in version 3.0. It doesn't appear in 2.0. https://issues.apache.org/jira/browse/PDFBOX-5689