Closed (maneau closed 9 months ago)
@maneau
I see some warnings when running the unit tests. Are they a cause for concern?
Dec 03, 2023 8:22:43 AM org.apache.fontbox.ttf.GlyphSubstitutionTable readLookupTable SEVERE: The expected SubstFormat for ExtensionSubstFormat1 subtable is 6 but should be 1
It seems to be a PDFBox regression in version 3.0; it does not appear in 2.0. See https://issues.apache.org/jira/browse/PDFBOX-5689
The main objective is to keep the PDF small. Tesseract regenerates the PDF from 300 DPI page images, and the original PDF is discarded once it has been OCRed, so any properties or formatting of the original are lost too. To avoid this, we parse the hOCR file and go through the pages of the original PDF, writing the recognized words onto them as invisible text.
The method mergeHocrIntoAPdf is added to PdfBoxUtilities.
PdfBoxUtilities.mergeHocrIntoAPdf(outputbase1 + ".hocr", pdfFilename, outputbase2, false);
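For readers who want a feel for the coordinate handling behind a merge like this, here is a minimal sketch. It is not the code in this change. The names `HocrWords`, `Word`, and `parse` are hypothetical. It assumes Tesseract-style hOCR in which each word is a `span` with `class='ocrx_word'` followed by a `title` that starts with `bbox x0 y0 x1 y1` in pixels, that the pixel origin is the top-left corner of the page image, and that the caller knows the DPI and the page height in points. It uses only the Java standard library and does not touch PDFBox.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: extracts word boxes from hOCR and maps them to PDF points. */
final class HocrWords {

    /** One recognized word, positioned in PDF user space (points, origin bottom-left). */
    static final class Word {
        final String text;
        final float x, y, width, height;

        Word(String text, float x, float y, float width, float height) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.width = width;
            this.height = height;
        }
    }

    // Assumed input shape (class attribute before title attribute):
    // <span class='ocrx_word' id='word_1_1' title='bbox 300 150 600 210; x_wconf 95'>Hello</span>
    private static final Pattern WORD = Pattern.compile(
        "<span[^>]*class=['\"]ocrx_word['\"][^>]*title=['\"]bbox (\\d+) (\\d+) (\\d+) (\\d+)"
            + "[^'\"]*['\"][^>]*>(.*?)</span>",
        Pattern.DOTALL);

    /**
     * @param hocr         hOCR markup for a single page
     * @param dpi          resolution of the image that was OCRed (for example 300)
     * @param pageHeightPt height of the target PDF page in points
     */
    static List<Word> parse(String hocr, float dpi, float pageHeightPt) {
        float scale = 72f / dpi; // 72 points per inch
        List<Word> words = new ArrayList<>();
        Matcher m = WORD.matcher(hocr);
        while (m.find()) {
            int x0 = Integer.parseInt(m.group(1));
            int y0 = Integer.parseInt(m.group(2));
            int x1 = Integer.parseInt(m.group(3));
            int y1 = Integer.parseInt(m.group(4));
            // Drop nested markup such as <strong> or <em>; entities are left undecoded.
            String text = m.group(5).replaceAll("<[^>]+>", "").trim();
            if (text.isEmpty()) {
                continue;
            }
            // hOCR y grows downward, PDF y grows upward, so flip using the box bottom (y1).
            float x = x0 * scale;
            float y = pageHeightPt - y1 * scale;
            words.add(new Word(text, x, y, (x1 - x0) * scale, (y1 - y0) * scale));
        }
        return words;
    }
}
```

Each `Word` could then be drawn at `(x, y)` on the matching page. In PDFBox, invisible text is typically produced by setting the text rendering mode to `RenderingMode.NEITHER` on the `PDPageContentStream` before showing the text. Choosing a font size so that the text fills `width` is left out of this sketch.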