
manisandro / gImageReader

A Gtk/Qt front-end to tesseract-ocr.

GNU General Public License v3.0


Export issues #599

Open HunterZ opened 2 years ago

HunterZ commented 2 years ago

Running into a number of issues trying to export the results of painstakingly fine-tuning the hOCR for a PDF.

First, attempting to export directly to PDF from gImageReader-gtk 3.3.1 under Debian, or from gImageReader-qt latest CI under Windows with the PoDoFo backend, results in the following error:

I suspect this is because I am using custom OTF fonts that are installed in each OS.

Second, attempting to export from gImageReader-qt latest CI under Windows with the QPrinter backend results in the text getting chopped up and duplicated in weird ways. Compare the gImageReader hOCR tree for my first page with the object list from the exported PDF:

Third, exporting to ODT from gImageReader-gtk 3.3.1 under Debian (not tested under Windows) results in a couple of issues:

Text gets line wrapped if the OCR text doesn't fit perfectly within the defined bounding boxes
Individual line alignment gets lost when multiple lines are grouped under a paragraph in the hOCR tree
Edit: Everything also seems shifted down (even with a baseline of 0 0), although I can't prove whether this is gImageReader's or LibreOffice's fault:

As things currently stand, I don't see any way to get a viable PDF out of gImageReader, even indirectly via ODT->PDF, because every export method fails outright, produces garbled output, and/or discards aspects of my painstakingly hand-aligned custom font text.

MicahBird commented 2 years ago

I'm also experiencing this issue on Fedora, but this line in your issue is key:

I suspect this is because I am using custom OTF fonts that are installed in each OS.

Unfortunately it seems that exporting with custom fonts is finicky: whenever I try to export with the Sans font family, it gives the same error, "The PDF export failed: ePdfError_UnsupportedFontFormat".

However, when exporting with Arial or any font in the Liberation font family, it works! Hope this helps :)

HunterZ commented 1 year ago

I have some more information to share:

First, I tried converting all the OTF fonts I'm using to TTF (via FontForge, then a Python otf2ttf script) and replacing them in my OS. Unfortunately this didn't fix it, but I was able to narrow things down to two font families.

On a hunch, I used sed to change one of the font names in XML from that of the font family to that of one of the specific weight variants (medium/semibold/bold) - and it worked.
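For anyone who wants to try the same substitution without sed, here is a minimal Python sketch. It assumes the font is stored in the hOCR `title` attribute as an `x_font <name>` property; that layout, the function name `replace_hocr_font`, and the font names `XYZ` and `XYZ Semibold` are illustrative assumptions, so check your own file before relying on it.

```python
import re


def replace_hocr_font(hocr_text: str, old_name: str, new_name: str) -> str:
    """Swap one font name for another in `x_font` properties.

    Assumes properties look like `x_font <name>` and end with `;`
    or with the closing quote of the title attribute.
    """
    pattern = re.compile(
        r"(x_font\s+)" + re.escape(old_name) + r"(?=\s*(?:;|\"|'))"
    )
    # A function replacement avoids backslash handling in new_name.
    return pattern.sub(lambda m: m.group(1) + new_name, hocr_text)


if __name__ == "__main__":
    # Hypothetical hOCR fragment, for illustration only.
    sample = (
        '<span class="ocrx_word" '
        'title="bbox 10 10 50 30; x_font XYZ; x_fsize 12">Hi</span>'
    )
    print(replace_hocr_font(sample, "XYZ", "XYZ Semibold"))
```

The lookahead keeps the match from touching longer names that merely start with the old name, such as a hypothetical `XYZ Other`.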

The problem with this workaround is that gImageReader only lets you pick a font family from its GUI, and not a weight variant. Both of these font families have 6 variants: medium/semibold/bold weights, each with regular and italic slant variants.

gImageReader was able to work out the italic variant when I picked a specific weight via XML, but this means that I'll probably have to specify the bold weight via XML hacking whenever I want bold, or the regular weight when I want non-bold.

...or maybe I can use FontForge to rearrange the font family naming to a taxonomy that is hopefully better supported by gImageReader?

HunterZ commented 1 year ago

Another update:

I was able to solve it by using FontForge to rename the medium variants' PostScript Names as follows:

XYZ-Medium => XYZ
XYZ-MediumItalic => XYZ-Italic

Once I did this, exported, and reinstalled the fonts, gImageReader was able to use the family name to derive regular, italic, bold, and bold+italic variants via its own flags.
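The renaming rule above can be written down as a small Python sketch. It only computes the new PostScript names; the actual rename in this thread was done in FontForge. The function name `rebase_ps_name` and the `base_weight` parameter are hypothetical, and `XYZ` is the same placeholder family name used above.

```python
def rebase_ps_name(ps_name: str, base_weight: str = "Medium") -> str:
    """Drop the base weight from a PostScript name.

    XYZ-Medium       -> XYZ
    XYZ-MediumItalic -> XYZ-Italic
    Any other name is returned unchanged.
    """
    family, sep, style = ps_name.rpartition("-")
    if not sep or not style.startswith(base_weight):
        return ps_name
    rest = style[len(base_weight):]
    if rest not in ("", "Italic"):
        # e.g. a style that merely begins with the base weight's letters
        return ps_name
    return f"{family}-{rest}" if rest else family


if __name__ == "__main__":
    for name in ("XYZ-Medium", "XYZ-MediumItalic", "XYZ-Bold"):
        print(name, "=>", rebase_ps_name(name))
```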

The takeaway here is that gImageReader apparently only supports fonts that have a variant whose PS Name has no dashed suffix, which it then uses to derive the corresponding -Italic, -Bold, and -BoldItalic variant names. A font whose "base" variant is -Medium and base italic variant is -MediumItalic just doesn't work.
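To make that takeaway concrete, here is a toy Python model of the lookup as inferred above. It is not gImageReader's or PoDoFo's actual code; `derive_ps_name`, `resolve`, and the sets of installed names are all hypothetical and only illustrate why a family whose base variant is `-Medium` would fail to resolve.

```python
from typing import Optional, Set


def derive_ps_name(family: str, bold: bool = False, italic: bool = False) -> str:
    """Build the PostScript name from a family name plus bold/italic flags."""
    suffix = ("Bold" if bold else "") + ("Italic" if italic else "")
    return f"{family}-{suffix}" if suffix else family


def resolve(family: str, installed: Set[str],
            bold: bool = False, italic: bool = False) -> Optional[str]:
    """Return the derived name if such a font is installed, else None."""
    name = derive_ps_name(family, bold, italic)
    return name if name in installed else None


if __name__ == "__main__":
    # Hypothetical installed PostScript names before and after the rename.
    before = {"XYZ-Medium", "XYZ-MediumItalic", "XYZ-Bold", "XYZ-BoldItalic"}
    after = {"XYZ", "XYZ-Italic", "XYZ-Bold", "XYZ-BoldItalic"}
    print(resolve("XYZ", before))  # no font is named plain "XYZ"
    print(resolve("XYZ", after))
```

Under this model the regular and italic lookups miss before the rename, because no installed font carries the bare family name or the plain `-Italic` suffix.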

manisandro commented 1 year ago

I suspect this is a limitation in PoDoFo.