PDF conversion does not extract images properly

myrmoteras commented 6 years ago

In this book, the conversion does not work, even though the image can be selected and copied from the PDF (right).

Struwe_50MajorTempPlantFamilies2017.pdf

gsautter commented 6 years ago

Well, the conversion proper does work, and beautifully so ... the images in question get decoded and show up in the page images just as they are supposed to, and the respective full-resolution supplements are there as well.

The problem is that fancy yellow angled line around the top left corner of each and every page. While a nice layout gimmick, its bounding rectangle overlaps with the images. This prompts layout analysis to mistake the combination of the two for one larger figure that combines bitmap and vector graphics, which ends up marked as a graphics region rather than an image region. To make things worse, said graphics region also turns any words it happens to span into a label-typed text stream (easily recognized by the yellow rather than gray word bounding boxes in the GGI screenshot).
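To illustrate the overlap described above, here is a minimal sketch in Java. It is not the actual GoldenGATE Imagine layout analysis code; the class name, method names, and coordinates are made up for illustration. It only shows that an L-shaped line built from two thin strokes has an overall bounding box that intersects an image, even though neither stroke touches that image.

```java
import java.awt.geom.Rectangle2D;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration, not part of GoldenGATE Imagine.
class BoundingBoxOverlapSketch {

    // Test the image against the union of all stroke bounds
    // (the coarse check that leads to the false merge).
    static boolean boundsOverlap(List<Rectangle2D> strokes, Rectangle2D image) {
        Rectangle2D bounds = null;
        for (Rectangle2D stroke : strokes)
            bounds = (bounds == null) ? (Rectangle2D) stroke.clone() : bounds.createUnion(stroke);
        return (bounds != null) && bounds.intersects(image);
    }

    // Test the image against each stroke individually (the finer check).
    static boolean anyStrokeOverlaps(List<Rectangle2D> strokes, Rectangle2D image) {
        for (Rectangle2D stroke : strokes)
            if (stroke.intersects(image))
                return true;
        return false;
    }

    public static void main(String[] args) {
        // Angled line around the top left corner: one horizontal, one vertical stroke.
        List<Rectangle2D> angledLine = Arrays.<Rectangle2D>asList(
            new Rectangle2D.Double(20, 20, 400, 2),
            new Rectangle2D.Double(20, 20, 2, 300));
        // An image sitting inside the corner, away from both strokes.
        Rectangle2D image = new Rectangle2D.Double(100, 100, 200, 150);

        System.out.println("bounding box overlaps image: " + boundsOverlap(angledLine, image));
        System.out.println("any stroke overlaps image: " + anyStrokeOverlaps(angledLine, image));
    }
}
```

With these made-up coordinates, the coarse check reports an overlap while the per-stroke check does not, which is the kind of situation that leads to the merged graphics region.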

gsautter commented 6 years ago

You can easily take this apart and salvage the situation like this:

1. Remove the graphics region.
2. Drag a box around the text and use "Mark Block".
3. Link the text back into the main document text stream.
4. Drag a box around the actual image and use "Mark Figure".

gsautter commented 6 years ago

Note to myself: to prevent such layout analysis errors, (a) refine page layout artwork detection (recently introduced to prevent page content flipping from wreaking havoc on graphics) to take into account not only the bounding box of graphics, but also the very rendering command sequence (to increase precision) if the bounding box exceeds the dimensions of plain horizontal lines or bars (use a TreeMap with a size-adaptive Comparator to implement this), (b) exclude page layout graphics from the formation of path clusters (keep them separate), and (c) also exclude page layout artwork graphics from the detection of coherent graphics objects.
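One possible reading of point (a), sketched in Java. This is a guess at the idea, not the implementation that went into GoldenGATE Imagine. The `PageGraphics` class, the thickness threshold, and the assumption that layout artwork is recognized by recurring on several pages are all hypothetical. The Comparator is "size-adaptive" in the sense that it compares bounding boxes only for thin lines and bars, and additionally compares the rendering command sequence for anything larger.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical sketch, not the actual GoldenGATE Imagine code.
class LayoutArtworkSketch {

    // Hypothetical stand-in for one vector graphics object on one page.
    static final class PageGraphics {
        final int page, x, y, width, height;
        final String commands; // rendering command sequence, e.g. "M L L"
        PageGraphics(int page, int x, int y, int width, int height, String commands) {
            this.page = page; this.x = x; this.y = y;
            this.width = width; this.height = height; this.commands = commands;
        }
    }

    // Assumed threshold below which a graphics object counts as a plain line or bar.
    static final int MAX_LINE_THICKNESS = 3;

    static boolean isPlainLineOrBar(PageGraphics g) {
        return Math.min(g.width, g.height) <= MAX_LINE_THICKNESS;
    }

    // Size-adaptive comparison: bounding box first, and the rendering
    // command sequence only if the box is larger than a plain line or bar.
    static int compareAdaptively(PageGraphics a, PageGraphics b) {
        int c = Integer.compare(a.x, b.x);
        if (c == 0) c = Integer.compare(a.y, b.y);
        if (c == 0) c = Integer.compare(a.width, b.width);
        if (c == 0) c = Integer.compare(a.height, b.height);
        if (c != 0 || isPlainLineOrBar(a)) return c;
        return a.commands.compareTo(b.commands);
    }

    // Treat graphics as page layout artwork if equal ones occur on at least minPages pages.
    static List<PageGraphics> findLayoutArtwork(List<PageGraphics> all, int minPages) {
        TreeMap<PageGraphics, TreeSet<Integer>> pagesByGraphics =
            new TreeMap<>(LayoutArtworkSketch::compareAdaptively);
        for (PageGraphics g : all)
            pagesByGraphics.computeIfAbsent(g, k -> new TreeSet<>()).add(g.page);
        List<PageGraphics> artwork = new ArrayList<>();
        for (Map.Entry<PageGraphics, TreeSet<Integer>> e : pagesByGraphics.entrySet())
            if (e.getValue().size() >= minPages)
                artwork.add(e.getKey());
        return artwork;
    }
}
```

Because the thin-or-not decision depends only on width and height, which are already equal by the time it is consulted, the comparison stays consistent enough for use as a TreeMap ordering. A figure that merely shares its bounding box with the angled line, but is drawn with different commands, ends up under a separate key and is not mistaken for layout artwork.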

gsautter / goldengate-imagine

PDF conversion does not extract images properly #426