Open myrmoteras opened 6 years ago
Well, conversion proper does work, and beautifully so ... the images in question get decoded and show up in the page images just as they are supposed to, and the respective full-resolution supplements are there as well.
The problem is that fancy yellow angled line around the top left corner of each and every page. While a nice layout gimmick, its bounding rectangle overlaps with the images, which prompts layout analysis to mistake the combination of the two for a larger combined bitmap and vector graphics figure, which ends up marked as a graphics rather than an image region. To make things worse, said graphics region also turns any words it happens to span into a label typed text stream (easily recognized from the yellow rather than gray word bounding boxes in the GGI screenshot).
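To illustrate the overlap described above, here is a minimal Java sketch using `java.awt.geom`. It is not the converter's actual code: the class name, the method names, and all coordinates are made up for illustration. It models the angled line as two segments meeting at the top left corner and shows that its bounding rectangle overlaps an image even though neither segment touches the image.

```java
import java.awt.geom.Line2D;
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

public class BoundingBoxOverlapDemo {

    // Hypothetical corner artwork: one segment along the top, one down the left edge.
    static List<Line2D> cornerLine() {
        List<Line2D> segments = new ArrayList<>();
        segments.add(new Line2D.Double(20, 20, 300, 20)); // horizontal part
        segments.add(new Line2D.Double(20, 20, 20, 400)); // vertical part
        return segments;
    }

    // Union of the bounding boxes of all segments (expects a non-empty list).
    static Rectangle2D boundsOf(List<Line2D> segments) {
        Rectangle2D bounds = segments.get(0).getBounds2D();
        for (Line2D segment : segments) {
            bounds = bounds.createUnion(segment.getBounds2D());
        }
        return bounds;
    }

    // True if at least one segment actually crosses or enters the rectangle.
    static boolean anySegmentTouches(List<Line2D> segments, Rectangle2D rect) {
        for (Line2D segment : segments) {
            if (segment.intersects(rect)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<Line2D> corner = cornerLine();
        // Hypothetical image region well inside the page.
        Rectangle2D image = new Rectangle2D.Double(100, 100, 150, 150);

        // Coarse test: the bounding rectangle of the artwork overlaps the image.
        System.out.println("bounding boxes overlap: " + boundsOf(corner).intersects(image));
        // Finer test: none of the drawn segments comes anywhere near the image.
        System.out.println("any segment touches image: " + anySegmentTouches(corner, image));
    }
}
```

With these sample coordinates the coarse test reports an overlap while the segment-level test does not, which is the kind of false positive a bounding-box-only check produces for L-shaped artwork.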
You can easily take this apart and salvage the situation like this:
graphics region
Note to myself: to prevent such layout analysis errors,
(a) refine page layout artwork detection (recently introduced to prevent page content flipping from wreaking havoc on graphics) to not only take into account the bounding box of graphics, but the very rendering command sequence (to increase precision) if the bounding box exceeds the dimensions of plain horizontal lines or bars (use TreeMap with size adaptive Comparator to implement this),
(b) exclude page layout graphics from formation of path clusters (keep them separate), and
(c) also exclude page layout artwork graphics from detection of coherent graphics objects.
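As a rough sketch of what (b) and (c) could amount to, the Java example below clusters page objects by bounding box overlap and optionally skips anything flagged as page layout artwork. Everything in it is an assumption made for illustration: the `Item` class, the `layoutArtwork` flag, the greedy clustering, and the coordinates are hypothetical and not taken from the actual layout analysis code. It also does not attempt the TreeMap and Comparator approach from (a).

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class ArtworkExclusionDemo {

    // Hypothetical page object: a name, a bounding box, and an artwork flag.
    static class Item {
        final String name;
        final Rectangle2D bounds;
        final boolean layoutArtwork;

        Item(String name, Rectangle2D bounds, boolean layoutArtwork) {
            this.name = name;
            this.bounds = bounds;
            this.layoutArtwork = layoutArtwork;
        }
    }

    // True if the item's bounding box overlaps that of any cluster member.
    static boolean overlapsAny(Item item, List<Item> cluster) {
        for (Item member : cluster) {
            if (item.bounds.intersects(member.bounds)) {
                return true;
            }
        }
        return false;
    }

    // Groups items whose bounding boxes overlap, directly or via other items.
    static List<List<Item>> cluster(List<Item> items, boolean excludeLayoutArtwork) {
        List<List<Item>> clusters = new ArrayList<>();
        for (Item item : items) {
            if (excludeLayoutArtwork && item.layoutArtwork) {
                continue; // keep page layout artwork out of cluster formation
            }
            List<Item> merged = new ArrayList<>();
            merged.add(item);
            // Absorb every existing cluster that the new item overlaps.
            Iterator<List<Item>> it = clusters.iterator();
            while (it.hasNext()) {
                List<Item> existing = it.next();
                if (overlapsAny(item, existing)) {
                    merged.addAll(existing);
                    it.remove();
                }
            }
            clusters.add(merged);
        }
        return clusters;
    }

    // Sample page: corner artwork whose bounding box spans two separate images.
    static List<Item> samplePage() {
        List<Item> items = new ArrayList<>();
        items.add(new Item("corner line", new Rectangle2D.Double(20, 20, 280, 380), true));
        items.add(new Item("image A", new Rectangle2D.Double(60, 60, 100, 100), false));
        items.add(new Item("image B", new Rectangle2D.Double(180, 200, 100, 100), false));
        return items;
    }

    public static void main(String[] args) {
        System.out.println("clusters with artwork: " + cluster(samplePage(), false).size());
        System.out.println("clusters without artwork: " + cluster(samplePage(), true).size());
    }
}
```

In this sample, including the artwork fuses both images into a single cluster, while excluding it leaves each image in its own cluster.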
In this book, the conversion does not work, even though the image can be selected and copied from the PDF (right).
Struwe_50MajorTempPlantFamilies2017.pdf