images with captions as image: how to deal with it?

myrmoteras commented 4 years ago

In this case, the captions are part of the image itself. Is there a chance to run the OCR engine on the image to extract them?

022DD0613F371533FF91A424687DFFB2

gsautter commented 4 years ago

Got it ... the captions are a direct part of the bitmap rather than rendered over the images by the PDF proper. In the latter case they would be IMF words in a "label" text stream and easy to mark as captions, but this scenario is a lot harder. Right now, we don't have any option for running the OCR engine on bitmap images in IMFs derived from born-digital PDFs, but it should be possible (with some effort) to integrate such functionality. The question is how frequent such cases are, as adding a function for a handful of PDFs might not make sense.

myrmoteras commented 4 years ago

This is the first one, but it might be part of a journal that could come up as a collaborative project. I am exploring this right now with my Taiwanese and malacological colleagues. See also https://twitter.com/WRMarineSpecies/status/1214495323554566145?s=20

myrmoteras commented 4 years ago

So, for now, if there is no tool at hand, let's wait and see what we get.

gsautter / goldengate-imagine

images with captions as image: how to deal with it? #837