Open myrmoteras opened 4 years ago
Got it ... the captions are not rendered over the images by the PDF proper, but are a direct part of the bitmap. If they were rendered by the PDF, they would be IMF words in a "label" text stream and easy to mark as a caption, but this scenario is a lot harder. Right now, we don't have any option for running the OCR engine on bitmap images in IMFs derived from born-digital PDFs, but it should be possible (with some effort) to integrate such functionality. The question is how frequent such cases are, as adding a function for a handful of PDFs might not make sense.
This is the first one, but it might be part of a journal that might come up as a collaborative project. I'm trying to explore this right now with my Taiwanese and malacological colleagues. See also https://twitter.com/WRMarineSpecies/status/1214495323554566145?s=20
So, right now, if there is no tool at hand, let's wait and see what we get.
In this case, the captions are part of the image. Is there a chance to run the OCR engine to get them out?
https://zenodo.org/record/3600079
022DD0613F371533FF91A424687DFFB2