How to fix a letter disappearing in the conversion process https://tb.plazi.org/GgServer/summary/F313FF8A4B7EFF90E8627409FF8FFFD0

plazi / GoldenGATE-Imagine

A GUI Tool For Freeing Text and Data from PDF Documents

How to fix a letter disappearing in the conversion process https://tb.plazi.org/GgServer/summary/F313FF8A4B7EFF90E8627409FF8FFFD0 #24

Open myrmoteras opened 2 years ago

myrmoteras commented 2 years ago

https://tb.plazi.org/GgServer/html/0F2A87F24B7CFF92ED9D7599F97CFC59

In the original text, the letter B is present:

B. terrestris (Linnaeus, 1758)

Is there a way to fix this?

gsautter commented 2 years ago

Well, looking at the IMF, this document appears to somehow have been loaded as born-digital (there is no OCR overlay), despite the presence of scans ... on top of that, the double pages should have been split up down the middle to produce individual pages ... will try and see if this is adjustable in some kind of way.

Just out of curiosity, what's the significance of this document?

gsautter commented 2 years ago

Just decoded this one as "Double Pages" to get the pages cut up down the middle ... good news is that the letters are all there, including that "B", bad news is that the scans are so offset to the right that the left pages have their right edges cut off because they are beyond the middle ... bit of a nightmare, this one, will try and find a way to get the lost words back in.

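To illustrate why an offset scan loses text when a double page is cut down the middle, here is a minimal sketch. It assumes each word is given as a hypothetical `(left, right)` horizontal pixel extent; the function name and inputs are made up for illustration and are not GoldenGATE Imagine code.

```python
def split_double_page(page_width, word_boxes):
    """Assign each word to the left or right half of a double-page scan.

    word_boxes: list of (left, right) horizontal pixel extents (hypothetical input).
    Returns (left_words, right_words, straddling). Words in `straddling`
    cross the midline, so a naive cut down the middle would truncate them.
    """
    middle = page_width / 2
    left_words, right_words, straddling = [], [], []
    for left, right in word_boxes:
        if right <= middle:
            left_words.append((left, right))
        elif left >= middle:
            right_words.append((left, right))
        else:
            # Word reaches across the cut line: this is the "cut off" case.
            straddling.append((left, right))
    return left_words, right_words, straddling
```

When the scan is shifted to the right, words at the right edge of the left page end up in `straddling` (or entirely in the right half), which matches the cut-off edges described above.
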
myrmoteras commented 2 years ago

This is all OK, except for the B. I loaded this as a scanned PDF with text. The pages have been perfectly split in two. Are we talking about the same document?

myrmoteras commented 2 years ago

All worked well including QC for the article

gsautter commented 2 years ago

From what it looks like, specifically (and quite pathologically) the italic "B" has a font problem in this one ... further, I tend to think you must have loaded the PDF as "scanned, text digitized" for this issue to arise, and the text having been rendered onto the page images also supports this. However, said "scanned, text digitized" option is for scanned PDFs whose page images have been deconstructed into layers, and the text layer replaced with rendered PDF text, so there is no more text in the scans proper ... and this is not the case here, as we have the text still inside the scans, so the normal "scanned" option with usage of embedded OCR is the decoding mode of choice.

gsautter commented 2 years ago

Anyway, I've inserted the "B"s in the treatment titles now where they were missing ... not yet within the treatment texts, though, as that is yet a good deal more finicky.

myrmoteras commented 2 years ago

In any case, we need to write down how to handle scanned PDFs that have been OCRed in ABBYY and saved as PDF with text.

gsautter commented 2 years ago

Easy enough, here are the four options for PDFs:

"born-digital": born-digital PDFs, e.g. ZooKeys, Zootaxa
"scanned, text digitized": scanned PDFs whose text is removed from the page image and replaced with digitally rendered text (rare, only seen this in a few of the Lestid articles)
"scanned, text vectorized": scanned PDFs whose text is removed from the page image and replaced with a vectorized text layer (DjVu encoding, and rare, only seen this in a few of the Lestid articles)
"scanned": scanned PDFs whose text is still in the page images (as a dedicated layer or not), with or without embedded OCR (the usual case, and what you want to use for ABBYY output)

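The four options above amount to a small decision rule. Here is a minimal sketch, assuming three hypothetical yes/no observations about a PDF; the class, field, and function names are illustrative only and are not part of GoldenGATE Imagine's API.

```python
from dataclasses import dataclass


@dataclass
class PdfTraits:
    # All three fields are hypothetical descriptors chosen for illustration;
    # GoldenGATE Imagine does not expose them under these names.
    has_page_scans: bool         # pages contain scanned images
    text_still_in_scans: bool    # glyphs are still visible in the page images
    replacement_is_vector: bool  # removed text was replaced by a vectorized layer


def choose_decoding_option(traits: PdfTraits) -> str:
    """Return the name of the decoding option as listed in the thread."""
    if not traits.has_page_scans:
        return "born-digital"
    if traits.text_still_in_scans:
        # The usual case, with or without embedded OCR (e.g. ABBYY output).
        return "scanned"
    if traits.replacement_is_vector:
        return "scanned, text vectorized"
    return "scanned, text digitized"
```

For an ABBYY-OCRed scan saved as PDF with text, `choose_decoding_option(PdfTraits(True, True, False))` returns `"scanned"`, in line with the recommendation above.
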
gsautter commented 2 years ago

I uploaded an alternative version now, for comparison: https://tb.plazi.org/GgServer/summary/CD4BFF80FFF7DF7C3927FF906739FF89

No worries, it doesn't go to Zenodo, GBIF, SIB, or anywhere, courtesy of the export blacklisting feature in the transit authority (home of the gatekeeper).