Open myrmoteras opened 2 years ago
Well, looking at the IMF, this document appears to have been loaded as born-digital (there is no OCR overlay), despite the presence of scans ... on top of that, the double pages should have been split down the middle to produce individual pages ... will try and see if this can be adjusted somehow.
Just out of curiosity, what's the significance of this document?
Just decoded this one as "Double Pages" to get the pages cut down the middle ... good news is that the letters are all there, including that "B", bad news is that the scans are offset so far to the right that the left pages have their right edges cut off, because those edges lie beyond the middle ... bit of a nightmare, this one, will try and find a way to get the lost words back in.
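To illustrate why a fixed split down the middle loses words when the scans are offset, here is a minimal sketch of an alternative: look for the emptiest pixel column near the middle and cut there. This is not how the "Double Pages" decoder works internally; the function name `find_split_column`, its `ink_per_column` input (dark-pixel counts per pixel column) and the `search_fraction` parameter are all hypothetical and only serve the illustration.

```python
def find_split_column(ink_per_column, search_fraction=0.2):
    """Pick the column at which to cut a double-page scan in two.

    ink_per_column: dark-pixel count for each pixel column of the scan
                    (hypothetical input, not an actual decoder data structure).
    search_fraction: how far from the geometric middle to look for the gutter,
                     as a fraction of the total width (assumed default).
    """
    width = len(ink_per_column)
    middle = width // 2
    radius = int(width * search_fraction)
    lo = max(0, middle - radius)
    hi = min(width, middle + radius + 1)
    # Choose the emptiest column in the window; on ties, prefer the column
    # closest to the geometric middle so centered scans are cut as before.
    return min(range(lo, hi), key=lambda x: (ink_per_column[x], abs(x - middle)))


# A scan shifted to the right: the blank gutter sits at column 13, not at 10.
offset_scan = [5] * 20
offset_scan[13] = 0
split = find_split_column(offset_scan)
left_page, right_page = offset_scan[:split], offset_scan[split:]
```

With a centered scan the result coincides with the plain middle, so the sketch only changes the outcome for offset scans like this one.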
This is all OK. Just the B. I loaded this as a scanned PDF with text. The pages have been perfectly split in two. Are we talking about the same document?
All worked well, including QC for the article.
From what it looks like, specifically (and quite pathologically) the italic "B" has a font problem in this one ... further, I tend to think you must have loaded the PDF as "scanned, text digitized" for this issue to arise, and the text having been rendered onto the page images also supports this. However, said "scanned, text digitized" option is for scanned PDFs whose page images have been deconstructed into layers, and the text layer replaced with rendered PDF text, so there is no more text in the scans proper ... and this is not the case here, as we have the text still inside the scans, so the normal "scanned" option with usage of embedded OCR is the decoding mode of choice.
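The reasoning above can be written down as a small decision rule. This is only a sketch of the rule of thumb described in this thread, not actual decoder logic: the function `choose_decoding_mode` and its two boolean parameters are hypothetical, and the returned strings simply echo the option names mentioned here.

```python
def choose_decoding_mode(has_page_scans: bool, text_still_in_scans: bool) -> str:
    """Suggest a PDF decoding mode from two observations about the file.

    has_page_scans: the PDF contains scanned page images.
    text_still_in_scans: the text is still visible in the scan images
        themselves (it was not stripped out and replaced by rendered PDF text).
    Both flags are assumptions for illustration; they are judged by a person
    looking at the PDF, not detected automatically here.
    """
    if not has_page_scans:
        # No page images at all: the text lives only in the PDF itself.
        return "born-digital"
    if text_still_in_scans:
        # Ordinary scans with an embedded OCR layer, as in this document.
        return "scanned"
    # Page images were deconstructed into layers and the text layer was
    # replaced by rendered PDF text.
    return "scanned, text digitized"


# The PDF discussed here: scans are present and still contain the text.
mode = choose_decoding_mode(has_page_scans=True, text_still_in_scans=True)
```

For the ABBYY-exported PDF in this issue, the rule lands on "scanned", matching the conclusion above.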
Anyway, I've inserted the "B"s in the treatment titles now where they were missing ... not yet within the treatment texts, though, as that is yet a good deal more finicky.
Whatever, we need to write down how to load scanned PDFs that have been OCRed in ABBYY and saved as PDF with text.
Easy enough, here are the four options for PDFs:
I uploaded an alternative version now, for comparison: https://tb.plazi.org/GgServer/summary/CD4BFF80FFF7DF7C3927FF906739FF89
No worries, it doesn't go to Zenodo, GBIF, SIB, or anywhere, courtesy of the export blacklisting feature in the transit authority (home of the gatekeeper).
https://tb.plazi.org/GgServer/html/0F2A87F24B7CFF92ED9D7599F97CFC59
In the original text, this letter "B" is present.
Is there a way to fix this?