plazi / GoldenGATE-Imagine

A GUI Tool For Freeing Text and Data from PDF Documents

file does not open: rangelLandPlants #18

Open myrmoteras opened 3 years ago

myrmoteras commented 3 years ago

GgImagine.20210617-1032.err.zip

This file does not open at all.

The file (36 MB) is here: https://drive.google.com/file/d/11xjsoiome_cPwMd0H9O4AsSMa910fkry/view?usp=sharing

gsautter commented 3 years ago

Another very strange one ... from what it looks like, it's what I've come to call a "hybrid" PDF: it originates from scanning, but doesn't retain the actual image of the text, only the background, and represents the text exclusively the way a born-digital PDF does (most likely as a means of compression).

Decoding such PDFs requires a yet-to-build hybrid mode that reads them as born-digital (i.e., not expecting words to have a counterpart in the scan), but doesn't turn the text into image labels despite the words sitting on top of a bitmap image.

For now, decoding as born-digital is the way to go with such PDFs, but that's only a general observation about this type of PDF.
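Just to illustrate the idea of that third mode, here is a minimal sketch of how a page-level decision between the existing modes and a "hybrid" one might look; all the names and the heuristic itself are hypothetical and not part of the GoldenGATE-Imagine code base:

```java
// Hypothetical sketch only: how a "hybrid" page could be told apart from
// born-digital and scanned pages. None of these names exist in the decoder.
enum PdfDecodeMode {
    BORN_DIGITAL,   // text objects only, no page-filling scan
    SCANNED,        // page-filling scan, words have a counterpart in the image
    HYBRID          // page-filling background scan, but text stored as glyphs
}

final class DecodeModeGuess {

    /**
     * Guess the decoding mode for a single page.
     *
     * @param hasPageFillingImage true if a bitmap covers (almost) the whole page
     * @param textObjectCount     number of text-drawing objects on the page
     * @param imageContainsText   true if the bitmap itself shows the letters
     */
    static PdfDecodeMode guess(boolean hasPageFillingImage, int textObjectCount,
            boolean imageContainsText) {
        if (!hasPageFillingImage)
            return PdfDecodeMode.BORN_DIGITAL;
        if (textObjectCount == 0 || imageContainsText)
            return PdfDecodeMode.SCANNED;
        // scan retains only the background, text lives in fonts => hybrid
        return PdfDecodeMode.HYBRID;
    }
}
```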

gsautter commented 3 years ago

Decoding this PDF as born-digital (there is even a good bunch of vector graphics, so it's fairly well digitized), the words come up OK when testing with individual pages, albeit with some apparent positioning problems.

The error log you sent did shed some light, though: like any other PDF, this one uses fonts to represent text, but it seems to cram too many characters into a single font, namely over 255, so the decoder simply runs out of bytes to represent the characters with ... this basically means I'll have to find a way of accommodating fonts with 256+ characters without breaking our existing IMFs, and also adjust all the character code handling logic to cope with situations other than 2 hex digits of character code per word character ... will take some thinking, to say the least ...
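One possible way around the 2-hex-digit limit, sketched below purely for illustration (this is not the actual IMF character-code format): keep two hex digits per character for legacy fonts, but switch a whole font to four hex digits per character as soon as any of its codes exceeds 0xFF, and record that width per font.

```java
// Minimal sketch, assuming a per-font fixed hex width: 2 digits for fonts
// with at most 256 characters, 4 digits otherwise. Illustrative only.
final class FontCharCodes {

    final int hexDigitsPerChar; // 2 for legacy fonts, 4 for 256+ character fonts

    FontCharCodes(int maxCharCode) {
        this.hexDigitsPerChar = (maxCharCode <= 0xFF) ? 2 : 4;
    }

    /** Encode the character codes of one word as a single hex string. */
    String encodeWord(int[] charCodes) {
        StringBuilder hex = new StringBuilder();
        for (int code : charCodes)
            hex.append(String.format("%0" + hexDigitsPerChar + "X", code));
        return hex.toString();
    }

    /** Decode a hex string back into character codes. */
    int[] decodeWord(String hex) {
        int[] codes = new int[hex.length() / hexDigitsPerChar];
        for (int c = 0; c < codes.length; c++)
            codes[c] = Integer.parseInt(
                hex.substring(c * hexDigitsPerChar, (c + 1) * hexDigitsPerChar), 16);
        return codes;
    }
}
```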

gsautter commented 3 years ago

The word positioning problems seem to stem from erroneous character width handling ...

gsautter commented 3 years ago

Looking at the fonts more closely, the large number of characters per font seems to be the result of extreme compression, namely of encoding frequent combinations of multiple letters into a single byte ... is there a way of adjusting this in ABBYY, maybe?
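The practical consequence of that kind of compression, sketched below with made-up names and map contents: a single character code no longer resolves to one letter but possibly to a whole letter group, so the decoder needs a code-to-string rather than a code-to-char mapping.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only: a font where one character code may stand for
// several letters at once, as produced by the compression described above.
final class MultiLetterFont {

    private final Map<Integer, String> codeToText = new HashMap<>();

    void mapCode(int charCode, String text) {
        codeToText.put(charCode, text);
    }

    /** Resolve a sequence of character codes to the text they represent. */
    String decode(int[] charCodes) {
        StringBuilder text = new StringBuilder();
        for (int code : charCodes)
            text.append(codeToText.getOrDefault(code, "\uFFFD")); // U+FFFD for unmapped codes
        return text.toString();
    }
}
```

So a hypothetical mapping like `mapCode(0x41, "ti")` would make the single byte 0x41 decode to the two letters "ti".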

gsautter commented 3 years ago

Solved the character width (and the entailed word positioning) problem ... it was just a freaky way of splitting the font data up into multiple objects that needed handling ... I wasn't even aware that's legal, and it surely isn't frequent, as this is the first instance in over 30K PDFs we've run through the decoder so far ... anyway, that part is fixed now, comes with the next build.
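For readers wondering what "split up into multiple objects" means in practice, here is a rough sketch of the situation: the widths of one logical font arrive spread over several objects, each covering a range of character codes, and have to be merged before word positions can be computed. Names and data structures are hypothetical, not the decoder's actual model.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative sketch: merging per-object width fragments of one font into a
// single width table indexed by character code.
final class FontWidthMerger {

    static final class WidthFragment {
        final int firstCode;   // first character code this fragment covers
        final float[] widths;  // one width (in 1/1000 em) per code
        WidthFragment(int firstCode, float[] widths) {
            this.firstCode = firstCode;
            this.widths = widths;
        }
    }

    /** Merge width fragments into one array indexed by character code. */
    static float[] merge(List<WidthFragment> fragments, int maxCharCode, float defaultWidth) {
        float[] merged = new float[maxCharCode + 1];
        Arrays.fill(merged, defaultWidth);
        for (WidthFragment f : fragments)
            for (int c = 0; c < f.widths.length; c++)
                merged[f.firstCode + c] = f.widths[c];
        return merged;
    }
}
```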

gsautter commented 3 years ago

Looked closer into the fonts, and they contain a good bunch of duplicate characters ... the result of DjVu compression, from what it looks like ... will consider that in the upcoming "hybrid" PDF mode ... maybe name that mode "PDF (scanned & DjVu compressed)".
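A small sketch of how such duplicate characters could be folded together, assuming the glyphs are available as bitmaps: codes whose rendered bitmaps are identical get mapped to one canonical code. This is illustrative only; the actual decoder may well compare glyph outlines or use fuzzier matching.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch: deduplicate glyphs that are stored under several
// character codes but render to exactly the same bitmap.
final class GlyphDeduplicator {

    /** Map every character code to the lowest code that has the same glyph bitmap. */
    static Map<Integer, Integer> deduplicate(Map<Integer, boolean[][]> glyphBitmaps) {
        Map<String, Integer> canonicalBySignature = new HashMap<>();
        Map<Integer, Integer> canonicalByCode = new HashMap<>();
        // iterate codes in ascending order so the lowest code becomes canonical
        for (Map.Entry<Integer, boolean[][]> e : new TreeMap<>(glyphBitmaps).entrySet()) {
            String signature = Arrays.deepToString(e.getValue());
            Integer canonical = canonicalBySignature.putIfAbsent(signature, e.getKey());
            canonicalByCode.put(e.getKey(), (canonical == null) ? e.getKey() : canonical);
        }
        return canonicalByCode;
    }
}
```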