Open myrmoteras opened 3 years ago
Another very strange one ... from what it looks like, it's what I've come to call a "hybrid" PDF: it originates from scanning, but retains only the page background, not the actual image of the text, and represents the text exclusively the way a born-digital PDF does (most likely as a means of compression).
Decoding such PDFs requires a yet-to-be-built hybrid mode that reads the PDF as born-digital (i.e., without expecting words to have a counterpart in the scan), but does not turn the text into image labels even though it sits on top of a bitmap image.
For now, decoding as born-digital is the way to go for such PDFs, though that's only a general observation about this type of PDF.
Decoding this PDF as born-digital (there's even a good bunch of vector graphics, so it's fairly well digitized), words come up OK when testing with individual pages, albeit with some apparent positioning problems.
The error log you sent did shed some light, though: like any other PDF, this one uses fonts to represent text, but it seems to have too many characters crammed into a single font, namely over 255, so the decoder simply runs out of byte values to represent the characters with. This basically means I'll have to find a way of accommodating fonts with 256+ characters without breaking our existing IMFs, and also adjust all character code handling logic to cope with situations other than 2 hex digits of character code per word character ... will take some thinking, to say the least ...
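To make the character code problem concrete, here is a minimal Python sketch, not the decoder's actual code. It assumes character codes are stored as a fixed-width hex string per word (2 digits per character today) and shows one possible way to widen that to 4 digits only for fonts that need it, so existing 2-digit data would stay readable. The function names and the per-font `width` parameter are hypothetical.

```python
def encode_codes(codes, width=None):
    """Encode character codes as a fixed-width uppercase hex string.

    Hypothetical scheme: 2 hex digits per character while every code
    fits into one byte, 4 hex digits as soon as any code exceeds 0xFF.
    """
    if width is None:
        width = 2 if max(codes, default=0) <= 0xFF else 4
    if any(code < 0 or code >= 16 ** width for code in codes):
        raise ValueError(f"character code does not fit into {width} hex digits")
    return "".join(f"{code:0{width}X}" for code in codes)


def decode_codes(hex_string, width=2):
    """Split a hex string back into character codes.

    The width has to be known from somewhere else (assumed here to be
    stored once per font), since it cannot be read off the string itself.
    """
    if len(hex_string) % width != 0:
        raise ValueError("hex string length is not a multiple of the code width")
    return [int(hex_string[i:i + width], 16) for i in range(0, len(hex_string), width)]


# A font with at most 256 characters keeps the current 2-digit form.
legacy = encode_codes([0x48, 0x69])
# A font with a code above 0xFF switches the whole word to 4 digits.
wide = encode_codes([0x48, 0x12C])
```

Keeping the width per font rather than per word is just one option; the point is only that the decoder must stop assuming two hex digits per word character.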
The word positioning problems seem to stem from erroneous character width handling ...
Looking at the fonts more closely, the large number of characters per font seems to be a result of extreme compression, namely of encoding frequent combinations of multiple letters as a single character code ... is there a way of adjusting this in ABBYY, maybe?
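As an illustration of that kind of font, here is a small Python sketch in which a single character code can stand for several letters. The mapping is made up for the example and is not taken from this PDF; it only shows why the number of codes in a word need not match the number of letters, and why such a font can easily end up with a very large number of codes.

```python
def decode_word(codes, font_map):
    """Turn a sequence of character codes into text.

    Each code may map to one letter or to a whole letter combination.
    """
    try:
        return "".join(font_map[code] for code in codes)
    except KeyError as err:
        raise ValueError(f"no mapping for character code {err.args[0]}") from err


# Hypothetical font: frequent letter combinations get their own code.
FONT_MAP = {0x01: "th", 0x02: "e", 0x03: "ing", 0x04: "r"}

word = decode_word([0x01, 0x02], FONT_MAP)         # 2 codes, 3 letters
longer = decode_word([0x01, 0x02, 0x04], FONT_MAP)  # 3 codes, 5 letters
```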
Solved the character width (and the entailed word positioning) problem ... it was just a freaky way of splitting up font data into multiple objects that needed handling. I wasn't even aware that's legal, and it surely isn't frequent, as this is the first instance in over 30K PDFs we've run through the decoder so far ... anyway, that part is fixed now and comes with the next build.
Looked closer into the fonts, and they contain a good bunch of duplicate characters ... the result of a DjVu compression, from what it looks like ... will consider that in the upcoming "hybrid" PDF mode ... maybe name that mode "PDF (scanned & DjVu compressed)".
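One way the duplicate characters could be handled is to collapse them onto a single code before further processing. The Python sketch below is only an assumption about how that might look: it treats two characters as duplicates when both their text and their glyph data are exactly identical, and maps each code to the lowest code carrying the same glyph. The data layout and function name are hypothetical.

```python
def build_canonical_map(glyphs):
    """Map every character code to the lowest code with an identical glyph.

    glyphs: dict of code -> (text, glyph_bytes); both parts must match
    exactly for two codes to count as duplicates (an assumption, real
    duplicates may differ slightly in shape).
    """
    first_seen = {}
    canonical = {}
    for code in sorted(glyphs):
        key = glyphs[code]
        # setdefault returns the earlier code if this glyph was seen before
        canonical[code] = first_seen.setdefault(key, code)
    return canonical


# Hypothetical font with the letter "e" stored twice under different codes.
glyphs = {
    1: ("e", b"\x0f"),
    2: ("t", b"\xf0"),
    3: ("e", b"\x0f"),
}
canonical = build_canonical_map(glyphs)
distinct = len(set(canonical.values()))
```

If enough duplicates collapse this way, a font might even drop back below 256 distinct characters, though that would need checking against the actual file.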
GgImagine.20210617-1032.err.zip
This file does not open at all.
The file (36 MB) is here: https://drive.google.com/file/d/11xjsoiome_cPwMd0H9O4AsSMa910fkry/view?usp=sharing