gsautter / goldengate-imagine

Automatically exported from code.google.com/p/goldengate-imagine

taxon: conversion producing no words as token, space between each letter #916

Open myrmoteras opened 4 years ago

myrmoteras commented 4 years ago

image

This is using render glyphs only.

taxon.69.3.567-577.pdf

gsautter commented 4 years ago

Reproduced ... will investigate.

I must say this is a strange PDF, though ... every second page is blank in Acrobat, and even when copy&pasting the text from Acrobat, there are some strange spaces in the middle of some words ...

gsautter commented 4 years ago

Looks like a problem with the implicit spacing feature ... a rarely used countermeasure against a strange obfuscation technique, introduced in reaction to another strange PDF in which the words clung together because the spaces were rendered as part of the characters proper ... and here it seems to backfire.

gsautter commented 4 years ago

Turns out the implicit space detection feature is the culprit, with one font just over the threshold by a coat of paint ... after increasing said threshold, the PDF decodes fine.

Still left to wonder why this PDF renders all the characters individually, though ... this is pretty excessive, using far more rendering commands than actually required. Individual char rendering is also what triggers implicit space detection in the first place, as in most PDFs words render as coherent units.
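
For readers unfamiliar with the idea, here is a rough sketch of how a per-font threshold check of this kind could look. This is not the GoldenGATE Imagine code: the `Glyph` class, the gap-to-width ratio, and the threshold values 0.25 and 0.30 are all assumptions made up for illustration. It only shows how a font that sits barely over a threshold gets flagged, and how raising the threshold un-flags it.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Glyph:
    char: str
    font: str
    left: float   # x position at which the glyph is drawn
    width: float  # visible width of the glyph itself


def gap_ratios(glyphs):
    """Mean gap to the next glyph, relative to glyph width, per font."""
    ratios = {}
    for cur, nxt in zip(glyphs, glyphs[1:]):
        if cur.font != nxt.font or cur.width <= 0:
            continue
        gap = nxt.left - (cur.left + cur.width)
        ratios.setdefault(cur.font, []).append(gap / cur.width)
    return {font: mean(values) for font, values in ratios.items()}


def implicit_space_fonts(glyphs, threshold):
    """Fonts whose mean gap ratio exceeds the (assumed) threshold."""
    return {font for font, ratio in gap_ratios(glyphs).items() if ratio > threshold}


# Three individually rendered glyphs, each 4.0 wide and placed 5.1 apart:
# the mean gap ratio is about 0.275, just over a threshold of 0.25.
glyphs = [Glyph(c, "F1", i * 5.1, 4.0) for i, c in enumerate("abc")]
print(implicit_space_fonts(glyphs, threshold=0.25))  # font F1 is flagged
print(implicit_space_fonts(glyphs, threshold=0.30))  # nothing is flagged
```

A check like this only makes sense for glyphs that are rendered one by one, which matches the observation that individual char rendering is what triggers the detection.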

gsautter commented 4 years ago

This might be a hint as to why the PDF is so strange (extracted from the raw PDF with a text editor):

<</Creator (Mozilla/5.0 \(Windows NT 10.0; Win64; x64\) AppleWebKit/537.36 \(KHTML, like Gecko\) Chrome/85.0.4183.83 Safari/537.36)
/Producer (Skia/PDF m85)
/CreationDate (D:20200903114512+00'00')
/ModDate (D:20200903114512+00'00')>>
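
The same lookup can be scripted instead of done in a text editor. The following is a minimal sketch, assuming the document information dictionary is stored uncompressed in the file (as it is here) and that its strings contain only backslash-escaped parentheses; the function name `read_info_entries` is made up for this example, and nested unescaped parentheses, octal escapes, and hex strings are not handled.

```python
import re

# Matches "/Key (value)" for a few well-known keys; the value may contain
# backslash-escaped characters such as \( and \).
INFO_ENTRY = re.compile(
    rb"/(Creator|Producer|CreationDate|ModDate)\s*\(((?:\\.|[^\\()])*)\)"
)


def read_info_entries(raw: bytes) -> dict:
    """Collect selected info dictionary entries from raw PDF bytes."""
    entries = {}
    for key, value in INFO_ENTRY.findall(raw):
        # Undo simple backslash escapes: \( becomes (, \) becomes ), etc.
        text = re.sub(rb"\\(.)", rb"\1", value)
        entries[key.decode("ascii")] = text.decode("latin-1")
    return entries
```

Usage would be along the lines of `read_info_entries(open(path, "rb").read())`, with `path` pointing at the PDF in question.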

Also, the table grid on page 569 actually renders as part of a page background figure that fills the whole page, and copy&pasting the illustrations from page 571 in Acrobat produces really strange results ... must say, all in all, a pretty shoddy PDF ...