plazi / GoldenGATE-Imagine

A GUI Tool For Freeing Text and Data from PDF Documents

file does not open: scanned: kunz1982 #15

Open myrmoteras opened 3 years ago

myrmoteras commented 3 years ago

kunz1982.pdf GgImagine.20210319-1218.out.zip

gsautter commented 3 years ago

Reproduced, investigating ...

However, the embedded OCR in this one is really good and also very accurately placed, so you can decode it without embedded OCR adjustment and take it from there ... will investigate the error either way.

gsautter commented 3 years ago

From the looks of it, the problem is actually the wildly entangled and overlapping "words" erroneously recognized inside the figures ... will try to think up a filtering mechanism ...

gsautter commented 3 years ago

Added a catch now for lines and words that overlap in patterns not usually occurring in actual text, so that line untangling leaves them alone. With this, the PDF in question decodes fine in all 4 embedded OCR modes.
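
For readers wondering what such a catch might look like, here is a minimal sketch in Python. It is not the actual GoldenGATE Imagine code: the `Box` class, the `looks_like_figure_noise` function, and both thresholds are hypothetical names and values chosen for illustration. The assumption is that words in real text lines rarely overlap each other by much, so a group in which many word boxes overlap heavily is probably OCR noise from a figure and should be skipped by line untangling.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Box:
    """Hypothetical word bounding box in page coordinates."""
    left: float
    top: float
    right: float
    bottom: float

    @property
    def area(self) -> float:
        return max(0.0, self.right - self.left) * max(0.0, self.bottom - self.top)


def overlap_area(a: Box, b: Box) -> float:
    """Area of the intersection of two boxes, 0.0 if they do not intersect."""
    width = min(a.right, b.right) - max(a.left, b.left)
    height = min(a.bottom, b.bottom) - max(a.top, b.top)
    return width * height if width > 0 and height > 0 else 0.0


def looks_like_figure_noise(words, min_overlap=0.25, max_fraction=0.3):
    """Return True if too many words overlap in ways unusual for real text.

    A word counts as tangled when it shares at least `min_overlap` of the
    smaller box's area with some other word. If more than `max_fraction`
    of the words are tangled, the group is treated as figure noise.
    Both thresholds are illustrative guesses, not tuned values.
    """
    if len(words) < 2:
        return False
    tangled = set()
    for (i, a), (j, b) in combinations(enumerate(words), 2):
        smaller = min(a.area, b.area)
        if smaller > 0 and overlap_area(a, b) / smaller >= min_overlap:
            tangled.update((i, j))
    return len(tangled) / len(words) > max_fraction


# Words laid out side by side, as in an ordinary text line.
clean_line = [Box(0, 0, 40, 10), Box(45, 0, 80, 10), Box(85, 0, 120, 10)]
# Boxes stacked on top of each other, as OCR tends to produce inside figures.
figure_noise = [Box(0, 0, 40, 10), Box(10, 2, 50, 12), Box(20, 4, 60, 14)]

if not looks_like_figure_noise(clean_line):
    pass  # safe to run line untangling on this group
```

A caller would run the check per candidate line or block and simply leave flagged groups untouched, which matches the "leaves them alone" behavior described above.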