gsautter / goldengate-imagine

Automatically exported from code.google.com/p/goldengate-imagine

PDF causing problems #90

Open myrmoteras opened 7 years ago

myrmoteras commented 7 years ago

THIS ISSUE IS A REMINDER FOR A PDF WE MIGHT WANT TO LOOK AT IF SOME TIME IS AVAILABLE.

spicer1977a.pdf: this scanned and OCRed PDF causes all sorts of problems:

most lines are split in many blocks

tables are annotated where they do not need to be

font issues: words set in small caps (larger and smaller capitals) are not recognized as one word

references don't parse

figureCitations cannot be linked to figures/figure citation

gsautter commented 7 years ago

see #89 for some of the points ...

Not sure about the figure citations, that might well be a downstream effect of captions starting "F IGURE" rather than "FIGURE" ...

The references might well be yet another downstream problem of the small-caps word split-up issue ... the author names are set in small caps, just like the caption starts, so a space gets inserted in the XML (which reference detection and parsing work on), and that throws pattern matching off track.
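To illustrate the small-caps split described above, here is a minimal sketch of a text-level repair. It is not what GoldenGATE Imagine does; the function name `merge_small_caps` and the regular expression are assumptions made for illustration only. It joins a single capital letter to an immediately following all-caps word, which covers cases like "F IGURE" and a split author surname.

```python
import re

# Hypothetical helper, not part of GoldenGATE Imagine.
# Matches a lone capital letter, one space, then a run of 2+ capitals.
_SPLIT_SMALL_CAPS = re.compile(r"\b([A-Z]) ([A-Z]{2,})\b")

def merge_small_caps(text: str) -> str:
    """Join 'F IGURE' into 'FIGURE' and 'S PICER' into 'SPICER'."""
    return _SPLIT_SMALL_CAPS.sub(r"\1\2", text)

caption = merge_small_caps("F IGURE 1. Wing venation")
author = merge_small_caps("S PICER, G. S. 1977")
```

A purely textual rule like this would also merge a legitimate phrase such as "A STUDY", so a real fix would probably need the font size information from the page image rather than the extracted text alone.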

Lines splitting into multiple blocks is likely due to lavish spacing, and in that respect closely related to the false positives in table detection ... I tend to assume that mostly two-line paragraphs are affected here? With the wide word spacing introduced by justifying the first line, block splitting can easily overshoot ... most likely towards the right side of the page, where the second line might already have ended.
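The suspected failure mode can be shown with a toy example. This is only a sketch under assumed names and numbers: `split_line`, the word boxes, and the threshold of 10 units are all made up for illustration and do not reflect the actual block splitting routine.

```python
from typing import List, Tuple

Word = Tuple[float, float]  # (left, right) x-coordinates of a word box

def split_line(words: List[Word], max_gap: float) -> List[List[Word]]:
    """Split a line wherever the gap between neighbouring words exceeds max_gap."""
    if not words:
        return []
    chunks = [[words[0]]]
    for prev, cur in zip(words, words[1:]):
        if cur[0] - prev[1] > max_gap:
            chunks.append([cur])    # gap too wide: start a new chunk
        else:
            chunks[-1].append(cur)  # normal gap: stay in the current chunk
    return chunks

# Normally spaced line: every gap is 4 units.
normal = [(0, 30), (34, 60), (64, 100), (104, 140)]
# Justified line: same word widths, stretched so every gap is 14 units.
justified = [(0, 30), (44, 70), (84, 120), (134, 170)]

normal_chunks = split_line(normal, max_gap=10)
justified_chunks = split_line(justified, max_gap=10)
```

With a fixed threshold, the normally spaced line stays in one piece, while every word of the justified line ends up in its own chunk.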

A dedicated block spacing analysis tool (to be built in conjunction with the word merging tool suggested in #89) might help here ... it would gather some statistics about (a) the horizontal distance between blocks or paragraphs with 5 or more lines (which should be less susceptible to wrongful vertical splitting), (b) the predominant number of text columns (most likely in terms of height), and (c) the distance between words within individual paragraphs. Based on that, it should be possible to correct said block splitting issues. This might even go into the block splitting routine for scanned PDFs ...
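A rough sketch of how statistics (a) and (c) could be gathered and used follows. Everything here is an assumption for illustration: the data layout, the function names, and especially the decision rule in `is_spurious_split` (nearer to the typical word gap than to the typical column gap) are hypothetical and not taken from GoldenGATE Imagine.

```python
from statistics import median
from typing import List, Optional, Tuple

Word = Tuple[float, float]  # (left, right) x-coordinates of a word box
Line = List[Word]
Paragraph = List[Line]

def word_gaps(line: Line) -> List[float]:
    """Horizontal gaps between neighbouring words in one line."""
    return [b[0] - a[1] for a, b in zip(line, line[1:])]

def typical_word_gap(paragraphs: List[Paragraph], min_lines: int = 5) -> Optional[float]:
    """Median word gap, measured only in paragraphs with min_lines or more lines."""
    gaps = [g for p in paragraphs if len(p) >= min_lines
            for line in p for g in word_gaps(line)]
    return median(gaps) if gaps else None

def is_spurious_split(gap: float, word_gap: float, column_gap: float) -> bool:
    """Assumed rule: a gap nearer to the word gap than to the column gap is a split error."""
    return abs(gap - word_gap) < abs(gap - column_gap)

# Toy data: one 5-line paragraph with gaps of 4, one justified 2-line paragraph.
long_par = [[(0, 30), (34, 60), (64, 100)] for _ in range(5)]
short_par = [[(0, 30), (44, 70), (84, 120)], [(0, 30), (34, 60)]]
word_gap = typical_word_gap([long_par, short_par])

# Assumed horizontal distances between blocks of 5+ lines, i.e. statistic (a).
column_gap = median([20, 22, 21])
```

The short paragraph is deliberately left out of the word gap estimate, since its justified first line would skew it. Where exactly to draw the line between a word gap and a column gap is the part that would need the most experimenting.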

However, the latter tool involves quite a bit of gathering data and drawing conclusions from it, which means I'll have to experiment a lot before getting the tuning right - might be something to tackle when I have a week or so off.