mgmeyers / pdfannots2json

GNU Affero General Public License v3.0

Bug/FR: Force two-column extraction #12

Closed chrisgrieser closed 2 years ago

chrisgrieser commented 2 years ago

Not sure whether this is a bug or a feature request, but could there be an option to force two-column extraction for a PDF? Afair, one of the nifty things about pdfannots2json was that it recognizes this automatically, but I have a two-column PDF which is treated as a one-column PDF, meaning the order of citations is all jumbled up.

Here is a two-page PDF sample, along with the (beautified) JSON output I get. The annotations are ordered by their y-position, rather than going through one column and then the next. sample.zip
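
To illustrate the ordering being asked for, here is a minimal sketch of a column-aware sort. It is not pdfannots2json's actual code: the `Annotation` fields, the `sort_two_column` helper, and the "split at the page midline" rule are all assumptions made for illustration, and `y` is assumed to be measured from the top of the page.

```python
from dataclasses import dataclass


@dataclass
class Annotation:
    # Hypothetical fields, not the tool's real JSON schema.
    page: int
    x: float  # left edge of the annotation
    y: float  # distance from the top of the page (assumption)
    text: str


def sort_two_column(annotations, page_width):
    """Order annotations by page, then left column before right, then top to bottom."""
    midline = page_width / 2

    def key(a):
        column = 0 if a.x < midline else 1
        return (a.page, column, a.y)

    return sorted(annotations, key=key)


annots = [
    Annotation(1, 320, 100, "right-top"),
    Annotation(1, 50, 400, "left-bottom"),
    Annotation(1, 50, 100, "left-top"),
]
ordered = [a.text for a in sort_two_column(annots, page_width=612)]
```

Sorting by `y` alone would interleave the two columns, which matches the jumbled order described above. A fixed midline is the simplest possible rule and would misplace annotations that span both columns.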

mgmeyers commented 2 years ago

Hmm, so it looks like this PDF is malformed somehow. I see a ton of parsing errors in the output, and somehow the Y axis is getting flipped for the annotations. In most PDFs 0 is the bottom of the page, but for the highlights in this PDF, 0 seems to be oriented to the top. Not sure if there's much I can do here.
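
As a rough sketch of why a flipped Y axis scrambles the order: if most coordinates are measured from the bottom of the page and some from the top, they have to be converted to one convention before sorting. The helper names below are hypothetical, and this is not the fix that was applied in pdfannots2json, only an illustration of the coordinate conversion.

```python
def to_bottom_origin(y_from_top, page_height):
    """Convert a y measured from the top of the page to one measured from the bottom."""
    return page_height - y_from_top


def reading_order(ys_bottom_origin):
    """With a bottom origin, larger y is higher on the page, so sort descending."""
    return sorted(ys_bottom_origin, reverse=True)


page_height = 792  # US Letter height in points, used here only as an example value

# Two highlights stored with a top origin: the first is near the top of the page.
top_origin_ys = [100, 600]

# Treating them as bottom-origin values puts the lower highlight first.
wrong = reading_order(top_origin_ys)

# Converting first restores the intended top-to-bottom order.
converted = [to_bottom_origin(y, page_height) for y in top_origin_ys]
right = reading_order(converted)
```

Here `wrong` starts with 600, the highlight that is actually lower on the page, while `right` starts with the converted value for the highlight at the top.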

mgmeyers commented 2 years ago

Never mind, I managed to track down the bug.

chrisgrieser commented 2 years ago

😂 thanks!

mgmeyers commented 2 years ago

@chrisgrieser The new version is on Homebrew. Try running brew upgrade and see if it fixes the issue for you.

chrisgrieser commented 2 years ago

yep, works now! Thanks a lot 🥳