ropensci / pdftools

Text Extraction, Rendering and Converting of PDF Documents
https://docs.ropensci.org/pdftools

A sentence from columns #95

Open vanushkin opened 3 years ago

vanushkin commented 3 years ago

Dear developers, I'm having the following issue: when processing PDFs whose text is laid out in columns, I get sentences made up of lines merged across those columns. This makes a mess of the text. Is there a solution to this problem, or a hint on how I can retain the structure of the original text?

MarcinKosinski commented 2 years ago

@vanushkin please take a look at the tabulizer R package, which deals with this.

aourednik commented 1 year ago

@MarcinKosinski I would love to try this solution, but tabulizer has been removed from CRAN, and it has a Java JAR dependency whose execution is blocked by default on the computers in my office. There is no chance of getting the sysadmins to unblock it. When I export a well-formed PDF "as txt" from Adobe Acrobat, the text flow is respected even though there are 2 columns. There must be something in the PDF's inner markup that identifies the text flow. Couldn't pdftools get the text flow from that information?

jeroen commented 1 year ago

Actually, this is not stored in the PDF's inner markup (see https://ropensci.org/blog/2018/12/14/pdftools-20 ). I think tabulizer tries to guess the layout of columns and tables based on whitespace.
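
For illustration, here is a minimal sketch of that kind of position-based guessing, using only base R. It assumes a data frame of words with the `x`, `y` and `text` columns that `pdftools::pdf_data()` returns for a page. The helper name `reorder_two_columns`, the `split_x` boundary and the `y_tol` tolerance are hypothetical and not part of pdftools. It also assumes a simple two-column page whose column boundary you already know, so it is not a general layout detector.

```r
# Hypothetical helper (not part of pdftools): put the words of a
# two-column page into reading order using their coordinates.
# `words` needs columns x, y and text, as in pdftools::pdf_data(file)[[page]].
reorder_two_columns <- function(words, split_x, y_tol = 2) {
  # Assign each word to the left (1) or right (2) column by its left edge.
  words$column <- ifelse(words$x < split_x, 1L, 2L)
  # Bucket y positions so words on the same baseline share a line id.
  words$line <- round(words$y / y_tol)
  # Column 1 top to bottom, then column 2; left to right within a line.
  words <- words[order(words$column, words$line, words$x), ]
  # Collapse each (column, line) group into one string, keeping that order.
  key <- paste(words$column, words$line)
  lines <- tapply(words$text, factor(key, levels = unique(key)),
                  paste, collapse = " ")
  unname(as.vector(lines))
}

# Made-up word positions standing in for one page of pdf_data() output.
page <- data.frame(
  x = c(50, 90, 300, 340, 50, 90, 300, 340),
  y = c(100, 100, 100, 100, 112, 112, 112, 112),
  text = c("Left", "one", "Right", "one", "Left", "two", "Right", "two"),
  stringsAsFactors = FALSE
)

# split_x = 200 is an assumed column boundary for this toy page.
result <- reorder_two_columns(page, split_x = 200)
print(result)
```

On this toy input the left column's lines come out before the right column's, instead of being merged row by row. With a real document you would pass one element of `pdftools::pdf_data("your-file.pdf")` as `words` and choose `split_x` by inspecting the `x` values, for example by looking for a gap in their distribution.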

aourednik commented 1 year ago

@jeroen I've tried with a PDF file generated by Illustrator (see attached file). Despite the layout's relative complexity, Acrobat recognizes the order of the frames I've defined. This flow order must be stored somewhere, otherwise this would not be possible. Acrobat cannot just guess this on the fly.

Perhaps some inner markup elements specific to Acrobat products?

test-text-flow.pdf