`pdPageExtractText` should support multi-column documents

sambitdash / PDFIO.jl

PDF Reader Library for Native Julia.

`pdPageExtractText` should support multi-column documents #17

Open sambitdash opened 6 years ago

sambitdash commented 6 years ago

This implementation may need to be reviewed along with #2. Although the two may not overlap exactly, in some cases the implementation logic can be similar.

Nosferican commented 3 years ago

Is there currently any way to do this?

sambitdash commented 3 years ago

Not really. You can manually inspect the position of every text run and check whether the runs form a column. The PDF specification does not provide any structural hints for this.
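
As a rough illustration of that manual approach, here is a minimal sketch in Python. It assumes you have already obtained one bounding box per text run by some means; the `TextRun` record, the function names, and the `min_gap` threshold are all hypothetical and are not part of PDFIO's API. The idea is to project the runs onto the x axis, treat any sufficiently wide horizontal band that no run covers as a gutter, and then read each column top to bottom.

```python
from dataclasses import dataclass


@dataclass
class TextRun:
    # Hypothetical record, not a PDFIO type: a bounding box in PDF user
    # space (y grows upward) plus the decoded text of the run.
    x0: float  # left edge
    y0: float  # bottom edge
    x1: float  # right edge
    y1: float  # top edge
    text: str


def find_gutters(runs, min_gap=10.0):
    """Return (start, end) x-intervals at least min_gap wide that no run covers."""
    intervals = sorted((r.x0, r.x1) for r in runs)
    gutters = []
    if not intervals:
        return gutters
    covered_to = intervals[0][1]  # rightmost x covered so far
    for x0, x1 in intervals[1:]:
        if x0 - covered_to >= min_gap:
            gutters.append((covered_to, x0))
        covered_to = max(covered_to, x1)
    return gutters


def split_into_columns(runs, min_gap=10.0):
    """Group runs into columns, left to right, each sorted top to bottom."""
    # Use the midpoint of each gutter as the boundary between two columns.
    boundaries = [(a + b) / 2 for a, b in find_gutters(runs, min_gap)]
    columns = [[] for _ in range(len(boundaries) + 1)]
    for r in runs:
        # A run belongs to the column after the last boundary left of it.
        idx = sum(1 for b in boundaries if r.x0 >= b)
        columns[idx].append(r)
    for col in columns:
        # Descending top edge gives top-to-bottom order; ties go left to right.
        col.sort(key=lambda r: (-r.y1, r.x0))
    return columns


def extract_text(runs, min_gap=10.0):
    """Join run texts in column-major reading order, one run per line."""
    columns = split_into_columns(runs, min_gap)
    return "\n".join(r.text for col in columns for r in col)
```

This heuristic is deliberately naive: a full-width heading or footer that spans both columns covers the gutter and hides it, so such runs would need to be filtered out or the page split into horizontal bands first. The right value for `min_gap` also depends on the document's layout.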

vargonis commented 1 year ago

On a related note, since by the nature of the format the output of `pdPageExtractText` is not fully determined, it would be useful to:

- Have access to character-level information (font, bounding box and so on).
- Document what the word inference and ordering heuristics are.

sambitdash commented 1 year ago

@vargonis you can use `pdPageEvalContent` to get the content tree. The content tree has all the bounding box information at the text-run level.