sambitdash / PDFIO.jl

PDF Reader Library for Native Julia.

Extract all Text Objects #83

Closed: Banguiskode closed 4 years ago

Banguiskode commented 4 years ago

Hi! First of all, thank you for this great tool. I have a question: I would like to extract a table (or a list) containing the text objects with their properties. Is that possible? Thanks.

sambitdash commented 4 years ago

@Banguiskode thank you for your interest in the library. Your expectations are captured as enhancements #2, #7, #11 and #17.

PDF as a specification has no simple mechanism for marking up tabular structures as tables; to recover one, you have to post-process the text positions extracted from the PDF file. While the library does not provide an explicit API for this, pdPageEvaluate can be extended to extract the text data along with their positions. The tagged PDF part of the specification does support representing tabular structure, but only a small portion of the PDF files in circulation implement it to any great extent. If you would like to contribute to PDFIO by implementing any of these features, we will be happy to accept PRs.
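To make the post-processing idea concrete, here is a minimal sketch in Python that is independent of PDFIO. It assumes the text fragments and their (x, y) origins have already been extracted by some means. The `Fragment` type, the `group_into_rows` function, and the `y_tol` tolerance are hypothetical names chosen for illustration, not part of any PDFIO API. Fragments whose y coordinates are close to each other are treated as one row, and each row is then ordered left to right.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Fragment:
    """A piece of text with the position of its origin (hypothetical type)."""
    x: float
    y: float
    text: str


def group_into_rows(fragments: List[Fragment], y_tol: float = 2.0) -> List[List[str]]:
    """Group fragments into rows by y position, then sort each row by x.

    A fragment joins the current row when its y coordinate is within
    y_tol of the first fragment in that row; otherwise it starts a new row.
    """
    rows: List[List[Fragment]] = []
    # Assumption: default PDF user space, where y grows upward,
    # so descending y reads the page from top to bottom.
    for frag in sorted(fragments, key=lambda f: -f.y):
        if rows and abs(rows[-1][0].y - frag.y) <= y_tol:
            rows[-1].append(frag)
        else:
            rows.append([frag])
    # Within each row, order the cells left to right.
    return [[f.text for f in sorted(row, key=lambda f: f.x)] for row in rows]


# Made-up sample positions, purely for illustration.
fragments = [
    Fragment(200.0, 700.5, "Qty"),
    Fragment(72.0, 700.0, "Item"),
    Fragment(72.0, 685.0, "Bolt"),
    Fragment(200.0, 684.6, "12"),
]
table = group_into_rows(fragments)
print(table)  # [['Item', 'Qty'], ['Bolt', '12']]
```

This only recovers rows. Assigning cells to columns would need a further step, such as clustering the x coordinates, and real documents with rotated or multi-line cells would need more care than this sketch takes.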

Since the intent of this issue is already captured in other issues, I will close it with this comment.

Banguiskode commented 4 years ago

Thank you very much for your answer!