sambitdash / PDFIO.jl

PDF Reader Library for Native Julia.

Table picker for PDF #2

Open sambitdash opened 7 years ago

sambitdash commented 7 years ago

Natural tabular objects in a PDF document should ideally be picked up for extraction.

The intent of the project is API development, hence it will be headless for the most part. Unlike in a reader, there may not be a WYSIWYG picker available. A heuristic table picker should scan the document for table-like structures and dump them in tabular HTML/CSS format or as extracted image objects. In case document tagging is enabled, the table picker can use the tagged text.
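To make the "dump them in tabular HTML" step concrete, here is a minimal sketch in Python. It is only an illustration, not PDFIO.jl code: the function name `table_to_html` and the assumption that a detected table arrives as a list of rows of strings are both hypothetical.

```python
from html import escape


def table_to_html(rows, header=True):
    """Render a list of rows (each a list of strings) as an HTML table.

    Hypothetical helper for illustration; it assumes table detection has
    already produced a rectangular grid of cell texts.
    """
    lines = ["<table>"]
    for i, row in enumerate(rows):
        # Treat the first row as a header row when requested.
        tag = "th" if header and i == 0 else "td"
        # Escape cell text so characters such as < and & stay valid HTML.
        cells = "".join(f"<{tag}>{escape(cell)}</{tag}>" for cell in row)
        lines.append(f"  <tr>{cells}</tr>")
    lines.append("</table>")
    return "\n".join(lines)


print(table_to_html([["Name", "Qty"], ["Apple", "3"]]))
```

Styling could then be attached through CSS classes on the `<table>` element; that part is left out of the sketch.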

hhaensel commented 2 years ago

I have written some lines of code to extract tabular data. Currently it uses keywords to determine which text layouts to include. I also managed to make a short IJulia notebook where you can interactively select text in a Plotly chart. @sambitdash Would you be interested in including that code in your package? Otherwise I might release my own package, but I feel that this functionality would fit nicely into PDFIO.

sambitdash commented 2 years ago

@hhaensel thank you for your interest. I want to understand how complex the cases are that this software can handle. If you submit a PR, I can review it and let you know if it is useful for this SDK.

hhaensel commented 2 years ago

Sounds perfect, I'll submit a PR tomorrow. The code extracts a vector of TextLayouts as a function of page(s) and keywords, then scans for common elements in rows and columns as a function of their layout box. The layout boxes can be scaled in order to reduce the probability of overlapping areas. Optionally a Plotly graph displays the elements and their recognised arrangement with a color code.
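The approach described above can be sketched roughly as follows. This is an illustrative reconstruction in Python, not the code from the planned PR and not the PDFIO.jl API: the `Box` type, the function names, and the default scale factor of 0.8 are all assumptions. It also assumes PDF-style coordinates in which y grows upward, so the topmost row has the largest y values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical stand-in for a text layout: text plus its layout box."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def scale(box, factor):
    """Scale a box around its centre; factor < 1 shrinks it."""
    cx, cy = (box.x0 + box.x1) / 2, (box.y0 + box.y1) / 2
    hw = (box.x1 - box.x0) / 2 * factor
    hh = (box.y1 - box.y0) / 2 * factor
    return Box(box.text, cx - hw, cy - hh, cx + hw, cy + hh)


def group_by_overlap(boxes, lo, hi):
    """Cluster boxes whose [lo, hi) intervals overlap along one axis."""
    groups = []
    for box in sorted(boxes, key=lo):
        if groups and lo(box) < groups[-1]["end"]:
            groups[-1]["members"].append(box)
            groups[-1]["end"] = max(groups[-1]["end"], hi(box))
        else:
            groups.append({"end": hi(box), "members": [box]})
    return [g["members"] for g in groups]


def to_table(boxes, factor=0.8):
    """Arrange boxes into a grid of cell texts (rows top to bottom)."""
    shrunk = [scale(b, factor) for b in boxes]
    rows = group_by_overlap(shrunk, lambda b: b.y0, lambda b: b.y1)
    cols = group_by_overlap(shrunk, lambda b: b.x0, lambda b: b.x1)
    col_of = {id(b): j for j, col in enumerate(cols) for b in col}
    table = []
    for row in reversed(rows):  # largest y first, i.e. top of the page
        cells = [""] * len(cols)
        for b in row:
            cells[col_of[id(b)]] = b.text
        table.append(cells)
    return table


boxes = [
    Box("Name", 10, 100, 50, 112), Box("Qty", 80, 100, 100, 112),
    Box("Apple", 10, 80, 52, 92), Box("3", 80, 80, 88, 92),
    Box("Pear", 10, 60, 45, 72), Box("12", 80, 60, 95, 72),
]
table = to_table(boxes)
print(table)
```

Shrinking the boxes before grouping is what keeps slightly overlapping neighbours apart: two lines whose boxes overlap by a point would merge into one row at factor 1.0 but separate at 0.8. The sketch does not handle cells that span several rows or columns.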

Looking forward to your feedback.

hhaensel commented 2 years ago

Sorry, currently in overload, will take some more time ...