py-pdf / pypdf_table_extraction

A Python library to extract tabular data from PDFs
https://pypdf-table-extraction.readthedocs.io
MIT License

pypdf_table_extraction (camelot) and gmft? #174

Open conjuncts opened 1 day ago

conjuncts commented 1 day ago

Hello,

Thank you so much for continuing the development of camelot! I'm glad to see that camelot continues to be maintained.

I also happen to manage a pdf extraction library, gmft. My goal is to encapsulate multiple pdf table extraction options into one consistent format. I think camelot is a great option for its high throughput and its focus on non-deep-learning detection. Consequently, I would love to support integration between camelot and gmft. I normally try to bridge the gap myself, but given the complexities and differing approaches of the two libraries, I think it would require some sort of mutual cooperation. Please let me know what you think!

Challenges:

Edit: this is not a bug, oops!

stefan6419846 commented 19 hours ago

Could you please elaborate why you consider the mentioned aspects challenges? pypdf_table_extraction is a library which you can call by its public APIs - IMHO it should not really matter how it is implemented internally and/or which external packages it depends on.

conjuncts commented 7 hours ago

Sure. This is how my library is structured:

Documents: PyMuPDFDocument and PyPDFium2Document
Detectors: TATRDetector and Img2TableDetector
Structure analysis: TATRFormatter, hopefully CamelotFormatter

Basically, it's the abstraction. From my perspective, pdf handling (ghostscript, poppler, pdfminer.six) can be abstracted into these features:

The pdf handling can be encapsulated into that common interface. After that point, things become interchangeable, and the table structure algorithm does not need to know the pdf handler method. I think it's nice to separate the pdf handling logic from the table recognition logic.
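
As a rough illustration of that separation (not gmft's or camelot's actual API), the idea might be sketched like this. `Word`, `PageSource`, `InMemoryPage`, and `words_in_region` are hypothetical names made up for the example; a real backend would wrap pymupdf or pypdfium2 rather than hold words in memory.

```python
from dataclasses import dataclass
from typing import Iterable, List, Protocol, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


@dataclass(frozen=True)
class Word:
    """A piece of text plus its bounding box on the page."""
    text: str
    bbox: BBox


class PageSource(Protocol):
    """Hypothetical common interface that any pdf backend would implement."""
    width: float
    height: float

    def words(self) -> Iterable[Word]:
        ...


class InMemoryPage:
    """Stand-in backend for the sketch; a real one would call a pdf parser."""

    def __init__(self, width: float, height: float, words: Iterable[Word]):
        self.width = width
        self.height = height
        self._words = list(words)

    def words(self) -> Iterable[Word]:
        return iter(self._words)


def words_in_region(page: PageSource, region: BBox) -> List[str]:
    """Table-side logic: it only sees PageSource, never the parser behind it."""
    rx0, ry0, rx1, ry1 = region
    return [
        w.text
        for w in page.words()
        if rx0 <= w.bbox[0] and w.bbox[2] <= rx1
        and ry0 <= w.bbox[1] and w.bbox[3] <= ry1
    ]


page = InMemoryPage(200, 100, [Word("a", (10, 10, 20, 20)),
                               Word("b", (150, 10, 160, 20))])
inside = words_in_region(page, (0, 0, 100, 100))
```

Swapping the backend then means writing another class with the same `words()` method; `words_in_region` stays unchanged.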

The challenge is that for camelot to be a detector or a formatter in my library, it would need to work with pymupdf or pypdfium2, even though camelot internally uses pdfminer.six/ghostscript/poppler.

I was worried that camelot would be tightly coupled to pdfminer.six, but on a closer look at lattice.py, it actually seems surprisingly doable. Focusing solely on getting lattice.py to work with pypdfium2:

The fact that LTObjects are stored internally (as horizontal_text, vertical_text, all_textlines) does make it messy, but might still be doable:
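
For instance, a thin shim could stand in for the stored layout objects. The sketch below mimics only `x0`, `y0`, `x1`, `y1`, and `get_text()`, which pdfminer.six text lines do expose; whether camelot needs nothing more than these is an untested assumption. `TextLineShim` and `shims_from_top_left` are hypothetical names, and the input format (tuples of `x0, top, x1, bottom, text` with a top-left origin) is also assumed for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class TextLineShim:
    """Duck-typed stand-in for a pdfminer.six text line (bottom-left origin)."""
    x0: float
    y0: float
    x1: float
    y1: float
    text: str

    def get_text(self) -> str:
        return self.text

    @property
    def bbox(self) -> Tuple[float, float, float, float]:
        return (self.x0, self.y0, self.x1, self.y1)


def shims_from_top_left(
    words: Iterable[Tuple[float, float, float, float, str]],
    page_height: float,
) -> List[TextLineShim]:
    """Convert (x0, top, x1, bottom, text) tuples into bottom-left-origin shims."""
    shims = []
    for x0, top, x1, bottom, text in words:
        # Flip the y axis: the bottom edge becomes y0 and the top edge becomes y1.
        shims.append(TextLineShim(x0, page_height - bottom, x1, page_height - top, text))
    return shims


lines = shims_from_top_left([(10.0, 20.0, 50.0, 30.0, "Total")], page_height=100.0)
```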

I guess camelot having external dependencies isn't really an issue - that's true. What remains difficult is adapting camelot to support an entirely different pdf parser. Beyond just integration into my library, I think there are also conceptual advantages to this abstraction. So I might send in a pull request, but obviously it will be very messy.