pypdf_table_extraction (camelot) and gmft?

conjuncts commented 1 day ago

Hello,

Thank you so much for continuing the development of camelot! I'm glad to see that camelot continues to be maintained.

I happen to also manage a pdf extraction library, gmft. My goal is to encapsulate multiple pdf table extraction options into one consistent format. I think camelot is a great option for its high throughput performance and focus on non-deep detection. Consequently, I would love to support integration between camelot and gmft. I normally try to bridge the gap myself, but given the complexities and differing approaches of the two libraries, I think it will require some sort of mutual cooperation. Please let me know what you think!

Challenges:

- My goal with gmft is to be pdf parser agnostic. I currently support pymupdf and pypdfium2. Meanwhile, pypdf_table_extraction uses solely pypdf.
  - (unless #89 goes through.)
- camelot has a different set of dependencies (ghostscript, poppler, sqlite3)

Edit: this is not a bug, oops!

stefan6419846 commented 19 hours ago

Could you please elaborate why you consider the mentioned aspects challenges? pypdf_table_extraction is a library which you can call by its public APIs - IMHO it should not really matter how it is implemented internally and/or which external packages it depends on.

conjuncts commented 7 hours ago

Sure. This is how my library is structured:

- Documents: PyMuPDFDocument and PyPDFium2Document
- Detectors: TATRDetector and Img2TableDetector
- Structure analysis: TATRFormatter, hopefully CamelotFormatter.

Basically, it's the abstraction. From my perspective, pdf handling (ghostscript, poppler, pdfminer.six) can be abstracted into these features:

- get word text content and bboxes
- get an image of the page

The pdf handling can be encapsulated into that common interface. After that point, things become interchangeable, and the table structure algorithm does not need to know the pdf handler method. I think it's nice to separate the pdf handling logic from the table recognition logic.
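
A minimal sketch of what that common interface could look like, in Python. The names `Word`, `PageSource`, `InMemoryPage`, and `count_words_in_region` are hypothetical and chosen only for illustration; they are not gmft's or camelot's actual API.

```python
from dataclasses import dataclass
from typing import List, Protocol, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


@dataclass(frozen=True)
class Word:
    """One word with its bounding box (hypothetical type)."""
    text: str
    bbox: BBox


class PageSource(Protocol):
    """The two features a table algorithm needs from any pdf handler (assumed interface)."""

    def words(self) -> List[Word]: ...

    def image(self, dpi: int = 72) -> bytes: ...


class InMemoryPage:
    """Toy backend standing in for a pymupdf- or pypdfium2-backed page."""

    def __init__(self, words: List[Word]):
        self._words = words

    def words(self) -> List[Word]:
        return list(self._words)

    def image(self, dpi: int = 72) -> bytes:
        return b""  # a real backend would rasterize the page here


def count_words_in_region(page: PageSource, region: BBox) -> int:
    """Table-side logic: it only sees the interface, never the parser."""
    rx0, ry0, rx1, ry1 = region
    return sum(
        1
        for w in page.words()
        if w.bbox[0] >= rx0 and w.bbox[1] >= ry0
        and w.bbox[2] <= rx1 and w.bbox[3] <= ry1
    )
```

Any object with matching `words()` and `image()` methods satisfies `PageSource`, so the table recognition side never has to import a specific pdf parser.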

The challenge is that for camelot to be a detector or a formatter in my library, it would need to work with pymupdf or pypdfium2, even though camelot internally uses pdfminer.six/ghostscript/poppler.

I was worried that camelot would be tightly coupled to pdfminer.six, but on a closer look at lattice.py, it actually seems surprisingly doable. Solely focusing on getting lattice.py to work with pypdfium2:

- backend can be swapped out
- only thing that relies on pdfminer.six is text_in_bbox_per_axis() from utils.py
  - the data comes from prepare_page_parse() in BaseParser
  - the text bboxes come from get_text_objects() from utils.py
    - gives LTChar, LTImage, LTTextLineHorizontal, LTTextLineVertical
    - could probably swap that with text objects from pymupdf or pypdfium2
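
As a rough illustration of that swap: pymupdf's `page.get_text("words")` yields tuples of the form `(x0, y0, x1, y1, word, block_no, line_no, word_no)` with the origin at the top-left of the page, whereas pdfminer.six layout objects use a bottom-left origin. The sketch below works on plain tuples in that shape, so it runs without pymupdf installed. `to_bottom_left` and `words_to_boxes` are hypothetical helper names, and page rotation and crop boxes are ignored.

```python
from typing import Iterable, List, Tuple

BBox = Tuple[float, float, float, float]


def to_bottom_left(bbox: BBox, page_height: float) -> BBox:
    """Flip a top-left-origin bbox into bottom-left-origin coordinates."""
    x0, top, x1, bottom = bbox
    # The visual bottom edge becomes the smaller y value after the flip.
    return (x0, page_height - bottom, x1, page_height - top)


def words_to_boxes(words: Iterable[tuple], page_height: float) -> List[Tuple[str, BBox]]:
    """Turn pymupdf-style word tuples into (text, bottom-left bbox) pairs."""
    out = []
    for x0, y0, x1, y1, text, *_ in words:
        out.append((text, to_bottom_left((x0, y0, x1, y1), page_height)))
    return out
```

The resulting pairs could then feed whichever text object camelot ends up expecting, whether a real LTTextLineHorizontal or a mock.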

The fact that LTObjects are stored internally (as horizontal_text, vertical_text, all_textlines) does make it messy, but might still be doable:

- initialize a LTTextLineHorizontal based on text and bbox
- or, try to mock these fields:
  - get_text(), bbox, x0, x1, y0, y1, height, width, matrix, more?
- make sure that the mocked LTObject works on every function that reads a textline (i.e. text_in_bbox, textlines_overlapping_bbox)
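
The mocking option could look roughly like the following. `MockTextLine` is a hypothetical duck-typed stand-in that exposes only some of the fields listed above; it is not a pdfminer.six class. `center_in_region` is a simplified illustrative check, not camelot's text_in_bbox. Whether camelot's helpers need further attributes (such as `matrix`) would have to be verified against utils.py.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class MockTextLine:
    """Duck-typed stand-in for a horizontal text line (hypothetical)."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def bbox(self) -> Tuple[float, float, float, float]:
        return (self.x0, self.y0, self.x1, self.y1)

    @property
    def width(self) -> float:
        return self.x1 - self.x0

    @property
    def height(self) -> float:
        return self.y1 - self.y0

    def get_text(self) -> str:
        return self.text


def center_in_region(line, region) -> bool:
    """Illustrative check: is the line's center inside region (x0, y0, x1, y1)?"""
    cx = (line.x0 + line.x1) / 2
    cy = (line.y0 + line.y1) / 2
    return region[0] <= cx <= region[2] and region[1] <= cy <= region[3]
```

Because `center_in_region` only reads `x0`, `y0`, `x1`, `y1`, it accepts either the mock or any real layout object exposing the same attributes, which is the property the mock would need to hold for every textline-reading function.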

I guess camelot having external dependencies isn't really an issue - that's true. What remains difficult is adapting camelot to support an entirely different pdf parser. Beyond just integration into my library, I think there are also conceptual advantages to this abstraction. So I might send in a pull request, but obviously it will be very messy.

pypdf_table_extraction (camelot) and gmft? #174