Open conjuncts opened 1 day ago
Could you please elaborate why you consider the mentioned aspects challenges? pypdf_table_extraction is a library which you can call by its public APIs - IMHO it should not really matter how it is implemented internally and/or which external packages it depends on.
Sure. This is how my library is structured:
Documents: PyMuPDFDocument and PyPDFium2Document
Detectors: TATRDetector and Img2TableDetector
Structure analysis: TATRFormatter, hopefully CamelotFormatter.
Basically, it's the abstraction. From my perspective, pdf handling (ghostscript, poppler, pdfminer.six) can be abstracted into these features:
The pdf handling can be encapsulated into that common interface. After that point, things become interchangeable, and the table structure algorithm does not need to know the pdf handler method. I think it's nice to separate the pdf handling logic from the table recognition logic.
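To make the idea of a common interface concrete, here is a minimal sketch. Every name in it (PageBackend, InMemoryPage, words_in_bbox) is hypothetical and made up for illustration; it is not the actual API of gmft or camelot, and a real backend would wrap pymupdf, pypdfium2 or pdfminer.six instead of an in-memory list.

```python
from dataclasses import dataclass
from typing import Iterator, List, Protocol, Tuple

# (x0, y0, x1, y1, text) for one word. Hypothetical layout, chosen for this sketch only.
Word = Tuple[float, float, float, float, str]


class PageBackend(Protocol):
    """Hypothetical interface that every pdf handler would implement."""

    width: float
    height: float

    def words(self) -> Iterator[Word]:
        """Yield every word on the page with its bounding box."""
        ...


@dataclass
class InMemoryPage:
    """Toy backend standing in for a real pdf-library wrapper."""

    width: float
    height: float
    items: List[Word]

    def words(self) -> Iterator[Word]:
        return iter(self.items)


def words_in_bbox(page: PageBackend, bbox: Tuple[float, float, float, float]) -> List[str]:
    """Table-side logic: it only sees PageBackend, never the pdf library."""
    x0, y0, x1, y1 = bbox
    return [
        text
        for (wx0, wy0, wx1, wy1, text) in page.words()
        if wx0 >= x0 and wy0 >= y0 and wx1 <= x1 and wy1 <= y1
    ]


page = InMemoryPage(100.0, 100.0, [(10, 10, 20, 20, "a"), (50, 50, 60, 60, "b")])
inside = words_in_bbox(page, (0, 0, 30, 30))
```

Because words_in_bbox depends only on the interface, swapping the pdf handler would not touch the table recognition code.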
The challenge is that for camelot to be a detector or a formatter in my library, it would need to work with pymupdf or pypdfium2, even though camelot internally uses pdfminer.six/ghostscript/poppler.
I was worried that camelot would be tightly coupled to pdfminer.six, but on a closer look at lattice.py, it actually seems surprisingly doable. Solely focusing on getting lattice.py to work with pypdfium2:
text_in_bbox_per_axis() from utils.py
prepare_page_parse() in BaseParser
get_text_objects() from utils.py
The fact that LTObjects are stored internally (as horizontal_text, vertical_text, all_textlines) does make it messy, but might still be doable:
get_text(), bbox, x0, x1, y0, y1, height, width, matrix, more?
I guess camelot having external dependencies isn't really an issue - that's true. What remains difficult is adapting camelot to support an entirely different pdf parser. Beyond just integration into my library, I think there are also conceptual advantages to this abstraction. So I might send in a pull request, but obviously it will be very messy.
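As a rough illustration of the LTObject attributes listed above, a text box coming from another parser could be wrapped in a small stand-in object. The class name TextBox is hypothetical, the sketch leaves out matrix, and whether camelot needs further attributes is exactly the open "more?" question.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class TextBox:
    """Hypothetical stand-in exposing a subset of the pdfminer.six LTObject attributes."""

    x0: float
    y0: float
    x1: float
    y1: float
    text: str

    @property
    def bbox(self) -> Tuple[float, float, float, float]:
        # Same ordering pdfminer.six uses: (x0, y0, x1, y1).
        return (self.x0, self.y0, self.x1, self.y1)

    @property
    def width(self) -> float:
        return self.x1 - self.x0

    @property
    def height(self) -> float:
        return self.y1 - self.y0

    def get_text(self) -> str:
        return self.text


# The coordinates would come from the other parser (e.g. pypdfium2); hard-coded here.
box = TextBox(10.0, 20.0, 40.0, 32.0, "Total")
```

Code that only reads these attributes would not need to know which parser produced the box.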
Hello,
Thank you so much for continuing the development of camelot! I'm glad to see that camelot continues to be maintained.
I happen to also manage a pdf extraction library, gmft. My goal is to encapsulate multiple pdf table extraction options into one consistent format. I think camelot is a great option for its high throughput performance and focus on non-deep detection. Consequently, I would love to support integration between camelot and gmft. I normally try to bridge the gap myself, but given the complexities and differing approaches of the two libraries, I think it will require some sort of mutual cooperation. Please let me know what you think!
Challenges:
Edit: this is not a bug, oops!