Closed Lucas-C closed 1 year ago
The PR is almost ready: #703
Hey! I'm Martin, the maintainer of pypdf and PyPDF2 :wave:
Do you think the table-feature could be added in a way that it's possible to read the table structure from the PDF (programmatically, without heuristics)?
I was thinking about "14.6 Marked Content", see https://accessible-pdf.info/basics/general/overview-of-the-pdf-tags
Thank you for reaching out @MartinThoma!
Yes, this is a really good suggestion. It shouldn't be difficult to add, as we already have the necessary building block: https://github.com/PyFPDF/fpdf2/blob/2.6.1/fpdf/fpdf.py#L3799
However, I am not sure how best to test that we implement this correctly... Would you recommend any tool I could use to check that table content can be properly extracted based on marked content? I only know https://github.com/camelot-dev/camelot, but it is not based on marked content tags.
Good question! I want to give those capabilities to pypdf in the long run, but right now we are not there yet.
Looking at some libraries:
I actually asked this several years ago and never received an answer: "How can I extract all PDF Tags related to content with Python?"
Thank you for the detailed answer @MartinThoma! I have also found this screenshot that illustrates table tagged elements:
I have just added a commit to PR https://github.com/PyFPDF/fpdf2/pull/703 related to this: 46bc617 (#703). It contains:
fpdf2, using camelot or tabula
I was not able to find examples of using pdfminer to extract tables from PDF docs.
Regarding PyMuPDF, the GitHub issue you pointed to seems to indicate that it does NOT support table data extraction.
For tika-python, I am going to wait for the answer to the question you asked.
Given that, among tools dedicated to PDF-tables extraction, none of them uses PDF tags / annotations in the process of doing their job, I am not sure that adding PDF tags is really worthwhile...
At least not in a systematic way.
An optional tag=True argument could later be added to FPDF.table(), but I don't think it's necessary in the initial version.
What do you think about this @MartinThoma?
Wow, you're amazing :heart_eyes:
> Regarding PyMuPDF, the GitHub issue you pointed to seems to indicate that it does NOT support table data extraction.
Oops, my bad, I mistyped :see_no_evil:
> Given that, among tools dedicated to PDF-tables extraction, none of them uses PDF tags / annotations in the process of doing their job, I am not sure that adding PDF tags is really worthwhile
Yes, I understand. It's a bit of a chicken-and-egg problem. Please don't forget that screen readers / accessibility solutions might use the tags as well. I think the tags were originally designed for them, but I have no detailed knowledge here.
> I don't think it's necessary in the initial version
I agree :+1:
Current situation
fpdf2 currently lets users employ the cell() & multi_cell() methods to build tables, as demonstrated in part 5 of our tutorial: https://pyfpdf.github.io/fpdf2/Tutorial.html#tuto-5-creating-tables
We also have some recipes regarding building tables in our documentation: https://pyfpdf.github.io/fpdf2/Tables.html
Based on the feedback in several table-related issues & discussions opened on this GitHub project, it seems to me that a FPDF.table() method would be very handy for our users.
Features
Ideally, the final implementation would provide the following set of features:
colspan)
fpdf/html.py by a call to this new FPDF.table() method
Method design
In issue #680 I pitched the following API for this feature:
Regarding this, feedback and alternative suggestions are very welcome! 😊 Here is what I like about this one:
table() context, which means that we'll be able to perform some calculations on the row heights / column widths based on all the table content provided
data object provided in one go to a table() method, while still making it easy to build a table based on such big data dictionary / sequence
table(). The image() method for example, with its 11 parameters, is becoming a bit difficult to apprehend.
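To illustrate the first point above (deferring layout calculations until all the table content is known), here is a minimal, standard-library-only sketch. It is not the API pitched in issue #680 and not fpdf2 code: the names `Table`, `Row`, `table()`, `col_widths` and `render()` are all hypothetical, and padded monospace text stands in for actual PDF rendering.

```python
from contextlib import contextmanager


class Row:
    """Hypothetical row builder: collects cell texts in insertion order."""

    def __init__(self):
        self.cells = []

    def cell(self, text):
        self.cells.append(str(text))


class Table:
    """Hypothetical table builder that defers layout until the context exits."""

    def __init__(self):
        self.rows = []
        self.col_widths = []  # only computed once every row is known

    def row(self):
        new_row = Row()
        self.rows.append(new_row)
        return new_row

    def _finalize(self):
        # All rows are available here, so each column can be sized
        # to its longest cell (character count stands in for real widths).
        n_cols = max((len(r.cells) for r in self.rows), default=0)
        self.col_widths = [
            max((len(r.cells[i]) for r in self.rows if i < len(r.cells)), default=0)
            for i in range(n_cols)
        ]

    def render(self):
        lines = []
        for r in self.rows:
            padded = [text.ljust(w) for text, w in zip(r.cells, self.col_widths)]
            lines.append(" | ".join(padded).rstrip())
        return "\n".join(lines)


@contextmanager
def table():
    t = Table()
    yield t
    # Runs when the "with" block ends normally: compute the layout once.
    t._finalize()


# Usage: rows are added one by one, with no single big data argument.
data = [("First name", "Age"), ("Jules", "34"), ("Mary", "45")]
with table() as t:
    for data_row in data:
        row = t.row()
        for datum in data_row:
            row.cell(datum)
print(t.render())
```

The same loop works whether the rows come from a big dictionary / sequence or are produced one at a time, and the column widths are still derived from the whole table when the context closes.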