New feature: FPDF.table()

Lucas-C commented 1 year ago

Current situation

fpdf2 currently lets users employ the cell() & multi_cell() methods to build tables, as demonstrated in part 5 of our tutorial: https://pyfpdf.github.io/fpdf2/Tutorial.html#tuto-5-creating-tables

We also have some recipes regarding building tables in our documentation: https://pyfpdf.github.io/fpdf2/Tables.html

Based on the feedback in several table-related issues & discussions opened on this GitHub project, it seems to me that a FPDF.table() method would be very handy for our users.

Features

Ideally, the final implementation would provide the following set of features:

support cells with content wrapping over several lines
control over column & row sizes, or by default let them be automatically computed
control over text alignment in cells, with rules by column or row
allow setting table headings, styled differently, but make this optional
control table width
honor the initial X / Y current position to render the table, and make it easy to center it on the page
handle splitting a table over page breaks, with headings repeated
allow embedding images in cells
control over borders: color, width & where they are drawn (e.g. allow not drawing the surrounding square, or drawing only the horizontal line above the headings, etc.). Also: control the thickness of the border below the headings
control over cell background, through a callback function to allow maximum customization
(bonus) allow for several cells to be merged horizontally (aka colspan)
(bonus) replace the table-building logic in fpdf/html.py by a call to this new FPDF.table() method
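
As an illustration of one item in the list above (splitting a table over page breaks with headings repeated), here is a minimal sketch. It is not fpdf2 code: `paginate_rows` is a hypothetical helper, and it assumes that every row has the same fixed height and that each page starts at the top.

```python
from typing import Iterator, List, Sequence

Row = Sequence[str]


def paginate_rows(
    headings: Row,
    rows: Sequence[Row],
    row_height: float,
    page_height: float,
) -> Iterator[List[Row]]:
    """Yield one list of rows per page, each page starting with the headings.

    Hypothetical helper: assumes a uniform row height, which a real
    implementation with wrapped cell content could not rely on.
    """
    if 2 * row_height > page_height:
        raise ValueError("a page must fit the headings plus at least one row")
    page: List[Row] = [headings]
    y = row_height  # vertical space used so far (the headings)
    for row in rows:
        if y + row_height > page_height:
            # The next row would overflow: close this page and
            # repeat the headings at the top of a new one.
            yield page
            page = [headings]
            y = row_height
        page.append(row)
        y += row_height
    yield page


pages = list(
    paginate_rows(
        ("Name", "City"),
        [("Jules", "Nantes"), ("Mary", "Rio"), ("Carl", "Oslo")],
        row_height=10,
        page_height=30,
    )
)
```

With wrapped content the row heights differ, so the same loop would need the height of each row computed before deciding where to break.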

Method design

In issue #680 I pitched the following API for this feature:

from fpdf import FPDF

pdf = FPDF()
with pdf.table() as table:
    table.col_widths = ...  # optional
    with table.row() as row:
        row.cell(...)  # or row.image(...)

Regarding this, feedback and alternative suggestions are very welcome! 😊 Here is what I like about this one:

it defers the actual table building & rendering to the end of the table() context, which means that we'll be able to perform some calculations on the row heights / column widths based on all the table content provided
it gives more flexibility to the user than having a huge data object provided in one go to a table() method, while still making it easy to build a table based on such a big data dictionary / sequence
requiring several method calls will allow us to "split" control parameters between those methods, and limit the number of parameters passed to table(). The image() method for example, with its 11 parameters, is becoming a bit difficult to grasp.
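
To make the first point concrete, here is a rough sketch of the deferred-rendering idea using only the standard library. All names (`table`, `TableSketch`, `RowSketch`) are hypothetical stand-ins, not the implementation proposed in the PR, and "rendering" is reduced to producing padded text lines. The point is only that column widths can be computed from all rows because nothing is laid out until the `with` block exits.

```python
from contextlib import contextmanager
from typing import Iterator, List


class RowSketch:
    """Collects the cell texts of one row (hypothetical, not fpdf2 code)."""

    def __init__(self) -> None:
        self.cells: List[str] = []

    def cell(self, text: str) -> None:
        self.cells.append(str(text))


class TableSketch:
    """Buffers rows, then lays them out once all content is known."""

    def __init__(self) -> None:
        self.rows: List[RowSketch] = []
        self.rendered: List[str] = []

    @contextmanager
    def row(self) -> Iterator[RowSketch]:
        row = RowSketch()
        yield row
        self.rows.append(row)  # stored only, nothing is drawn yet

    def render(self) -> None:
        if not self.rows:
            return
        n_cols = max(len(row.cells) for row in self.rows)
        widths = [0] * n_cols
        # Each column width depends on every row of the table.
        for row in self.rows:
            for i, text in enumerate(row.cells):
                widths[i] = max(widths[i], len(text))
        for row in self.rows:
            padded = [text.ljust(widths[i]) for i, text in enumerate(row.cells)]
            self.rendered.append(" | ".join(padded))


@contextmanager
def table() -> Iterator[TableSketch]:
    tbl = TableSketch()
    yield tbl
    tbl.render()  # layout happens when the with-block exits


with table() as tbl:
    for name, city in [("Jules", "Nantes"), ("Mary", "Rio")]:
        with tbl.row() as row:
            row.cell(name)
            row.cell(city)

print("\n".join(tbl.rendered))
```

Building the table from a big sequence stays a simple loop, as in the usage at the end of the block.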

Lucas-C commented 1 year ago

The PR is almost ready: #703

MartinThoma commented 1 year ago

Hey! I'm Martin, the maintainer of pypdf and PyPDF2 :wave:

Do you think the table-feature could be added in a way that it's possible to read the table structure from the PDF (programmatically, without heuristics)?

MartinThoma commented 1 year ago

I was thinking about "14.6 Marked Content", see https://accessible-pdf.info/basics/general/overview-of-the-pdf-tags
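
For readers unfamiliar with marked content, the sketch below shows roughly what it looks like in a page content stream: each cell is wrapped between a `BDC` operator (carrying a tag and an `MCID`) and an `EMC` operator. This is only an illustration, not fpdf2's implementation. `tagged_table_stream` is a hypothetical helper, it assumes the first row holds the headings, and it omits font and positioning operators as well as the structure tree that a real tagged PDF needs.

```python
from typing import List, Sequence


def tagged_table_stream(rows: Sequence[Sequence[str]]) -> List[str]:
    """Return content-stream lines wrapping each cell in marked content.

    Illustrative only: font selection and text positioning are omitted,
    and a real tagged PDF also needs a structure tree (Table > TR > TH/TD
    elements) that refers to these MCIDs.
    """
    lines: List[str] = []
    mcid = 0
    for row_index, row in enumerate(rows):
        # Assumption: the first row contains the table headings.
        tag = "TH" if row_index == 0 else "TD"
        for text in row:
            # Escape the characters that are special in PDF literal strings.
            escaped = (
                text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
            )
            lines.append(f"/{tag} <</MCID {mcid}>> BDC")  # begin marked content
            lines.append(f"BT ({escaped}) Tj ET")  # show the cell text
            lines.append("EMC")  # end marked content
            mcid += 1
    return lines


stream_lines = tagged_table_stream([["Name", "City"], ["Jules", "Nantes"]])
print("\n".join(stream_lines))
```

A reader tool would then walk the structure tree and use the MCIDs to find the content of each cell, without geometric heuristics.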

Lucas-C commented 1 year ago

Thank you for reaching out @MartinThoma!

Yes, this is a really good suggestion. It shouldn't be difficult to add, as we already have the necessary building block: https://github.com/PyFPDF/fpdf2/blob/2.6.1/fpdf/fpdf.py#L3799

However, I am not sure how best to test that we implement this right... Would you recommend any tool I could use to check that table content can be properly extracted based on marked content? I only know https://github.com/camelot-dev/camelot, but it is not based on marked content tags.

MartinThoma commented 1 year ago

Good question! I want to give those capabilities to pypdf in the long run, but right now we are not there yet.

Looking at some libraries:

Tika / PdfBox has it, but tika-python probably not: https://github.com/chrismattmann/tika-python/issues/393
pdfminer.six: They claim they support it, but I couldn't figure out how to use it https://github.com/pdfminer/pdfminer.six/issues/868
PyMuPDF seems not to be able to do it

I actually asked this several years ago and haven't received an answer: "How can I extract all PDF Tags related to content with Python?"

Lucas-C commented 1 year ago

Thank you for the detailed answer @MartinThoma! I have also found this screenshot that illustrates table tagged elements:

I have just added a commit to PR https://github.com/PyFPDF/fpdf2/pull/703 related to this: 46bc617 (#703). It contains:

unit tests ensuring tables can be extracted from PDF docs generated with fpdf2, using camelot or tabula
some guidelines in the documentation: https://github.com/PyFPDF/fpdf2/blob/table/docs/Tables.md#parsabilty-of-the-tables-generated

I was not able to find examples of using pdfminer to extract tables from PDF docs. Regarding PyMuPDF, the GitHub issue you pointed to seems to indicate that it does NOT support table data extraction. For tika-python, I am going to wait for the answer to the question you asked.

Given that, among tools dedicated to PDF-tables extraction, none of them uses PDF tags / annotations in the process of doing their job, I am not sure that adding PDF tags is really worthwhile... At least not in a systematic way. An optional tag=True argument could later be added to FPDF.table(), but I don't think it's necessary in the initial version.

What do you think about this @MartinThoma?

MartinThoma commented 1 year ago

Wow, you're amazing :heart_eyes:

Regarding PyMuPDF, the GitHub issue you pointed to seems to indicate that it does NOT support table data extraction.

Oops, my bad, I mistyped :see_no_evil:

Given that, among tools dedicated to PDF-tables extraction, none of them uses PDF tags / annotations in the process of doing their job, I am not sure that adding PDF tags is really worthwhile

Yes, I understand. It's a bit of a chicken-and-egg problem. Please don't forget that screen readers / accessibility solutions might use the tags as well. I think the tags were originally designed for them. But here I have no knowledge.

I don't think it's necessary in the initial version

I agree :+1:

py-pdf / fpdf2

New feature: FPDF.table() #701