py-pdf / fpdf2

Simple PDF generation for Python
https://py-pdf.github.io/fpdf2/
GNU Lesser General Public License v3.0

New feature: FPDF.table() #701

Closed Lucas-C closed 1 year ago

Lucas-C commented 1 year ago

Current situation

fpdf2 currently lets users build tables with the cell() & multi_cell() methods, as demonstrated in part 5 of our tutorial: https://pyfpdf.github.io/fpdf2/Tutorial.html#tuto-5-creating-tables

We also have some recipes for building tables in our documentation: https://pyfpdf.github.io/fpdf2/Tables.html

Based on the feedback in several table-related issues & discussions opened on this GitHub project, it seems to me that an FPDF.table() method would be very handy for our users.

Features

Ideally, the final implementation would provide the following set of features:

Method design

In issue #680 I pitched the following API for this feature:

from fpdf import FPDF

pdf = FPDF()
with pdf.table() as table:
    table.col_widths = ...  # optional
    with table.row() as row:
        row.cell(...)  # or row.image(...)
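
To make the nesting in this pitch concrete, here is a minimal standard-library sketch of how two nested context managers could collect rows and cells. `Document`, `Table` and `Row` are hypothetical stand-ins written for illustration only; they are not fpdf2 classes and say nothing about how the PR actually implements the feature.

```python
from contextlib import contextmanager


class Row:
    """Hypothetical helper: collects the cells of a single table row."""

    def __init__(self):
        self.cells = []

    def cell(self, text):
        self.cells.append(str(text))


class Table:
    """Hypothetical helper: collects rows until the table block exits."""

    def __init__(self):
        self.col_widths = None  # optional, mirroring the pitched API
        self.rows = []

    @contextmanager
    def row(self):
        row = Row()
        yield row
        # The row is registered only once its `with` block completes.
        self.rows.append(row.cells)


class Document:
    """Stand-in for FPDF, present only to make the sketch runnable."""

    def __init__(self):
        self.rendered = []

    @contextmanager
    def table(self):
        table = Table()
        yield table
        # On exit every row is known, so a real implementation could
        # compute column widths and draw the whole table at this point.
        self.rendered.append(table.rows)


doc = Document()
with doc.table() as table:
    for name, city in [("Jules", "Paris"), ("Mary", "Lyon")]:
        with table.row() as row:
            row.cell(name)
            row.cell(city)
```

One property of this shape is that nothing has to be drawn until the outer `with` block exits, which leaves room for layout decisions that depend on the full table content.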

Regarding this, feedback and alternative suggestions are very welcome! 😊

Here is what I like about this one:

Lucas-C commented 1 year ago

The PR is almost ready: #703

MartinThoma commented 1 year ago

Hey! I'm Martin, the maintainer of pypdf and PyPDF2 :wave:

Do you think the table feature could be added in such a way that it's possible to read the table structure from the PDF (programmatically, without heuristics)?

MartinThoma commented 1 year ago

I was thinking about "14.6 Marked Content", see https://accessible-pdf.info/basics/general/overview-of-the-pdf-tags
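
As a rough illustration of what "without heuristics" could mean: tagged PDF defines standard structure types for tables (`Table`, `TR`, `TH`, `TD`), so a reader could walk that tree instead of guessing cell boundaries from text positions. The sketch below uses a toy `StructElem` class, which is an assumption made purely for illustration; it is not a pypdf or fpdf2 object, and it does not parse a real PDF.

```python
from dataclasses import dataclass, field


@dataclass
class StructElem:
    """Toy stand-in for a PDF structure element (not a real parser object)."""

    tag: str  # standard structure type, e.g. "Table", "TR", "TH", "TD"
    text: str = ""
    kids: list = field(default_factory=list)


def table_rows(table):
    """Recover rows of cell text by walking Table -> TR -> TH/TD."""
    return [
        [cell.text for cell in tr.kids if cell.tag in ("TH", "TD")]
        for tr in table.kids
        if tr.tag == "TR"
    ]


# A hand-built tree standing in for what a tagged table might expose.
tagged_table = StructElem("Table", kids=[
    StructElem("TR", kids=[StructElem("TH", "Name"), StructElem("TH", "City")]),
    StructElem("TR", kids=[StructElem("TD", "Jules"), StructElem("TD", "Paris")]),
])
rows = table_rows(tagged_table)
```

With such tags in place, extraction becomes a plain tree traversal: rows and cells are read from the structure rather than inferred from coordinates.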

Lucas-C commented 1 year ago

Thank you for reaching out @MartinThoma!

Yes, this is a really good suggestion. It shouldn't be difficult to add, as we already have the necessary building block: https://github.com/PyFPDF/fpdf2/blob/2.6.1/fpdf/fpdf.py#L3799

However, I am not sure how best to test that we implement this correctly... Would you recommend any tool I could use to check that table content can be properly extracted based on marked content? I only know https://github.com/camelot-dev/camelot, but it is not based on marked content tags.

MartinThoma commented 1 year ago

Good question! I want to give those capabilities to pypdf in the long run, but right now we are not there yet.

Looking at some libraries:

I actually asked this several years ago and never received an answer: "How can I extract all PDF Tags related to content with Python?"

Lucas-C commented 1 year ago

Thank you for the detailed answer @MartinThoma! I have also found this screenshot that illustrates tagged table elements:

I have just added a commit to PR https://github.com/PyFPDF/fpdf2/pull/703 related to this: 46bc617 (#703). It contains:

I was not able to find examples of using pdfminer to extract tables from PDF docs. Regarding PyMuPDF, the GitHub issue you pointed to seems to indicate that it does NOT support table data extraction. For tika-python, I am going to wait for the answer to the question you asked.

Given that none of the tools dedicated to PDF table extraction uses PDF tags / annotations to do its job, I am not sure that adding PDF tags is really worthwhile... At least not systematically. An optional tag=True argument could later be added to FPDF.table(), but I don't think it's necessary in the initial version.

What do you think about this @MartinThoma?

MartinThoma commented 1 year ago

Wow, you're amazing :heart_eyes:

> Regarding PyMuPDF, the GitHub issue you pointed to seems to indicate that it does NOT support table data extraction.

Oops, my bad, I mistyped :see_no_evil:

> Given that none of the tools dedicated to PDF table extraction uses PDF tags / annotations to do its job, I am not sure that adding PDF tags is really worthwhile

Yes, I understand. It's a bit of a chicken-and-egg problem. Please don't forget that screen readers / accessibility solutions might use the tags as well. I think the tags were originally designed for them, but I have no real knowledge of that area.

> I don't think it's necessary in the initial version

I agree :+1: