Real Content and Artifact Tags for Accessibility

EconometricsBySimulation commented 1 year ago

Dear Creators,

FPDF2 is awesome! Thank you very much!

I am attempting to generate PDFs that are accessible, which in US Federal Government terms means 508 compliant.

I am getting one major issue using the current version of FPDF2. Using Adobe Acrobat Pro to check Accessibility:

I get the error: "Tagged contents - Failed" Element 1 Element 2 etc.

These appear basically every time I use the pdf.cell() function. Apparently these values need to be tagged as either "Artifacts" or some kind of real content. I am not super familiar with tags, but I looked at this website for reference: https://accessible-pdf.info/basics/general/overview-of-the-pdf-tags

Outside of that there are two things that don't quite work as expected:

The "Title", despite being set with set_title(), is not recognized by Acrobat by default. When I click "Fix", Acrobat corrects it with no issue and without asking me for a title, so I think it is finding something. But it is not quite right.
It would be very helpful to be able to set the "tab order". This is something I can right-click and "Fix" in Acrobat, like the title, and that seems to work.

The tagging feature seems to be a necessary addition for accessibility, since Adobe's auto-tagging does not seem to work very well. The other two improvements are needed for large-volume document generation (which is what I would like to do). I am very impressed with FPDF2 and very grateful, though the lack of these minor but important features will limit my ability to use the package in my regular workflow.

Thank you so much! Francis

Lucas-C commented 1 year ago

Hi @EconometricsBySimulation!

Thank you for reaching out, and sorry for the delay to answer.

fpdf2 already implements tagging and adds several tags to the elements it generates in PDF documents. However, fpdf2 probably does not honor all the accessibility criteria of Adobe Acrobat Pro.

We already have some issues regarding fpdf2 compliance with some known PDF standards:

@EconometricsBySimulation: would you like to help regarding this subject?

Would you like to contribute to fpdf2 regarding this?

Or else, could you help by sharing with us the smallest PDF you can produce (with the associated Python code) that raises this warning, with as much detail as possible on this warning?

EconometricsBySimulation commented 1 year ago

Hi Lucas,

I would be happy to contribute where possible. I admit though, the PDF codex seems pretty hard to access. Do you have suggestions as to which part of the code I should look at to begin addressing some of these issues?

In particular, I am most concerned with addressing the "element tagging" issue. This seems to be the most difficult to address post pdf generation as a single pdf may have dozens or hundreds of elements (cells).

If there can be a simple method to simply auto-tag cells as "real" content - whatever the opposite of "artifact" is, this could go a long way.

I am not sure why the Acrobat does not like the pdf.set_title() method.

Interestingly when I attempt to read the metadata using pikepdf it reports that there is no metadata. Does this mean that the set_title method is writing to a place that is not the metadata?

See simple pdf code generated from:

from fpdf import FPDF
import lorem

pdf = FPDF()

pdf.set_title("Sample PDF")
pdf.set_lang("English")

pdf.add_page()
pdf.set_font("Arial", size=10)

pdf.cell(180, 10, txt="Welcome to a PDF generated in Python's fpdf2 package",
         align="C", new_y="NEXT", new_x="LMARGIN")

for i in range(5):
    print(i)
    pdf.multi_cell(180, 10, txt=f"{i+1}) " + lorem.paragraph(), align="L",
                   new_y="NEXT", new_x="LMARGIN")

pdf.output("simple_demo.pdf")

Now let's check the metadata:

import pikepdf
pdf = pikepdf.open("simple_demo.pdf", allow_overwriting_input=True)
print(pdf.Root.Metadata.read_bytes().decode())

Lucas-C commented 1 year ago

Oh interesting, today I learned that Markdown is not rendered in GitHub comments sent by email.

I am not sure why the Acrobat does not like the pdf.set_title() method.

Interestingly when I attempt to read the metadata using pikepdf it reports that there is no metadata. Does this mean that the set_title method is writing to a place that is not the metadata?

By the way, have you checked our documentation on PDF metadata? https://pyfpdf.github.io/fpdf2/Metadata.html It will help you understand why set_title() sets the document title in an "old-fashioned way", while the Adobe accessibility checker probably expects XML metadata.
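To make the distinction concrete, here is a minimal sketch of supplying the title as XMP (XML) metadata rather than through set_title(). The helper names build_xmp_title and write_demo are hypothetical and only for illustration. The sketch assumes that set_xmp_metadata() accepts the XMP body and that fpdf2 adds the surrounding xpacket wrapper itself, so please check that against the Metadata documentation. I have not verified that this alone satisfies the Acrobat checker.

```python
from xml.sax.saxutils import escape


def build_xmp_title(title: str, lang: str = "x-default") -> str:
    """Return an XMP body that carries the document title as dc:title.

    Hypothetical helper, not part of fpdf2. No <?xpacket?> wrapper is
    emitted here (assumption: fpdf2 adds that wrapper on output).
    """
    return (
        '<x:xmpmeta xmlns:x="adobe:ns:meta/">'
        '<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">'
        '<rdf:Description rdf:about="" '
        'xmlns:dc="http://purl.org/dc/elements/1.1/">'
        "<dc:title><rdf:Alt>"
        f'<rdf:li xml:lang="{lang}">{escape(title)}</rdf:li>'
        "</rdf:Alt></dc:title>"
        "</rdf:Description>"
        "</rdf:RDF>"
        "</x:xmpmeta>"
    )


def write_demo(path: str = "xmp_demo.pdf") -> None:
    """Write a one-page PDF whose title is stored as XMP metadata."""
    from fpdf import FPDF  # imported lazily so the helper above needs no fpdf2

    pdf = FPDF()
    pdf.set_xmp_metadata(build_xmp_title("Sample PDF"))
    pdf.add_page()
    pdf.output(path)
```

Calling write_demo() and then re-running the pikepdf check from the earlier comment should show whether a /Metadata stream is now present in the document catalog.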

If there can be a simple method to simply auto-tag cells as "real" content - whatever the opposite of "artifact" is, this could go a long way.

Good question, but I don't have a quick & easy answer...

In the spec ( https://opensource.adobe.com/dc-acrobat-sdk-docs/standards/pdfstandards/pdf/PDF32000_2008.pdf ), at the beginning of section "14.8 Tagged PDF", we find this:

Tagged PDF defines a set of rules for representing text in the page content so that characters, words, and text order can be determined reliably. All text shall be represented in a form that can be converted to Unicode. Word breaks shall be represented explicitly. Actual content shall be distinguished from artifacts of layout and pagination. Content shall be given in an order related to its appearance on the page, as determined by the conforming writer.

And then in section 14.8.2.2 Real Content and Artifacts:

Artifacts are graphics objects that are not part of the author’s original content but rather are generated by the conforming writer in the course of pagination, layout, or other strictly mechanical processes.

Now it is technically relatively easy to add /Artifact BDC ... EMC surrounding information to elements in a content stream, but the main question, for me, remains: on what elements exactly should we add those tags in order to please the accessibility checker?
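For illustration, here is a rough sketch of what such wrapping looks like at the content-stream level. The helper names mark_artifact and mark_real_content are hypothetical and not part of fpdf2. It uses the plain /Artifact BMC ... EMC form for artifacts (BDC is the variant that also takes a property list), and a BDC with an /MCID entry for real content. Note that marking real content this way is only half the job: a structure element in the document's structure tree must also reference that MCID, which is not shown here.

```python
def mark_artifact(ops: str) -> str:
    """Wrap content-stream operators so they are flagged as an artifact.

    Hypothetical helper: uses the BMC form, which takes a tag only.
    """
    return f"/Artifact BMC\n{ops}\nEMC"


def mark_real_content(ops: str, tag: str, mcid: int) -> str:
    """Wrap operators as real content with a marked-content identifier.

    Hypothetical helper: `tag` is a structure type such as "P", and
    `mcid` must match an entry in the structure tree (not built here).
    """
    if mcid < 0:
        raise ValueError("mcid must be a non-negative integer")
    return f"/{tag} <</MCID {mcid}>> BDC\n{ops}\nEMC"


# A paragraph of body text is real content; a page-number footer is
# typically treated as a pagination artifact.
paragraph = mark_real_content("BT /F1 10 Tf 72 720 Td (Hello) Tj ET", "P", 0)
footer = mark_artifact("BT /F1 8 Tf 290 30 Td (Page 1) Tj ET")
print(paragraph)
print(footer)
```

The open question from the paragraph above remains: deciding which elements fpdf2 should treat as artifacts and which as real content is the hard part, not emitting the operators.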

py-pdf / fpdf2

Real Content and Artifact Tags for Accessibility #792