Open EconometricsBySimulation opened 1 year ago
Hi @EconometricsBySimulation!
Thank you for reaching out, and sorry for the delay in answering.
fpdf2 already implements tagging and adds several tags to the elements it generates in PDF documents.
However, fpdf2 probably does not honor all the accessibility criteria of Adobe Acrobat Pro.
We already have some issues regarding fpdf2 compliance with some known PDF standards:
@EconometricsBySimulation: would you like to help regarding this subject?
Would you like to contribute to fpdf2 regarding this?
Or else, could you help by sharing with us the smallest possible PDF that you can produce (with the associated Python code) that raises this warning, with as many details as possible about the warning?
Hi Lucas,
I would be happy to contribute where possible. I admit, though, that the PDF codex seems pretty hard to access. Do you have suggestions as to which part of the code I should look at to begin addressing some of these issues?
In particular, I am most concerned with addressing the "element tagging" issue. This seems to be the most difficult to address post pdf generation as a single pdf may have dozens or hundreds of elements (cells).
If there can be a simple method to simply auto-tag cells as "real" content - whatever the opposite of "artifact" is, this could go a long way.
I am not sure why the Acrobat does not like the pdf.set_title() method.
Interestingly when I attempt to read the metadata using pikepdf it reports that there is no metadata. Does this mean that the set_title method is writing to a place that is not the metadata?
See the simple PDF generated by the following code:
from fpdf import FPDF
import lorem

pdf = FPDF()
pdf.set_title("Sample PDF")
pdf.set_lang("English")
pdf.add_page()
pdf.set_font("Arial", size=10)
pdf.cell(180, 10, txt="Welcome to a PDF generated in Python's fpdf2 package", align="C", new_y="NEXT", new_x="LMARGIN")
for i in range(5):
    print(i)
    pdf.multi_cell(180, 10, txt=f"{i+1}) " + lorem.paragraph(), align="L", new_y="NEXT", new_x="LMARGIN")
pdf.output("simple_demo.pdf")
Now let's check the metadata:
import pikepdf
pdf = pikepdf.open("simple_demo.pdf", allow_overwriting_input=True)
print(pdf.Root.Metadata.read_bytes().decode())
Oh interesting, today I learned that Markdown is not rendered in GitHub comments sent by email.
I am not sure why the Acrobat does not like the pdf.set_title() method. Interestingly when I attempt to read the metadata using pikepdf it reports that there is no metadata. Does this mean that the set_title method is writing to a place that is not the metadata?
By the way, have you checked our documentation on PDF metadata?
https://pyfpdf.github.io/fpdf2/Metadata.html
It will help you understand why set_title() sets the document title in an "old-fashioned way", whereas the Adobe accessibility checker probably expects XML metadata.
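For illustration, here is a minimal sketch of what such XML (XMP) metadata can look like, built with Python's standard library only. The helper name build_xmp_title is made up for this example, and the packet carries nothing but a Dublin Core title; which properties the Acrobat checker actually requires is an open question here.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"
XML_NS = "http://www.w3.org/XML/1998/namespace"


def build_xmp_title(title: str) -> str:
    """Return an rdf:RDF fragment holding a single dc:title entry.

    Hypothetical helper for illustration; it does not add the
    <?xpacket ...?> wrapper that surrounds XMP inside a PDF.
    """
    ET.register_namespace("rdf", RDF_NS)
    ET.register_namespace("dc", DC_NS)
    rdf = ET.Element(f"{{{RDF_NS}}}RDF")
    description = ET.SubElement(
        rdf, f"{{{RDF_NS}}}Description", {f"{{{RDF_NS}}}about": ""}
    )
    dc_title = ET.SubElement(description, f"{{{DC_NS}}}title")
    # XMP stores the title as a language alternative array
    alt = ET.SubElement(dc_title, f"{{{RDF_NS}}}Alt")
    item = ET.SubElement(alt, f"{{{RDF_NS}}}li", {f"{{{XML_NS}}}lang": "x-default"})
    item.text = title
    return ET.tostring(rdf, encoding="unicode")


xmp = build_xmp_title("Sample PDF")
print(xmp)
```

As far as I know, a string like this can be passed to fpdf2's set_xmp_metadata() method, which adds the surrounding xpacket wrapper itself; the Metadata page linked above is the reference to check. The title written by set_title(), by contrast, ends up in the document information dictionary, which pikepdf exposes as pdf.docinfo rather than pdf.Root.Metadata.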
If there can be a simple method to simply auto-tag cells as "real" content - whatever the opposite of "artifact" is, this could go a long way.
Good question, but I don't have a quick & easy answer...
In the spec ( https://opensource.adobe.com/dc-acrobat-sdk-docs/standards/pdfstandards/pdf/PDF32000_2008.pdf ), at the beginning of section "14.8 Tagged PDF", we find this:
Tagged PDF defines a set of rules for representing text in the page content so that characters, words, and text order can be determined reliably. All text shall be represented in a form that can be converted to Unicode. Word breaks shall be represented explicitly. Actual content shall be distinguished from artifacts of layout and pagination. Content shall be given in an order related to its appearance on the page, as determined by the conforming writer.
And then in section 14.8.2.2 Real Content and Artifacts:
Artifacts are graphics objects that are not part of the author’s original content but rather are generated by the conforming writer in the course of pagination, layout, or other strictly mechanical processes.
Now it is technically relatively easy to add /Artifact BDC ... EMC surrounding information to elements in a content stream, but the main question, for me, remains: on what elements exactly should we add those tags in order to please the accessibility checker?
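To make the marking itself concrete, here is a rough sketch in plain Python that wraps content-stream operators in marked-content sequences. Both helper names and the operator fragments are hypothetical and this is not how fpdf2 is implemented. Note that the spec writes the property-less artifact form as /Artifact BMC ... EMC, while BDC is the variant that carries a property list, for example an /MCID for real content.

```python
def wrap_artifact(ops: str) -> str:
    """Mark content-stream operators as a layout or pagination artifact."""
    return f"/Artifact BMC\n{ops}\nEMC"


def wrap_real_content(ops: str, tag: str, mcid: int) -> str:
    """Mark operators as real content with structure type `tag`.

    The MCID only becomes meaningful once a structure element in the
    document's structure tree refers to it; that part is not shown here.
    """
    return f"/{tag} <</MCID {mcid}>> BDC\n{ops}\nEMC"


# Hypothetical operator fragments, written by hand for illustration
paragraph = "BT /F1 10 Tf 28 800 Td (Welcome to a PDF) Tj ET"
page_number = "BT /F1 8 Tf 280 20 Td (Page 1) Tj ET"

stream = "\n".join(
    [
        wrap_real_content(paragraph, tag="P", mcid=0),  # author's content
        wrap_artifact(page_number),  # generated during pagination
    ]
)
print(stream)
```

The sketch only covers the content-stream side. Deciding which elements count as real content, and linking each MCID to an entry in the structure tree, is the part that the question above leaves open.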
Dear Creators,
FPDF2 is awesome! Thank you very much!
I am attempting to generate PDFs that are accessible, which in US Federal Government terms means Section 508 compliant.
I am running into one major issue with the current version of FPDF2 when checking accessibility with Adobe Acrobat Pro:
These are basically every time I used the pdf.cell() function. Apparently these values need to be tagged as either "Artifacts" or some kind of real content. I am not super familiar with tags, but I looked at this website for reference: https://accessible-pdf.info/basics/general/overview-of-the-pdf-tags
Outside of that, there are two things that don't quite work as expected:
The tagging feature seems to be a necessary addition for accessibility, since Adobe auto-tagging does not seem to work very well. The other two improvements are needed for large-volume document generation (which is what I would like to do). I am very impressed with FPDF2 and am very grateful, though these minor but important features will limit my ability to implement the package in my regular workflow.
Thank you so much! Francis