py-pdf / fpdf2

Simple PDF generation for Python
https://py-pdf.github.io/fpdf2/
GNU Lesser General Public License v3.0

PDF/A compliance #262

Open Lucas-C opened 3 years ago

Lucas-C commented 3 years ago

I'm opening this issue to track the work needed to ensure that PDF/A-compliant documents can be generated using fpdf2.

Wikipedia page about PDF/A: https://en.wikipedia.org/wiki/PDF/A

PDF/A is an ISO-standardized version of the Portable Document Format (PDF) specialized for use in the archiving and long-term preservation of electronic documents.

My current idea would be to provide a get_pdfa_compliance() method that would return None or 'PDF/A-1' depending on several criteria:

Feedback & all contributions are welcome on this subject
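As a rough sketch of how such a method could be shaped: each criterion becomes a named boolean check, and 'PDF/A-1' is reported only when every check passes. The function below is a standalone illustration, not an existing fpdf2 method, and the check names (`fonts_embedded`, `not_encrypted`, `has_xmp_metadata`) are placeholders chosen for the example rather than the criteria list from this issue.

```python
from typing import List, Mapping, Optional


def get_pdfa_compliance(checks: Mapping[str, bool]) -> Optional[str]:
    """Return 'PDF/A-1' if every check passed, otherwise None.

    Hypothetical helper: `checks` maps a criterion name to whether the
    document satisfies it. An empty mapping is treated as "not verified".
    """
    if checks and all(checks.values()):
        return "PDF/A-1"
    return None


def failed_checks(checks: Mapping[str, bool]) -> List[str]:
    """List the names of the criteria that did not pass, sorted by name."""
    return sorted(name for name, ok in checks.items() if not ok)


# Placeholder criteria, for illustration only
example = {
    "fonts_embedded": True,
    "not_encrypted": True,
    "has_xmp_metadata": False,
}
print(get_pdfa_compliance(example))
print(failed_checks(example))
```

Returning the failed criteria alongside the verdict would let a caller see why a document does not qualify, which a bare None cannot convey.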

EconometricsBySimulation commented 1 year ago

@Lucas-C this is a great list. Sorry I did not see this topic earlier.

Here is a screenshot of the readout from Adobe's accessibility checker: [image: Adobe Screenshot]

This is generated from a pdf which has been generated using the following code:

from fpdf import FPDF
import lorem  # third-party package used for placeholder text

pdf = FPDF()

# Document-level metadata
pdf.set_title("Sample PDF")
pdf.set_lang("English")

pdf.add_page()
pdf.set_font("Arial", size=10)

# Centered heading line
pdf.cell(180, 10, txt="Welcome to a PDF generated in Python's fpdf2 package", align="C", new_y="NEXT", new_x="LMARGIN")

# Five numbered paragraphs of filler text
for i in range(5):
    pdf.multi_cell(180, 10, txt=f"{i+1}) " + lorem.paragraph(), align="L", new_y="NEXT", new_x="LMARGIN")

pdf.output("simple_demo.pdf")

In addition, according to Acrobat there needs to be

andersonhc commented 1 year ago

Can you try adding the code below to your sample code and see if you still get an error on the title? I haven't researched PDF/A much, but I suspect it requires the metadata to be provided as XMP.

pdf.set_xmp_metadata("""<x:xmpmeta xmlns:x="adobe:ns:meta/">
    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
        <rdf:Description xmlns:dc="http://purl.org/dc/elements/1.1/" rdf:about="">
            <dc:format>application/pdf</dc:format>
            <dc:title>
                <rdf:Alt>
                    <rdf:li xml:lang="x-default">Sample PDF</rdf:li>
                </rdf:Alt>
            </dc:title>
            <dc:language>
                <rdf:Bag>
                    <rdf:li>en-US</rdf:li>
                </rdf:Bag>
            </dc:language>
        </rdf:Description>
    </rdf:RDF>
</x:xmpmeta>""")
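If the title or language changes between documents, the same packet could be built from a template instead of being hard-coded. This is only a sketch: `build_xmp` and `XMP_TEMPLATE` are hypothetical names, the values are XML-escaped before substitution, and the result is parsed once to confirm it is well-formed XML. It says nothing about whether the packet satisfies PDF/A.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Same structure as the packet above, with placeholders for the two values
XMP_TEMPLATE = """<x:xmpmeta xmlns:x="adobe:ns:meta/">
    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
        <rdf:Description xmlns:dc="http://purl.org/dc/elements/1.1/" rdf:about="">
            <dc:format>application/pdf</dc:format>
            <dc:title>
                <rdf:Alt>
                    <rdf:li xml:lang="x-default">{title}</rdf:li>
                </rdf:Alt>
            </dc:title>
            <dc:language>
                <rdf:Bag>
                    <rdf:li>{lang}</rdf:li>
                </rdf:Bag>
            </dc:language>
        </rdf:Description>
    </rdf:RDF>
</x:xmpmeta>"""


def build_xmp(title: str, lang: str) -> str:
    """Fill the template with escaped values and check well-formedness."""
    xmp = XMP_TEMPLATE.format(title=escape(title), lang=escape(lang))
    ET.fromstring(xmp)  # raises xml.etree.ElementTree.ParseError if malformed
    return xmp


xmp = build_xmp("Sample PDF", "en-US")
```

The returned string could then be passed to `pdf.set_xmp_metadata(xmp)` in place of the literal packet.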

EconometricsBySimulation commented 1 year ago

I implemented the code as suggested and Adobe still flagged the title. I then "fixed" the title, and this is the metadata that pikepdf read from the resulting file:

<?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core 9.1-c001 79.2a0d8d9, 2023/03/14-11:19:46        ">
   <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
      <rdf:Description rdf:about=""
            xmlns:dc="http://purl.org/dc/elements/1.1/"
            xmlns:xmp="http://ns.adobe.com/xap/1.0/"
            xmlns:xmpMM="http://ns.adobe.com/xap/1.0/mm/">
         <dc:format>application/pdf</dc:format>
         <dc:title>
            <rdf:Alt>
               <rdf:li xml:lang="x-default">Sample PDF</rdf:li>
            </rdf:Alt>
         </dc:title>
         <dc:language>
            <rdf:Bag>
               <rdf:li>en-US</rdf:li>
            </rdf:Bag>
         </dc:language>
         <xmp:ModifyDate>2023-06-01T21:45:58-04:00</xmp:ModifyDate>
         <xmp:MetadataDate>2023-06-01T21:45:58-04:00</xmp:MetadataDate>
         <xmpMM:DocumentID>uuid:cfac003d-eb66-694a-b654-1c17c505700b</xmpMM:DocumentID>
         <xmpMM:InstanceID>uuid:2d2aa7ab-ab4f-1d42-9e8d-8019b057371d</xmpMM:InstanceID>
      </rdf:Description>
   </rdf:RDF>
</x:xmpmeta>
<?xpacket end="w"?>

I have tried using pikepdf to copy the metadata from a "corrected" PDF into a problematic one, without success.

I also opened an issue on the pikepdf repository: https://github.com/pikepdf/pikepdf/issues/469
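To see exactly which properties differ between the packet of a "corrected" file and that of a problematic one, the two XMP strings can be compared property by property. The sketch below uses only the standard library and assumes the packets have already been extracted as strings (for example with pikepdf, as above). `xmp_fields` and `xmp_diff` are hypothetical helper names.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def xmp_fields(packet: str) -> dict:
    """Map each property under rdf:Description to a list of its values.

    Keys are tags in ElementTree's "{namespace}name" notation. The
    <?xpacket ...?> processing instructions are skipped by the parser.
    """
    root = ET.fromstring(packet.strip())
    fields = {}
    for desc in root.iter(f"{{{RDF}}}Description"):
        for prop in desc:
            # Container properties (rdf:Alt, rdf:Bag) keep values in rdf:li
            items = [li.text for li in prop.iter(f"{{{RDF}}}li")]
            fields[prop.tag] = items if items else [(prop.text or "").strip()]
    return fields


def xmp_diff(packet_a: str, packet_b: str) -> dict:
    """Return {tag: (value_in_a, value_in_b)} for properties that differ."""
    a, b = xmp_fields(packet_a), xmp_fields(packet_b)
    return {
        tag: (a.get(tag), b.get(tag))
        for tag in sorted(set(a) | set(b))
        if a.get(tag) != b.get(tag)
    }
```

A property present in only one packet shows up with None on the other side, which makes additions such as xmp:ModifyDate or xmpMM:DocumentID easy to spot.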