PDF/A compliance - Githubissues

Lucas-C commented 3 years ago

I'm opening this issue to track the work needed to ensure PDF/A-compliant documents can be generated using fpdf2.

Wikipedia page about PDF/A: https://en.wikipedia.org/wiki/PDF/A

PDF/A is an ISO-standardized version of the Portable Document Format (PDF) specialized for use in the archiving and long-term preservation of electronic documents.

My current idea would be to provide a get_pdfa_compliance() method that would return None or 'PDF/A-1' depending on several criteria:

document language is set
all images have alternate descriptions
no encryption is used
document has an author, title, creation date, and source program name in XMP metadata
not pdf.allow_images_transparency

the XMP metadata specifies the PDF/A level:

<rdf:Description rdf:about="" 
xmlns:pdfaExtension="http://www.aiim.org/pdfa/ns/extension/" 
xmlns:pdfaSchema="http://www.aiim.org/pdfa/ns/schema#" 
xmlns:pdfaProperty="http://www.aiim.org/pdfa/ns/property#" 
xmlns:pdfaType="http://www.aiim.org/pdfa/ns/type#" 
xmlns:pdfaField="http://www.aiim.org/pdfa/ns/field#" >
...
<rdf:Description rdf:about="" xmlns:pdfaid="http://www.aiim.org/pdfa/ns/id/">
 <pdfaid:part>1</pdfaid:part>
 <pdfaid:conformance>A</pdfaid:conformance>
</rdf:Description>
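The criteria above could be sketched as a standalone check. This is only an illustration under stated assumptions: `DocumentFacts`, `ImageFacts`, and the XMP key names are hypothetical stand-ins for whatever state fpdf2 tracks internally, and `get_pdfa_compliance()` is written here as a free function, not an existing fpdf2 method. It covers only the criteria listed in this issue, not full ISO 19005-1 conformance.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ImageFacts:
    """Hypothetical record for one embedded image."""
    alt_text: Optional[str] = None


@dataclass
class DocumentFacts:
    """Hypothetical snapshot of the document properties listed above."""
    lang: Optional[str] = None
    encrypted: bool = False
    allow_images_transparency: bool = False
    # XMP metadata values, keyed by short names chosen for this sketch
    xmp: Dict[str, str] = field(default_factory=dict)
    images: List[ImageFacts] = field(default_factory=list)


# Metadata entries the list above asks for (key names are assumptions)
REQUIRED_XMP_KEYS = ("author", "title", "creation_date", "source_program")


def get_pdfa_compliance(doc: DocumentFacts) -> Optional[str]:
    """Return 'PDF/A-1' if every listed criterion holds, else None."""
    if not doc.lang:
        return None
    if doc.encrypted or doc.allow_images_transparency:
        return None
    if any(not image.alt_text for image in doc.images):
        return None
    if any(not doc.xmp.get(key) for key in REQUIRED_XMP_KEYS):
        return None
    # The XMP packet must also declare the PDF/A part (pdfaid:part)
    if doc.xmp.get("pdfa_part") != "1":
        return None
    return "PDF/A-1"
```

Returning None as soon as one criterion fails keeps the sketch short; a real implementation would probably want to report which criterion failed.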

Feedback & all contributions are welcome on this subject

EconometricsBySimulation commented 1 year ago

@Lucas-C this is a great list. Sorry I did not see this topic earlier.

Here is a screenshot of the readout from Adobe's accessibility checker: [image: Adobe Screenshot]

This readout is for a PDF generated with the following code:

from fpdf import FPDF
import lorem  # third-party package that generates placeholder text

pdf = FPDF()

# Document title and language
pdf.set_title("Sample PDF")
pdf.set_lang("English")

pdf.add_page()
pdf.set_font("Arial", size=10)

# Centered heading, then move to the left margin of the next line
pdf.cell(180, 10, txt="Welcome to a PDF generated in Python's fpdf2 package", align="C", new_y="NEXT", new_x="LMARGIN")

# Five numbered paragraphs of filler text
for i in range(5):
    print(i)
    pdf.multi_cell(180, 10, txt=f"{i+1}) " + lorem.paragraph(), align="L", new_y="NEXT", new_x="LMARGIN")

pdf.output("simple_demo.pdf")

In addition, according to Acrobat, the following are needed:

Tab order set.
Elements tagged. As far as I understand, this requires that elements be tagged as specific types of non-artifact content, though I am not sure I understand it exactly.

andersonhc commented 1 year ago

Can you try adding the code below to your sample code and see if you still get an error on the title? I haven't researched PDF/A much, but I suspect it requires the metadata to be provided as XMP.

pdf.set_xmp_metadata("""<x:xmpmeta xmlns:x="adobe:ns:meta/">
    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
        <rdf:Description xmlns:dc="http://purl.org/dc/elements/1.1/" rdf:about="">
            <dc:format>application/pdf</dc:format>
            <dc:title>
                <rdf:Alt>
                    <rdf:li xml:lang="x-default">Sample PDF</rdf:li>
                </rdf:Alt>
            </dc:title>
            <dc:language>
                <rdf:Bag>
                    <rdf:li>en-US</rdf:li>
                </rdf:Bag>
            </dc:language>
        </rdf:Description>
    </rdf:RDF>
</x:xmpmeta>""")
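If the title or language comes from variables, the same packet can be assembled with escaping so that characters such as `&` do not break the XML. A minimal sketch, assuming a hypothetical helper named `build_xmp()` (it is not part of fpdf2); its result would be passed to `pdf.set_xmp_metadata()` exactly like the literal string above:

```python
from xml.sax.saxutils import escape


def build_xmp(title: str, lang: str = "en-US") -> str:
    """Build the XMP structure shown above with XML-escaped values."""
    return f"""<x:xmpmeta xmlns:x="adobe:ns:meta/">
    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
        <rdf:Description xmlns:dc="http://purl.org/dc/elements/1.1/" rdf:about="">
            <dc:format>application/pdf</dc:format>
            <dc:title>
                <rdf:Alt>
                    <rdf:li xml:lang="x-default">{escape(title)}</rdf:li>
                </rdf:Alt>
            </dc:title>
            <dc:language>
                <rdf:Bag>
                    <rdf:li>{escape(lang)}</rdf:li>
                </rdf:Bag>
            </dc:language>
        </rdf:Description>
    </rdf:RDF>
</x:xmpmeta>"""


# Intended use with the sample code (assumes `pdf` is the FPDF instance):
# pdf.set_xmp_metadata(build_xmp("Sample PDF"))
```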

EconometricsBySimulation commented 1 year ago

I implemented the code as suggested and Adobe still flagged it. I "fixed" the Title and this is the metadata that pikepdf read from it:

<?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core 9.1-c001 79.2a0d8d9, 2023/03/14-11:19:46        ">
   <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
      <rdf:Description rdf:about=""
            xmlns:dc="http://purl.org/dc/elements/1.1/"
            xmlns:xmp="http://ns.adobe.com/xap/1.0/"
            xmlns:xmpMM="http://ns.adobe.com/xap/1.0/mm/">
         <dc:format>application/pdf</dc:format>
         <dc:title>
            <rdf:Alt>
               <rdf:li xml:lang="x-default">Sample PDF</rdf:li>
            </rdf:Alt>
         </dc:title>
         <dc:language>
            <rdf:Bag>
               <rdf:li>en-US</rdf:li>
            </rdf:Bag>
         </dc:language>
         <xmp:ModifyDate>2023-06-01T21:45:58-04:00</xmp:ModifyDate>
         <xmp:MetadataDate>2023-06-01T21:45:58-04:00</xmp:MetadataDate>
         <xmpMM:DocumentID>uuid:cfac003d-eb66-694a-b654-1c17c505700b</xmpMM:DocumentID>
         <xmpMM:InstanceID>uuid:2d2aa7ab-ab4f-1d42-9e8d-8019b057371d</xmpMM:InstanceID>
      </rdf:Description>
   </rdf:RDF>
</x:xmpmeta>
<?xpacket end="w"?>
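The packet above contains no `pdfaid` entries, so it does not declare a PDF/A level. A minimal sketch of how one might check for that declaration using only the standard library; `read_pdfa_level()` is a hypothetical helper name, and it accepts both the element form shown earlier in this issue and the equivalent attribute shorthand on `rdf:Description`:

```python
import xml.etree.ElementTree as ET
from typing import Optional, Tuple

PDFAID_NS = "http://www.aiim.org/pdfa/ns/id/"
PART = f"{{{PDFAID_NS}}}part"
CONFORMANCE = f"{{{PDFAID_NS}}}conformance"


def read_pdfa_level(xmp_packet: str) -> Optional[Tuple[str, str]]:
    """Return (part, conformance) declared in an XMP packet, or None if absent."""
    root = ET.fromstring(xmp_packet)
    part = conformance = None
    for element in root.iter():
        # Element form: <pdfaid:part>1</pdfaid:part>
        if element.tag == PART:
            part = (element.text or "").strip()
        elif element.tag == CONFORMANCE:
            conformance = (element.text or "").strip()
        # Attribute form: <rdf:Description pdfaid:part="1" ...>
        part = element.get(PART, part)
        conformance = element.get(CONFORMANCE, conformance)
    if part and conformance:
        return part, conformance
    return None
```

The string returned by pikepdf for the document's metadata could be passed to this function directly; a result of None would mean the PDF/A identification is missing.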

I have tried to copy the metadata from a "corrected" PDF to a problematic PDF using pikepdf, without success.

I also opened an issue on pikepdf: https://github.com/pikepdf/pikepdf/issues/469

py-pdf / fpdf2

PDF/A compliance #262