Open MartinThoma opened 2 years ago
The metadata section that is missing looks like this in the original file:
1 0 obj
<< /Metadata 3 0 R /Outlines 4 0 R /OutputIntents [ << /DestOutputProfile 5 0 R /Info (ISO Coated v2 \(ECI\)) /OutputConditionIdentifier (ISO Coated v2 \(ECI\)) /RegistryName (http://www.color.org) /S /GTS_PDFX /Type /OutputIntent >> << /DestOutputProfile 5 0 R /Info (ISO Coated v2 \(ECI\)) /OutputConditionIdentifier (ISO Coated v2 \(ECI\)) /RegistryName (http://www.color.org) /S /GTS_PDFA1 /Type /OutputIntent >> ] /PageLabels 6 0 R /Pages 7 0 R /Type /Catalog /ViewerPreferences << /Direction /L2R >> >>
endobj
2 0 obj
<< /Author (PDF/A Competence Center) /CreationDate (D:20110818145925+02'00') /Creator (Adobe InDesign CS5 \(7.0.4\)) /GTS_PDFXConformance (PDF/X-1a:2003) /GTS_PDFXVersion (PDF/X-1a:2003) /ModDate (D:20110818150035+02'00') /Producer (Adobe PDF Library 9.9) /Title (PDF/A in a Nutshell) /Trapped /False >>
endobj
3 0 obj
<< /Subtype /XML /Type /Metadata /Length 12889 >>
stream
<?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core 4.2.1-c043 52.372728, 2009/01/18-15:56:37 ">
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Description rdf:about=""
xmlns:xmpMM="http://ns.adobe.com/xap/1.0/mm/"
xmlns:stEvt="http://ns.adobe.com/xap/1.0/sType/ResourceEvent#">
<xmpMM:InstanceID>uuid:2b5a9f85-f518-7b4a-a756-79898ed7b891</xmpMM:InstanceID>
<xmpMM:OriginalDocumentID>adobe:docid:indd:fbe35371-5d32-11dc-b86a-8404c5e05271</xmpMM:OriginalDocumentID>
<xmpMM:DocumentID>adobe:docid:indd:fbe35371-5d32-11dc-b86a-8404c5e05271</xmpMM:DocumentID>
<xmpMM:RenditionClass>proof:pdf</xmpMM:RenditionClass>
<xmpMM:VersionID>1</xmpMM:VersionID>
<xmpMM:History>
<rdf:Seq>
<rdf:li rdf:parseType="Resource">
<stEvt:action>converted</stEvt:action>
<stEvt:instanceID>uuid:d6a6d0f6-aead-f743-bda5-13d4cf5c3c11</stEvt:instanceID>
<stEvt:parameters>converted to PDF/A-1b</stEvt:parameters>
<stEvt:softwareAgent>pdfaPilot</stEvt:softwareAgent>
<stEvt:when>2011-08-18T15:00:32+02:00</stEvt:when>
</rdf:li>
</rdf:Seq>
</xmpMM:History>
</rdf:Description>
<rdf:Description rdf:about=""
xmlns:xmp="http://ns.adobe.com/xap/1.0/">
<xmp:CreatorTool>Adobe InDesign CS5 (7.0.4)</xmp:CreatorTool>
<xmp:CreateDate>2011-08-18T14:59:25+02:00</xmp:CreateDate>
<xmp:ModifyDate>2011-08-18T15:00:35+02:00</xmp:ModifyDate>
<xmp:MetadataDate>2011-08-18T15:00:35+02:00</xmp:MetadataDate>
</rdf:Description>
<rdf:Description rdf:about=""
xmlns:dc="http://purl.org/dc/elements/1.1/">
<dc:format>application/pdf</dc:format>
<dc:title>
<rdf:Alt>
<rdf:li xml:lang="x-default">PDF/A in a Nutshell</rdf:li>
</rdf:Alt>
</dc:title>
<dc:creator>
<rdf:Seq>
<rdf:li>PDF/A Competence Center</rdf:li>
</rdf:Seq>
</dc:creator>
</rdf:Description>
<rdf:Description rdf:about=""
xmlns:pdf="http://ns.adobe.com/pdf/1.3/">
<pdf:Producer>Adobe PDF Library 9.9</pdf:Producer>
<pdf:Trapped>False</pdf:Trapped>
</rdf:Description>
<rdf:Description rdf:about=""
xmlns:pdfxid="http://www.npes.org/pdfx/ns/id/">
<pdfxid:GTS_PDFXVersion>PDF/X-1a:2003</pdfxid:GTS_PDFXVersion>
</rdf:Description>
<rdf:Description rdf:about=""
xmlns:pdfx="http://ns.adobe.com/pdfx/1.3/">
<pdfx:GTS_PDFXVersion>PDF/X-1a:2003</pdfx:GTS_PDFXVersion>
<pdfx:GTS_PDFXConformance>PDF/X-1a:2003</pdfx:GTS_PDFXConformance>
</rdf:Description>
<rdf:Description rdf:about=""
xmlns:pdfaid="http://www.aiim.org/pdfa/ns/id/">
<pdfaid:part>1</pdfaid:part>
<pdfaid:conformance>B</pdfaid:conformance>
</rdf:Description>
<rdf:Description rdf:about=""
xmlns:pdfaExtension="http://www.aiim.org/pdfa/ns/extension/"
xmlns:pdfaSchema="http://www.aiim.org/pdfa/ns/schema#"
xmlns:pdfaProperty="http://www.aiim.org/pdfa/ns/property#">
<pdfaExtension:schemas>
<rdf:Bag>
<rdf:li rdf:parseType="Resource">
<pdfaSchema:namespaceURI>http://ns.adobe.com/pdf/1.3/</pdfaSchema:namespaceURI>
<pdfaSchema:prefix>pdf</pdfaSchema:prefix>
<pdfaSchema:schema>Adobe PDF</pdfaSchema:schema>
<pdfaSchema:property>
<rdf:Seq>
<rdf:li rdf:parseType="Resource">
<pdfaProperty:category>internal</pdfaProperty:category>
<pdfaProperty:description>A name object indicating whether the document has been modified to include trapping information</pdfaProperty:description>
<pdfaProperty:name>Trapped</pdfaProperty:name>
<pdfaProperty:valueType>Text</pdfaProperty:valueType>
</rdf:li>
</rdf:Seq>
</pdfaSchema:property>
</rdf:li>
<rdf:li rdf:parseType="Resource">
<pdfaSchema:namespaceURI>http://ns.adobe.com/pdfx/1.3/</pdfaSchema:namespaceURI>
<pdfaSchema:prefix>pdfx</pdfaSchema:prefix>
<pdfaSchema:schema>PDF/X ID Schema</pdfaSchema:schema>
<pdfaSchema:property>
<rdf:Seq>
<rdf:li rdf:parseType="Resource">
<pdfaProperty:category>internal</pdfaProperty:category>
<pdfaProperty:description>ID of PDF/X standard</pdfaProperty:description>
<pdfaProperty:name>GTS_PDFXVersion</pdfaProperty:name>
<pdfaProperty:valueType>Text</pdfaProperty:valueType>
</rdf:li>
<rdf:li rdf:parseType="Resource">
<pdfaProperty:category>internal</pdfaProperty:category>
<pdfaProperty:description>Conformance level of PDF/X standard</pdfaProperty:description>
<pdfaProperty:name>GTS_PDFXConformance</pdfaProperty:name>
<pdfaProperty:valueType>Text</pdfaProperty:valueType>
</rdf:li>
<rdf:li rdf:parseType="Resource">
<pdfaProperty:category>internal</pdfaProperty:category>
<pdfaProperty:description>Company creating the PDF</pdfaProperty:description>
<pdfaProperty:name>Company</pdfaProperty:name>
<pdfaProperty:valueType>Text</pdfaProperty:valueType>
</rdf:li>
<rdf:li rdf:parseType="Resource">
<pdfaProperty:category>internal</pdfaProperty:category>
<pdfaProperty:description>Date when document was last modified</pdfaProperty:description>
<pdfaProperty:name>SourceModified</pdfaProperty:name>
<pdfaProperty:valueType>Text</pdfaProperty:valueType>
</rdf:li>
</rdf:Seq>
</pdfaSchema:property>
</rdf:li>
<rdf:li rdf:parseType="Resource">
<pdfaSchema:namespaceURI>http://ns.adobe.com/xap/1.0/mm/</pdfaSchema:namespaceURI>
<pdfaSchema:prefix>xmpMM</pdfaSchema:prefix>
<pdfaSchema:schema>XMP Media Management</pdfaSchema:schema>
<pdfaSchema:property>
<rdf:Seq>
<rdf:li rdf:parseType="Resource">
<pdfaProperty:category>internal</pdfaProperty:category>
<pdfaProperty:description>UUID based identifier for specific incarnation of a document</pdfaProperty:description>
<pdfaProperty:name>InstanceID</pdfaProperty:name>
<pdfaProperty:valueType>URI</pdfaProperty:valueType>
</rdf:li>
<rdf:li rdf:parseType="Resource">
<pdfaProperty:category>internal</pdfaProperty:category>
<pdfaProperty:description>The common identifier for all versions and renditions of a document.</pdfaProperty:description>
<pdfaProperty:name>OriginalDocumentID</pdfaProperty:name>
<pdfaProperty:valueType>URI</pdfaProperty:valueType>
</rdf:li>
</rdf:Seq>
</pdfaSchema:property>
</rdf:li>
<rdf:li rdf:parseType="Resource">
<pdfaSchema:namespaceURI>http://www.aiim.org/pdfa/ns/id/</pdfaSchema:namespaceURI>
<pdfaSchema:prefix>pdfaid</pdfaSchema:prefix>
<pdfaSchema:schema>PDF/A ID Schema</pdfaSchema:schema>
<pdfaSchema:property>
<rdf:Seq>
<rdf:li rdf:parseType="Resource">
<pdfaProperty:category>internal</pdfaProperty:category>
<pdfaProperty:description>Part of PDF/A standard</pdfaProperty:description>
<pdfaProperty:name>part</pdfaProperty:name>
<pdfaProperty:valueType>Integer</pdfaProperty:valueType>
</rdf:li>
<rdf:li rdf:parseType="Resource">
<pdfaProperty:category>internal</pdfaProperty:category>
<pdfaProperty:description>Amendment of PDF/A standard</pdfaProperty:description>
<pdfaProperty:name>amd</pdfaProperty:name>
<pdfaProperty:valueType>Text</pdfaProperty:valueType>
</rdf:li>
<rdf:li rdf:parseType="Resource">
<pdfaProperty:category>internal</pdfaProperty:category>
<pdfaProperty:description>Conformance level of PDF/A standard</pdfaProperty:description>
<pdfaProperty:name>conformance</pdfaProperty:name>
<pdfaProperty:valueType>Text</pdfaProperty:valueType>
</rdf:li>
</rdf:Seq>
</pdfaSchema:property>
</rdf:li>
<rdf:li rdf:parseType="Resource">
<pdfaSchema:namespaceURI>http://www.npes.org/pdfx/ns/id/</pdfaSchema:namespaceURI>
<pdfaSchema:prefix>pdfxid</pdfaSchema:prefix>
<pdfaSchema:schema>PDF/X ID Schema</pdfaSchema:schema>
<pdfaSchema:property>
<rdf:Seq>
<rdf:li rdf:parseType="Resource">
<pdfaProperty:category>internal</pdfaProperty:category>
<pdfaProperty:description>ID of PDF/X standard</pdfaProperty:description>
<pdfaProperty:name>GTS_PDFXVersion</pdfaProperty:name>
<pdfaProperty:valueType>Text</pdfaProperty:valueType>
</rdf:li>
</rdf:Seq>
</pdfaSchema:property>
</rdf:li>
</rdf:Bag>
</pdfaExtension:schemas>
</rdf:Description>
</rdf:RDF>
</x:xmpmeta>
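The PDF/A claim in the packet above sits in the `pdfaid:part` and `pdfaid:conformance` elements. A minimal sketch for reading those two values from an XMP string with the standard library follows. The function name `read_pdfa_claim` and the shortened `SAMPLE_XMP` are illustrative assumptions, not part of pypdf, and the sketch only handles the element form shown above (not `pdfaid` values written as attributes).

```python
import xml.etree.ElementTree as ET

# Namespace URI taken from the XMP packet quoted above.
PDFAID_NS = "http://www.aiim.org/pdfa/ns/id/"

# Shortened stand-in for the real packet (assumption: same element layout).
SAMPLE_XMP = """<x:xmpmeta xmlns:x="adobe:ns:meta/">
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Description rdf:about="" xmlns:pdfaid="http://www.aiim.org/pdfa/ns/id/">
<pdfaid:part>1</pdfaid:part>
<pdfaid:conformance>B</pdfaid:conformance>
</rdf:Description>
</rdf:RDF>
</x:xmpmeta>"""


def read_pdfa_claim(xmp_text):
    """Return (part, conformance) as strings, or None for a missing element."""
    root = ET.fromstring(xmp_text)
    part = root.find(f".//{{{PDFAID_NS}}}part")
    conformance = root.find(f".//{{{PDFAID_NS}}}conformance")
    return (
        part.text if part is not None else None,
        conformance.text if conformance is not None else None,
    )


print(read_pdfa_claim(SAMPLE_XMP))
```

Running the same function on the XMP of the input and of the output file would show whether the claim itself survived the round trip; it says nothing about whether the file actually conforms.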
The file trailer of the original looks like this:
trailer << /Info 2 0 R /Root 1 0 R /Size 810 /ID [<cc24aff220034a578f97b292897ecfb3><482975899621bd41f614731fdea45046>] >>
whereas the merged one looks like this:
trailer << /Info 2 0 R /Root 1 0 R /Size 783 /ID [<1916dc78292f2472ad342a4f13641c7f><1916dc78292f2472ad342a4f13641c7f>] >>
So we do have the ID keyword, but we violate
(isLinearized == true && firstPageID != null) || ((isLinearized != true) && lastID != null)
https://avepdf.com/pdfa-validation might also help us
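To make the comparison of the two trailers concrete, here is a small sketch that extracts the `/ID` pair from each trailer string and restates the quoted rule as a Python predicate. The helper names are hypothetical, and how veraPDF actually fills `firstPageID` and `lastID` is not shown in this thread, so the predicate is only a literal translation of the expression above.

```python
import re

# Trailer strings copied from the comment above.
ORIGINAL_TRAILER = (
    "trailer << /Info 2 0 R /Root 1 0 R /Size 810 /ID "
    "[<cc24aff220034a578f97b292897ecfb3><482975899621bd41f614731fdea45046>] >>"
)
MERGED_TRAILER = (
    "trailer << /Info 2 0 R /Root 1 0 R /Size 783 /ID "
    "[<1916dc78292f2472ad342a4f13641c7f><1916dc78292f2472ad342a4f13641c7f>] >>"
)


def parse_trailer_ids(trailer_text):
    """Return the two hex strings of the /ID array, or None if absent."""
    match = re.search(
        r"/ID\s*\[\s*<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>\s*\]", trailer_text
    )
    if match is None:
        return None
    return match.group(1).lower(), match.group(2).lower()


def id_rule_holds(is_linearized, first_page_id, last_id):
    """Literal translation of the rule expression quoted above."""
    return bool(
        (is_linearized and first_page_id is not None)
        or (not is_linearized and last_id is not None)
    )


original_ids = parse_trailer_ids(ORIGINAL_TRAILER)
merged_ids = parse_trailer_ids(MERGED_TRAILER)
print(original_ids[0] != original_ids[1])  # original: the two halves differ
print(merged_ids[0] == merged_ids[1])      # merged: both halves are identical
```

One visible difference is that the merged file repeats the same value in both halves of `/ID`, while the original has two distinct values. Whether that is related to the reported violation is not established here.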
This probably should be updated to reflect the deprecation of PdfMerger in favor of PdfWriter.
According to pdfinfo, PDFA-in-a-Nutshell_1b.pdf states its conformance as "Level A, Accessible", and so does the generated PDF file:
>>> from pypdf import PdfReader, PdfWriter
>>> reader = PdfReader('PDFA-in-a-Nutshell_1b.pdf')
>>> metadata = reader.metadata
>>> writer = PdfWriter(clone_from=reader)
>>> writer.add_metadata(metadata)
>>> writer.write('merged.pdf')
(True, <_io.FileIO [closed]>)
>>>
Running this through veraPDF with the PDF/A-1a profile, I get some different issues:
Using the automatically detected profile (PDF/A-1b), only item 5 is reported.
@stefan6419846 We have to try with incremental writing. Also, the only way to ensure the output meets PDF/A-1a would be to have inputs that respect the standard.
What do you think we should do about this issue? Close it as not planned?
We are recommending PdfWriter as the replacement for PdfMerger. Thus, I would recommend at least verifying that, given the above PDF/A-compliant document, using PdfWriter(clone_from="PDFA-in-a-Nutshell_1b.pdf").write("out.pdf") does not destroy the document (possibly using the incremental mode), and verifying that this is indeed documented properly.
The output file with the following code passed!
from pypdf import PdfWriter

# Clone the PDF/A input and write it back out unchanged.
writer = PdfWriter(clone_from="PDFA-in-a-Nutshell_1b.pdf")
writer.write("out.pdf")
Ideally, we find a way to check this in CI as well to ensure that our changes do not accidentally break anything about this.
That is certainly true. That would be a powerful checking tool.
fpdf2 seems to already have some parts of this implemented in the CI, although ignoring PDF/A issues: https://github.com/py-pdf/fpdf2/blob/7784099dadeec551aa78511c06a6d7f525428265/.github/workflows/continuous-integration-workflow.yml#L45-L58
Ideally, we find a way to check this in CI as well to ensure that our changes do not accidentally break anything about this.
We should
The output file with the following code passed!
writer = PdfWriter(clone_from="PDFA-in-a-Nutshell_1b.pdf")
writer.write("out.pdf")
Can you indicate against which standard you've checked the document and using which tool/website? I've tried veraPDF and still got some errors in the XMP form (present in the original).
Sorry, I chose “PDF/A-1b Basic”. I should have chosen “PDF/A-1a”.
Ideally, we find a way to check this in CI as well to ensure that our changes do not accidentally break anything about this.
We should
Seems like some words got lost here? ;)
We should/might prepare a dedicated set of tests to confirm. However, I see two limitations:
a) veraPDF could be a candidate: do we need to set it up in a workflow?
b) We need to identify files that pass at least PDF/A-1a, but preferably also PDF/A-2 a/b/u. We will have to be clear that pypdf has no capability to automatically create or convert to a file compliant with the PDF/A standard.
There are indeed multiple ways for verification. veraPDF is a Java application and should be no real issue in CI.
For the PDF/A standard, we should start with a basic example like the file initially referenced in this issue. IMHO we never claimed that we would be able to generate such a file and I have no plans to change this for now. This does not prevent us from running basic validation as mentioned before, i.e. checking that passing an existing PDF/A file through pypdf does not break it, just to document the current behavior and to catch side effects of other changes.
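As a rough idea of what such a CI step could look like from Python, the sketch below wraps a veraPDF command-line call. Everything specific here is an assumption to verify against the installed veraPDF version: that the executable is on the PATH as `verapdf`, and that it accepts `--flavour` and `--format text`. The helper names are hypothetical, and the sketch deliberately does not interpret the output or the exit code.

```python
import shutil
import subprocess


def build_verapdf_command(pdf_path, flavour="1b"):
    """Build the argument list for a veraPDF run.

    ASSUMPTION: the CLI is called "verapdf" and understands the
    --flavour and --format options; check this against your install.
    """
    return ["verapdf", "--flavour", flavour, "--format", "text", str(pdf_path)]


def run_verapdf(pdf_path, flavour="1b"):
    """Run veraPDF if it is installed; return None when it is not."""
    if shutil.which("verapdf") is None:
        return None
    return subprocess.run(
        build_verapdf_command(pdf_path, flavour),
        capture_output=True,
        text=True,
        check=False,  # leave the verdict to the caller
    )


print(build_verapdf_command("out.pdf", "1a"))
```

A test could call `run_verapdf` on both the input file and the file written by PdfWriter and compare the two reports, which matches the goal of documenting the current behavior rather than promising conformance.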
Using PdfMerger with a single PDF/A-compliant document, I would expect almost exactly the same output file as the input file. But it's way different, and PDF/A compliance is broken.
Code + PDF
Using this as an example document: https://www.pdfa.org/wp-content/uploads/2011/08/PDFA-in-a-Nutshell_1b.pdf
And https://demo.verapdf.org/ to verify if the document is compliant.
Issues
PDFA-in-a-Nutshell_1b.pdf has 6.4 MB and is PDF/A compliant
merged.pdf has 5.0 MB and is NOT PDF/A compliant
verapdf.org mentions that 100 issues were detected. It lists the following 3: