PDF/A confirmation broken after splitting and creating new pdf documents

stojo commented 2 years ago

Since some versions of PyPDF2, the PDF documents that I split and regenerate have been losing their PDF/A conformance (checked with https://avepdf.com/pdfa-validation). These documents are rejected by certain applications that check documents for PDF/A (e.g. DocuSign). It works fine with earlier versions such as PyPDF2 1.28.4.

Maybe helpful (?): Documents split with the newest version of PyPDF2 are about 10 KB smaller than files generated with earlier versions.

Environment

Windows-10-10.0.19042-SP0
PyPDF2==2.10.8

Code (PDFs containing confidential content and therefore not sharable)

This is a minimal, complete example that shows the issue:

from PyPDF2 import PdfFileWriter, PdfFileReader

def read_and_split_document():
    # initialize pdf reader
    try:
        print(">> Reading document....")
        inputpdf = PdfFileReader(open("conventions.pdf", "rb"))
        outlines = inputpdf.getOutlines() 
        sites = inputpdf.numPages
        # ...
        # some more code with different operations following (not relevant for this issue)
        # ...
        output = PdfFileWriter()
        for j in range(start,end-1):
                output.addPage(inputpdf.getPage(j))
                with open("documents/"+page_list[i].get("name")+".pdf", "wb") as outputStream:
                    output.write(outputStream)

pubpub-zz commented 2 years ago

@stojo From your code I understand that you are creating multiple files from one PDF and that the first file will contain only one page. Although you may not be able to share this one-page document, can you at least provide the validation report from the website you mentioned?

stojo commented 2 years ago

@pubpub-zz Yes, I am creating multiple files from one big file. But the first file does not contain only one page. The new files always contain at least 4 pages.

Here is the output from the website:

And here is the full XML error report:

MartinThoma commented 2 years ago

@stojo Do you have any PDF/A compliant document you can share? Can you adjust the example code in such a way that it is minimal and complete (e.g. has all imports and not half of a try-except block)?

pubpub-zz commented 1 year ago

@stojo Can you recheck with the latest version?

pubpub-zz commented 1 year ago

@stojo +1?

geimist commented 1 year ago

I have also had this problem for a long time and now checked it again with version 3.5.1: The PDF version is now correctly declared as 1.7 (with older PyPDF2 versions it became 1.3). But unfortunately it still does not pass the check on https://avepdf.com/de/pdfa-validation.

[Screenshot of the validation result, 2023-03-07 13:18:29]

<?xml version="1.0" encoding="UTF-8"?>
<ValidationReport>
    <VersionInformation ID="GdPicture.NET.14" Version="14.2.19" />
    <ValidationProfile Conformance="PDF/A" Part="1" Level="A" />
    <FileInfo FileName="2023-03-05  TEST1_2.pdf" FileSize="10822 bytes" />
    <ValidationResult IsCompliant="False" Statement="PDF file is not compliant with validation profile requirements." />
    <Details>
        <FailedChecks Count="8">
            <Check ID="MissingXMPMetadata" OccurenceCount="1">
                <Occurence Context="Document" Statement="Document XMP metadata is missing." ObjReference="None" />
            </Check>
            <Check ID="MissingMarkInfoDictionary" OccurenceCount="1">
                <Occurence Context="Document" Statement="MarkInfo dictionary is missing." ObjReference="None" />
            </Check>
            <Check ID="MissingStructTreeRootDictionary" OccurenceCount="1">
                <Occurence Context="Document" Statement="StructTreeRoot dictionary not found." ObjReference="None" />
            </Check>
            <Check ID="FileStructureMissingTrailerIDEntry" OccurenceCount="1">
                <Occurence Context="Document" Statement="The file trailer is missing the ID array entry." ObjReference="None" />
            </Check>
            <Check ID="NoCidDSetEntry" OccurenceCount="4">
                <Occurence Context="Page" PageNumber="2" Statement="The CIDFont subset font has no CIDSet entry." ObjReference="19 0 obj" />
                <Occurence Context="Page" PageNumber="2" Statement="The CIDFont subset font has no CIDSet entry." ObjReference="11 0 obj" />
                <Occurence Context="Page" PageNumber="3" Statement="The CIDFont subset font has no CIDSet entry." ObjReference="19 0 obj" />
                <Occurence Context="Page" PageNumber="3" Statement="The CIDFont subset font has no CIDSet entry." ObjReference="11 0 obj" />
            </Check>
        </FailedChecks>
    </Details>
</ValidationReport>
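A report like the one above can also be summarized programmatically, which makes it easier to compare results across library versions. The following is only a minimal sketch using Python's standard `xml.etree.ElementTree`; the function name `summarize_failed_checks` and the abbreviated `SAMPLE_REPORT` are illustrative assumptions, and the element and attribute names (including the `Occurence` spelling) are taken from the report as shown.

```python
import xml.etree.ElementTree as ET

# Abbreviated excerpt of the report above: only two of the five failed checks
# are kept, and most attributes are dropped to keep the example short.
SAMPLE_REPORT = """
<ValidationReport>
    <Details>
        <FailedChecks>
            <Check ID="MissingXMPMetadata" OccurenceCount="1">
                <Occurence Context="Document" ObjReference="None" />
            </Check>
            <Check ID="NoCidDSetEntry" OccurenceCount="4">
                <Occurence Context="Page" PageNumber="2" ObjReference="19 0 obj" />
                <Occurence Context="Page" PageNumber="2" ObjReference="11 0 obj" />
                <Occurence Context="Page" PageNumber="3" ObjReference="19 0 obj" />
                <Occurence Context="Page" PageNumber="3" ObjReference="11 0 obj" />
            </Check>
        </FailedChecks>
    </Details>
</ValidationReport>
"""


def summarize_failed_checks(report_xml: str) -> dict:
    """Map each failed check ID to the number of occurrences listed for it."""
    root = ET.fromstring(report_xml)
    summary = {}
    for check in root.iter("Check"):
        # Count the child elements rather than trusting the OccurenceCount attribute.
        summary[check.get("ID")] = len(check.findall("Occurence"))
    return summary


summary = summarize_failed_checks(SAMPLE_REPORT)
for check_id, count in summary.items():
    print(f"{check_id}: {count}")
```

Applied to the full report, the per-check counts add up to 1 + 1 + 1 + 1 + 4 = 8, which matches the `Count="8"` on `FailedChecks`; that attribute therefore appears to count occurrences rather than distinct checks.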

MartinThoma commented 1 year ago

Thank you for sharing this, @geimist :heart: I hadn't read "14.7 Logical Structure" before.

Here are a few documents that have it:

pypdf/resources/GeoBase_NHNC1_Data_Model_UML_EN.pdf
pypdf/resources/git.pdf
pypdf/resources/issue-604.pdf
pypdf/resources/issue-914-xmp-data.pdf
pypdf/tests/pdf_cache/book_471.pdf
pypdf/tests/pdf_cache/BreezeMan1.pdf
pypdf/tests/pdf_cache/BreezeMan2.pdf
pypdf/tests/pdf_cache/budgeting-loan-form-sf500.pdf
pypdf/tests/pdf_cache/GeoBaseWithComments.pdf
pypdf/tests/pdf_cache/Giacalone.pdf
pypdf/tests/pdf_cache/iss_1134.pdf
pypdf/tests/pdf_cache/iss1689.pdf
pypdf/tests/pdf_cache/issue_416.pdf
pypdf/tests/pdf_cache/PDF32000_2008.pdf
pypdf/tests/pdf_cache/pypdf-5536984.pdf
pypdf/tests/pdf_cache/st2019.pdf
pypdf/tests/pdf_cache/test_write_outline_item_on_page_fitv.pdf
pypdf/tests/pdf_cache/tika-906769.pdf
pypdf/tests/pdf_cache/tika-911260.pdf
pypdf/tests/pdf_cache/tika-914568.pdf
pypdf/tests/pdf_cache/tika-918137.pdf
pypdf/tests/pdf_cache/tika-923621.pdf
pypdf/tests/pdf_cache/tika-934771.pdf
pypdf/tests/pdf_cache/tika-935981.pdf
pypdf/tests/pdf_cache/tika-941536.pdf
pypdf/tests/pdf_cache/tika-942050.pdf
pypdf/tests/pdf_cache/tika-953770.pdf
pypdf/tests/pdf_cache/tika-959173.pdf
pypdf/tests/pdf_cache/tika-959519.pdf
pypdf/tests/pdf_cache/tika-972174.pdf
pypdf/tests/pdf_cache/tika-972962.pdf
pypdf/tests/pdf_cache/tika-980613.pdf
pypdf/tests/pdf_cache/tika-988698.pdf
pypdf/tests/pdf_cache/tika-992472.pdf
pypdf/tests/pdf_cache/tst_iss1631.pdf

And some more:

The MarkInfo dictionary almost always contains just {'/Marked': True}, sometimes also '/LetterspaceFlags': 0.

MartinThoma commented 1 year ago

@pubpub-zz I only had a quick glance at "14.7 Logical Structure" so far, but this sounds interesting:

(Optional; PDF 1.4) Text that is an exact replacement for the structure element and its children. This replacement text (which should apply to as small a piece of content as possible) is useful when extracting the document’s contents in support of accessibility to users with disabilities or for other purposes (see 14.9.4, “Replacement Text”).

It sounds as if this might improve the text extraction in some cases a lot.

pubpub-zz commented 1 year ago

@geimist / @MartinThoma Remember that PDF/A requires some information that lives not within the pages but in document-level (global) parts of the file. That information is not linked from the pages, so it cannot be copied along with them. Can you try clone_document_from_reader() and check whether the results are better?
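To check quickly whether the document-level entries named in the validation report survived a split, without going through an online validator, a crude byte-level scan can help. The sketch below is a heuristic only, not a PDF/A validator: it assumes the catalog and trailer are stored uncompressed (it gives false negatives when they sit in object or cross-reference streams), and a match on `/Metadata` may come from a page or image rather than the catalog. The names `DOCUMENT_LEVEL_KEYS` and `scan_document_level_keys` are made up for this example.

```python
import re

# Entries whose absence was reported in the failed checks above.
# Each pattern is a plain byte search, not a real PDF parse.
DOCUMENT_LEVEL_KEYS = {
    "XMP metadata (/Metadata)": rb"/Metadata\b",
    "MarkInfo dictionary (/MarkInfo)": rb"/MarkInfo\b",
    "structure tree (/StructTreeRoot)": rb"/StructTreeRoot\b",
    "trailer ID array (/ID [...])": rb"/ID\s*\[",
}


def scan_document_level_keys(pdf_bytes: bytes) -> dict:
    """Report, per entry, whether its key appears anywhere in the raw file bytes."""
    return {
        label: re.search(pattern, pdf_bytes) is not None
        for label, pattern in DOCUMENT_LEVEL_KEYS.items()
    }


# Hand-written fragments standing in for real files (not valid PDFs).
tagged = (
    b"<< /Type /Catalog /MarkInfo << /Marked true >> "
    b"/StructTreeRoot 5 0 R /Metadata 6 0 R >>\n"
    b"trailer << /Root 1 0 R /ID [<01> <02>] >>"
)
stripped = b"<< /Type /Catalog /Pages 2 0 R >>\ntrailer << /Root 1 0 R >>"

for label, found in scan_document_level_keys(stripped).items():
    print(f"{label}: {'present' if found else 'missing'}")
```

For a real file, pass the result of reading it in binary mode, for example `scan_document_level_keys(open(path, "rb").read())`, and compare the output for the original document with the output for a split one.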

pubpub-zz commented 1 year ago

@pubpub-zz I only had a quick glance at "14.7 Logical Structure" so far, but this sounds interesting: (...) It sounds as if this might improve the text extraction in some cases a lot.

Thanks for the tip. So far I have not been able to find out how to use or extract this "replacement text".