Open stojo opened 2 years ago
@stojo From your code I understand that you are creating multiple files from one PDF, and that the first file will contain only one page. Although you may not be able to share this one-page document, can you at least provide the validation report from the website you've indicated?
@pubpub-zz Yes, I am creating multiple files from one big file. But the first file does not contain only one page. The new files always contain at least 4 pages.
Here is output from the website:
And here the full XML error report:
@stojo Do you have any PDF/A compliant document you can share? Can you adjust the example code in such a way that it is minimal and complete (e.g. has all imports and not half of a try-except block)?
@stojo Can you recheck with the latest version?
@stojo +1?
I have also had this problem for a long time and now checked it again with version 3.5.1: The PDF version is now correctly declared as 1.7 (with older PyPDF2 versions it became 1.3). But unfortunately it still does not pass the check on https://avepdf.com/de/pdfa-validation.
<?xml version="1.0" encoding="UTF-8"?>
<ValidationReport>
  <VersionInformation ID="GdPicture.NET.14" Version="14.2.19" />
  <ValidationProfile Conformance="PDF/A" Part="1" Level="A" />
  <FileInfo FileName="2023-03-05 TEST1_2.pdf" FileSize="10822 bytes" />
  <ValidationResult IsCompliant="False" Statement="PDF file is not compliant with validation profile requirements." />
  <Details>
    <FailedChecks Count="8">
      <Check ID="MissingXMPMetadata" OccurenceCount="1">
        <Occurence Context="Document" Statement="Document XMP metadata is missing." ObjReference="None" />
      </Check>
      <Check ID="MissingMarkInfoDictionary" OccurenceCount="1">
        <Occurence Context="Document" Statement="MarkInfo dictionary is missing." ObjReference="None" />
      </Check>
      <Check ID="MissingStructTreeRootDictionary" OccurenceCount="1">
        <Occurence Context="Document" Statement="StructTreeRoot dictionary not found." ObjReference="None" />
      </Check>
      <Check ID="FileStructureMissingTrailerIDEntry" OccurenceCount="1">
        <Occurence Context="Document" Statement="The file trailer is missing the ID array entry." ObjReference="None" />
      </Check>
      <Check ID="NoCidDSetEntry" OccurenceCount="4">
        <Occurence Context="Page" PageNumber="2" Statement="The CIDFont subset font has no CIDSet entry." ObjReference="19 0 obj" />
        <Occurence Context="Page" PageNumber="2" Statement="The CIDFont subset font has no CIDSet entry." ObjReference="11 0 obj" />
        <Occurence Context="Page" PageNumber="3" Statement="The CIDFont subset font has no CIDSet entry." ObjReference="19 0 obj" />
        <Occurence Context="Page" PageNumber="3" Statement="The CIDFont subset font has no CIDSet entry." ObjReference="11 0 obj" />
      </Check>
    </FailedChecks>
  </Details>
</ValidationReport>
Thank you for sharing this @geimist :heart: I hadn't read "14.7 Logical Structure" before.
Here are a few documents that have it:
And some more:
The MarkInfo almost always just contains {'/Marked': True}, sometimes also '/LetterspaceFlags': 0
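For anyone who wants to check these document-level entries locally before uploading to the validator, here is a minimal sketch. The mapping from the report's check IDs to PDF keys is my own reading of the report above, and `missing_document_entries` / `check_pdf` are hypothetical helper names, not part of pypdf. It only tests whether the keys are present; it does not validate their contents and is not a PDF/A validator.

```python
from typing import List, Mapping

# Check IDs as they appear in the validation report, mapped to the PDF key
# each one seems to concern. This mapping is an assumption for illustration.
CATALOG_CHECKS = {
    "MissingXMPMetadata": "/Metadata",
    "MissingMarkInfoDictionary": "/MarkInfo",
    "MissingStructTreeRootDictionary": "/StructTreeRoot",
}
TRAILER_CHECKS = {
    "FileStructureMissingTrailerIDEntry": "/ID",
}


def missing_document_entries(catalog: Mapping, trailer: Mapping) -> List[str]:
    """Return the check IDs whose key is absent from the catalog or trailer."""
    missing = [cid for cid, key in CATALOG_CHECKS.items() if key not in catalog]
    missing += [cid for cid, key in TRAILER_CHECKS.items() if key not in trailer]
    return missing


def check_pdf(path: str) -> List[str]:
    """Hypothetical wrapper that runs the presence check on a file (needs pypdf)."""
    from pypdf import PdfReader  # lazy import: the helper above has no dependency

    reader = PdfReader(path)
    # reader.trailer["/Root"] is the document catalog.
    return missing_document_entries(reader.trailer["/Root"], reader.trailer)
```

Running `check_pdf()` on both the original and the split file should show which of the four document-level entries got lost on the way. The CIDSet failures concern the embedded fonts and are not covered here.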
@pubpub-zz I only had a quick glance at "14.7 Logical Structure" so far, but this sounds interesting:
(Optional; PDF 1.4) Text that is an exact replacement for the structure element and its children. This replacement text (which should apply to as small a piece of content as possible) is useful when extracting the document’s contents in support of accessibility to users with disabilities or for other purposes (see 14.9.4, “Replacement Text”).
It sounds as if this might improve the text extraction in some cases a lot.
@geimist / @MartinThoma
Remember that PDF/A requires some information that is stored not only within the pages but also at the document (global) level. That information is not linked from the pages, so it is not copied when pages are added to a writer.
Can you try clone_document_from_reader() and check whether the results are better?
@pubpub-zz I only had a quick glance at "14.7 Logical Structure" so far, but this sounds interesting: (...) It sounds as if this might improve the text extraction in some cases a lot.
thanks for the tip. For the moment I have not been able to find how to use/extract this "replacement text".
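In case it helps with experimenting: as far as I can tell, the entry described in that quote is `/ActualText` on a structure element, and structure elements hang off `/StructTreeRoot` through their `/K` (kids) entries. Below is a rough sketch that walks `/K` and collects every `/ActualText` it finds. `collect_actual_text` is a hypothetical helper, not a pypdf function, and it does not map the text back to page content.

```python
from typing import Any, List


def collect_actual_text(node: Any) -> List[str]:
    """Collect /ActualText values from a structure tree, in document order."""
    # pypdf objects expose get_object() to resolve indirect references;
    # plain dicts and lists (used for testing) simply pass through.
    resolve = getattr(node, "get_object", None)
    if callable(resolve):
        node = resolve()

    found: List[str] = []
    if isinstance(node, list):
        # /K may be an array of kids.
        for child in node:
            found.extend(collect_actual_text(child))
    elif isinstance(node, dict):
        if "/ActualText" in node:
            found.append(str(node["/ActualText"]))
        if "/K" in node:
            found.extend(collect_actual_text(node["/K"]))
    # Anything else (e.g. an integer marked-content ID) carries no text.
    return found
```

With pypdf this could be called as `collect_actual_text(reader.trailer["/Root"].get("/StructTreeRoot"))`, which returns an empty list when the document has no structure tree.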
Hi, I'm working on the same project as @geimist. I'm not sure if I understood it correctly, but I tried this:
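from PyPDF2 import PdfReader, PdfWriter


def splitt_pdf(pdf_file_name: str, pages, new_name):
    # `pages` is unused here on purpose: this version only clones the document.
    pdf_file_path = pdf_file_name
    file_base_name = pdf_file_path.replace('.pdf', '')
    pdf = PdfReader(pdf_file_path)
    pdf_Writer = PdfWriter()
    # Copy the whole document (pages and document-level data) into the writer.
    pdf_Writer.clone_document_from_reader(pdf)
    file_out = f"{file_base_name}_{new_name}.pdf"
    # The with block closes the file, so no explicit f.close() is needed.
    with open(file_out, 'wb') as f:
        pdf_Writer.write(f)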
So there is no change to the document, just a clone from the reader. The original document passes the validation; the cloned one does not.
@Gthorsten65 Can you provide the original and output files, please?
Yes and no :-) I will do the same with a document that has no personal data in it. Then I will give you the files. Can I upload them here, or how should I do this?
OK, here they are: Test_spiegel_A is the one that passes the test; test_spiegel_A_even fails. The second one was produced with the code above. Test_spiegel_A.pdf test_spiegel_A_even.pdf
Sorry, forget my comments. It is working. The problem on my side was using PyPDF2 :-( With pypdf it is working.
Hmm, OK, sorry. Now I tested it with what we actually want to do, and the validation error comes back.
from pypdf import PdfReader, PdfWriter  # pypdf rather than PyPDF2, as in the previous comment


def splitt_pdf(pdf_file_name: str, pages, new_name):
    pdf_file_path = pdf_file_name
    file_base_name = pdf_file_path.replace('.pdf', '')
    pdf = PdfReader(pdf_file_path)
    # pages = [1, 3, 5]  # page 1, 3, 5 (1-based)
    pdf_Writer = PdfWriter()
    # Copy only the document root, then add the wanted pages one by one.
    pdf_Writer.clone_reader_document_root(pdf)
    # pdf_Writer.clone_document_from_reader(pdf)
    for page_num in pages:
        pdf_Writer.add_page(pdf.pages[page_num - 1])
    file_out = f"{file_base_name}_{new_name}.pdf"
    # The with block closes the file, so no explicit f.close() is needed.
    with open(file_out, 'wb') as f:
        pdf_Writer.write(f)
If I just use clone_document_from_reader and then write the result to disk, the document passes the test. But if I use clone_reader_document_root, then add the pages I need with pdf_Writer.add_page() and write the result to a file, the check fails.
Even with clone_document_from_reader followed by adding pages (as I understand it, this is not correct anyway, because I only want some of the pages), the test fails.
So currently the only way that works is clone_document_from_reader on its own. But then I have too many pages, because I want to split one document into 2 documents.
So am I misunderstanding something, or what is going wrong on my side?
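One approach that might be worth trying, since the full clone is the only variant reported to pass: clone the whole document and then delete the unwanted pages, instead of starting from the root and adding pages. This is an untested sketch, not a confirmed fix. It assumes a pypdf version in which `writer.pages` supports `del`, and the helper names `indices_to_drop` and `split_keeping_document_data` are hypothetical. Whether the result keeps PDF/A conformance would still have to be checked with the validator.

```python
from typing import Iterable, List


def indices_to_drop(total_pages: int, keep: Iterable[int]) -> List[int]:
    """Return 0-based indices of pages to delete, highest first.

    `keep` holds 1-based page numbers, as in the `pages` argument above.
    Deleting from the end keeps the remaining indices valid.
    """
    keep_zero_based = {page - 1 for page in keep}
    bad = sorted(i + 1 for i in keep_zero_based if not 0 <= i < total_pages)
    if bad:
        raise ValueError(f"page numbers out of range: {bad}")
    return [i for i in reversed(range(total_pages)) if i not in keep_zero_based]


def split_keeping_document_data(src: str, dst: str, keep: Iterable[int]) -> None:
    """Clone the full document, then remove every page not listed in `keep`."""
    from pypdf import PdfReader, PdfWriter  # lazy import; needs pypdf installed

    writer = PdfWriter()
    writer.clone_document_from_reader(PdfReader(src))
    for index in indices_to_drop(len(writer.pages), keep):
        # Assumption: this pypdf version allows deleting pages from writer.pages.
        del writer.pages[index]
    with open(dst, "wb") as f:
        writer.write(f)
```

For a two-way split, this would be called once per output file, for example with `keep=[1, 3, 5]` for one half and `keep=[2, 4]` for the other.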
For the last few versions of PyPDF2, the PDF documents that I split and regenerate have been losing their PDF/A conformance (checked with https://avepdf.com/pdfa-validation). Those documents are not accepted by certain applications that check documents for PDF/A (e.g. DocuSign). It works fine with earlier versions such as PyPDF2 1.28.4.
Maybe helpful (?): The size of the documents split with the newest version of PyPDF2 is less (about 10kb) than files generated with former versions.
Environment
Windows-10-10.0.19042-SP0 PyPDF2==2.10.8
Code (PDFs containing confidential content and therefore not sharable)
This is a minimal, complete example that shows the issue: