How to verify corrupted PDF file

py-pdf / pypdf

A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files

https://pypdf.readthedocs.io/en/latest/

How to verify corrupted PDF file #2202

Closed Piloudev closed 1 year ago

Piloudev commented 1 year ago

In my Python program, I need to know whether a PDF is corrupted. How can I achieve this with this library? Is it possible at the moment?

One case to study: take a valid PDF file, open its content in a text editor, copy and paste it into pdf_file.txt, then convert this .txt to .pdf. When you double-click the result, it cannot be opened in a PDF viewer (the viewer reports that the file is corrupted).
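The reproduction above can be approximated in code. This is only a sketch under the assumption that the editor decodes the file as UTF-8 and replaces bytes it cannot decode; real editors may behave differently. `text_round_trip` is a hypothetical helper name, not part of pypdf.

```python
def text_round_trip(data: bytes, encoding: str = "utf-8") -> bytes:
    """Simulate passing binary PDF bytes through a text editor.

    Assumption: the editor decodes with `encoding` and substitutes
    U+FFFD for every byte sequence it cannot decode.
    """
    return data.decode(encoding, errors="replace").encode(encoding)


# A tiny stand-in for PDF content: ASCII markers around two raw binary bytes.
sample = b"%PDF-1.4\n\x80\xff binary stream bytes\n%%EOF\n"
damaged = text_round_trip(sample)

print(damaged == sample)  # False
```

Because the binary bytes are replaced and the length changes, compressed streams and the byte offsets stored in the file no longer match, which is why a viewer rejects the result.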

What I would like is a function that tells me whether a PDF is valid or corrupted.

Thanks, Have a good day :)

stefan6419846 commented 1 year ago

Generally, detecting whether a PDF file completely matches the spec is probably rather complex, and in my experience lots of files would fail such a check. Some PDFs might be partially invalid and could be fixed, while others cannot.

The easiest solution I see with pypdf is to create a PdfReader and do some basic operations with it. If exceptions are raised, your PDF file is most likely invalid.

Piloudev commented 1 year ago

@pubpub-zz @MartinThoma any other suggestions? :)

pubpub-zz commented 1 year ago

The idea @stefan6419846 proposed is also what I would propose: create a PdfReader and, for each page, extract the text and images, collecting any exceptions. If any exception is raised, it means the PDF is quite damaged.

From what I have observed:

a) Some exceptions may not prevent you from extracting what you need (the system may be able to recover from them, or the damage may be in data you do not need).

b) When you open a damaged PDF with Acrobat Reader, most of the time it only fails when you reach the damaged page.

pubpub-zz commented 1 year ago

PS: this is not an issue, so I am converting it to a discussion.