Closed by Piloudev 1 year ago
Generally, detecting whether a PDF file fully conforms to the specification is rather complex, and in my experience many files would fail such a check. Some PDFs are partially invalid and can be fixed, while others cannot.
The easiest solution I see with pypdf is to create a PdfReader and do some basic operations with it. If exceptions are raised, your PDF file is most likely invalid.
@pubpub-zz @MartinThoma any other suggestions? :)
The idea @stefan6419846 proposed is also what I would propose: create a PdfReader and, for each page, extract the text and images while collecting any exceptions. If an exception is raised, the PDF is probably quite damaged.
From what I have observed: a) an exception being raised does not necessarily mean you cannot extract what you need (the library may be able to recover from it, or the damage may be in data you do not need); b) when you open a damaged PDF with Acrobat Reader, most of the time it only fails once you reach the damaged page.
PS: this is not an issue, so I am converting it to a discussion.
In my Python program, I need to know whether a PDF is corrupted or not. How can I achieve this with this library? Is it currently possible?
One case to study: take a valid PDF file, open its content in a text editor, copy and paste it into pdf_file.txt, then convert this .txt file to .pdf. When you double-click it, the file cannot be opened in a PDF viewer (the viewer reports that it is corrupted).
What I would like is a function that tells me whether a PDF is valid or corrupted.
Thanks, have a good day :)