Closed owurman closed 1 year ago
A PDF file should look like this:
with:
A trailer giving the location of the cross-reference table and of certain special objects within the body of the file
The trailer of that file is empty, thus the error.
This command fixes it:
gs -o repaired.pdf -sDEVICE=pdfwrite -dPDFSETTINGS=/prepress 641-Attachment-B-Pediatric-Cardiac-Arrest-8-1-2019.pdf
Thanks. Is it a reasonable ask that a better message be given if the trailer is missing? I'm guessing that actually repairing the PDF as ghostscript does is beyond the scope of what you want pypdf to do...
The problem is not this specific case. Sure, we can (and regularly do) add robustness improvements. It's just a never ending story. There is an infinite number of ways the standard can be broken
I was hoping that we could use similar techniques as web browsers / beautiful soup does for HTML for that problem. I just didn't have the time to look into it so far.
@owurman I've prepared a PR to improve robustness if you want to try it.
The robustness improvement was just added to main
and will be released this weekend with pypdf>3.7.1
.
@owurman If you want I can add you as a contributor to https://pypdf.readthedocs.io/en/latest/meta/CONTRIBUTORS.html
I was trying to get the pages for the attached PDF but received a KeyError: '/Root'. The file appears to be encrypted to me, but pdf.is_encrypted is False.
Environment
Which environment were you using when you encountered the problem?
Code + PDF
It's a public document so it should be fine to add to your tests.
Traceback