Brich40 closed this issue 1 year ago
@Brich40, without the original PDF and a simple, representative test script, we cannot analyse whether there is an issue. Can you provide those?
@pubpub-zz,
Please find below the file (error_file_without_data.pdf) : https://we.tl/t-FEKXwhsZgQ
Also, here is an example of the Python script:
#!/usr/bin/env python3
import pypdf
pdf_reader = pypdf.PdfReader('./error_file_without_data.pdf', strict=False)
pdf_page = pdf_reader.pages[0]
Output :
Invalid parent xref., rebuild xref
Object 2178 0 not defined.
Traceback (most recent call last):
File "/home/obr01/Documents/TMP/pdf/test_pdf.py", line 4, in <module>
pdf_page = pdf_reader.pages[0]
File "/home/obr01/python-venv/opencapture/lib/python3.10/site-packages/pypdf/_page.py", line 2342, in __getitem__
len_self = len(self)
File "/home/obr01/python-venv/opencapture/lib/python3.10/site-packages/pypdf/_page.py", line 2333, in __len__
return self.length_function()
File "/home/obr01/python-venv/opencapture/lib/python3.10/site-packages/pypdf/_reader.py", line 452, in _get_num_pages
self._flatten()
File "/home/obr01/python-venv/opencapture/lib/python3.10/site-packages/pypdf/_reader.py", line 1185, in _flatten
pages = catalog["/Pages"].get_object() # type: ignore
AttributeError: 'NoneType' object has no attribute 'get_object'
pypdf version: 3.5.1
Python version: 3.10.6
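Until a fixed release is available, one way to keep a batch job from crashing on a file like this is to guard the page access. The sketch below is only an illustration: `first_page_or_none` and its `open_reader` parameter are hypothetical names, not part of pypdf, and the broad `except` is an assumption made because the failure above surfaces as a plain `AttributeError` rather than a pypdf-specific error.

```python
from typing import Any, Callable, Optional


def first_page_or_none(
    path: str,
    open_reader: Optional[Callable[[str], Any]] = None,
) -> Optional[Any]:
    """Return the first page of the PDF at `path`, or None if parsing fails.

    `open_reader` is a hypothetical hook: any callable that takes a path and
    returns an object with a `.pages` sequence. By default it opens the file
    with pypdf in non-strict mode, as in the script above.
    """
    if open_reader is None:
        # Imported lazily so the helper can be exercised without pypdf installed.
        import pypdf

        def default_reader(p: str) -> Any:
            return pypdf.PdfReader(p, strict=False)

        open_reader = default_reader

    try:
        reader = open_reader(path)
        return reader.pages[0]
    except Exception as exc:
        # Deliberately broad: a malformed file can fail in several ways
        # (AttributeError in the traceback above, parse errors, and so on).
        print(f"Could not read {path}: {type(exc).__name__}: {exc}")
        return None
```

Calling `first_page_or_none('./error_file_without_data.pdf')` would then return `None` for the problem file instead of raising, so the caller can skip it or send it to a repair step.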
The file is web enhanced, and the page data is inaccessible to basic parsing. Here is the decompressed version, which should be more easily parsed. I see you did something similar with Pike (qpdf runner); you could use pdftk, mutool, qpdf, or cpdf, which may be faster and more direct for decompressing, if that's all it needs (untested by me, as I have no Python to run it): error_file_without_data2.pdf
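For anyone who wants to try the decompression route from a script, here is a minimal sketch that shells out to qpdf. It assumes the `qpdf` executable is on the PATH; the function names are hypothetical, and whether the rewritten file then parses cleanly has not been verified in this thread. The `--qdf` and `--object-streams=disable` options are the ones qpdf documents for producing an uncompressed, editable file.

```python
import shutil
import subprocess
from typing import List


def qpdf_decompress_command(src: str, dst: str) -> List[str]:
    """Build the qpdf command line that rewrites `src` uncompressed into `dst`."""
    return ["qpdf", "--qdf", "--object-streams=disable", src, dst]


def decompress_with_qpdf(src: str, dst: str) -> bool:
    """Run qpdf on `src`; return True if an output file was produced.

    Returns False when qpdf is not installed or reports an error.
    """
    if shutil.which("qpdf") is None:
        return False
    result = subprocess.run(qpdf_decompress_command(src, dst), check=False)
    # qpdf exits with 0 on success and 3 when it succeeded with warnings.
    return result.returncode in (0, 3)
```

A call such as `decompress_with_qpdf('error_file_without_data.pdf', 'decompressed.pdf')` would write the rewritten copy next to the original, which can then be opened with `pypdf.PdfReader` as before.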
Adding the file in the thread for testing: error_file_without_data.pdf
The issue has been identified and solved: I had to extend the range searched for the xref. This case occurs with linearized files. PR in progress.
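To illustrate the general idea of widening the search for the xref (this is not the code from the PR, just a sketch under stated assumptions): read the offset that follows the last `startxref` keyword and, if the `xref` keyword is not exactly there, look for it within a window around that offset. The function names and the default window size are hypothetical.

```python
from typing import Optional


def read_startxref(data: bytes) -> Optional[int]:
    """Return the byte offset written after the last `startxref` keyword."""
    pos = data.rfind(b"startxref")
    if pos == -1:
        return None
    tokens = data[pos + len(b"startxref"):].split()
    if not tokens or not tokens[0].isdigit():
        return None
    return int(tokens[0])


def find_xref_near(data: bytes, offset: int, window: int = 1024) -> Optional[int]:
    """Return the position of the `xref` keyword closest to `offset`.

    Only `window` bytes on either side of `offset` are searched; a larger
    window tolerates a less accurate offset. Returns None if nothing is found.
    """
    start = max(0, offset - window)
    end = min(len(data), offset + window)
    best: Optional[int] = None
    pos = data.find(b"xref", start, end)
    while pos != -1:
        # Skip the "xref" that is part of the "startxref" keyword.
        if data[max(0, pos - 5):pos] != b"start":
            if best is None or abs(pos - offset) < abs(best - offset):
                best = pos
        pos = data.find(b"xref", pos + 1, end)
    return best
```

With a small window a slightly wrong offset yields no match, while a wider one recovers the table position; the trade-off is a higher chance of matching an unrelated `xref` string inside stream data.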
@pubpub-zz Thank you for fixing this :pray:
The fix is merged to main
and will be in pypdf>3.5.1 (which I will release today)
@pubpub-zz, @MartinThoma,
Thank you for your work !
Hello,
I'm using pypdf to get pages from a PDF file, which works fine for most files. But for one specific file, I'm getting the exception below:
Apparently this comes from the value of "/Pages" in the Reader trailer, which is "None" for this file:
Output :
Is there any way to handle this case?
Thanks,