py-pdf / pypdf

A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files
https://pypdf.readthedocs.io/en/latest/

AttributeError: 'NoneType' object has no attribute 'get_object' #1689

Closed Brich40 closed 1 year ago

Brich40 commented 1 year ago

Hello,

I'm using pypdf to get pages from a pdf file, which is working fine for most of the files. But for a specific file, I'm getting the exception below:

  File "/home/obr01/python-venv/opencapture/lib/python3.10/site-packages/pypdf/_page.py", line 2342, in __getitem__
    len_self = len(self)
  File "/home/obr01/python-venv/opencapture/lib/python3.10/site-packages/pypdf/_page.py", line 2333, in __len__
    return self.length_function()
  File "/home/obr01/python-venv/opencapture/lib/python3.10/site-packages/pypdf/_reader.py", line 452, in _get_num_pages
    self._flatten()
  File "/home/obr01/python-venv/opencapture/lib/python3.10/site-packages/pypdf/_reader.py", line 1185, in _flatten
    pages = catalog["/Pages"].get_object()  # type: ignore
AttributeError: 'NoneType' object has no attribute 'get_object'

Apparently this comes from the value of "/Pages" in the reader's trailer, which is None for this file:


import pypdf

# file_path is the path to the affected PDF
pdf_reader = pypdf.PdfReader(file_path, strict=False)
print(pdf_reader.trailer['/Root']['/Pages'])

Output :

Object 2178 0 not defined.
None

Is there any way to handle this case?

Thanks,

pubpub-zz commented 1 year ago

@Brich40, without the original PDF and a simple, representative test script we cannot analyse whether there is an issue. Can you provide those?

Brich40 commented 1 year ago

@pubpub-zz,

Please find below the file (error_file_without_data.pdf) : https://we.tl/t-FEKXwhsZgQ

Also, this is an example of the python script :

#!/usr/bin/env python3

import pypdf

pdf_reader = pypdf.PdfReader('./error_file_without_data.pdf', strict=False)
pdf_page = pdf_reader.pages[0] 

Output :

Invalid parent xref., rebuild xref
Object 2178 0 not defined.
Traceback (most recent call last):
  File "/home/obr01/Documents/TMP/pdf/test_pdf.py", line 4, in <module>
    pdf_page = pdf_reader.pages[0]
  File "/home/obr01/python-venv/opencapture/lib/python3.10/site-packages/pypdf/_page.py", line 2342, in __getitem__
    len_self = len(self)
  File "/home/obr01/python-venv/opencapture/lib/python3.10/site-packages/pypdf/_page.py", line 2333, in __len__
    return self.length_function()
  File "/home/obr01/python-venv/opencapture/lib/python3.10/site-packages/pypdf/_reader.py", line 452, in _get_num_pages
    self._flatten()
  File "/home/obr01/python-venv/opencapture/lib/python3.10/site-packages/pypdf/_reader.py", line 1185, in _flatten
    pages = catalog["/Pages"].get_object()  # type: ignore
AttributeError: 'NoneType' object has no attribute 'get_object'

pypdf version: 3.5.1
Python version: 3.10.6
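
As a stopgap for the traceback above, a caller could guard the page access so that one unreadable file does not abort a batch. This is only a minimal sketch: `first_page_or_none` is a hypothetical helper, not part of pypdf, and it assumes the failure surfaces as the `AttributeError` shown in the traceback. It does not repair the file, it only reports the problem and moves on.

```python
import sys
from typing import Any, Optional


def first_page_or_none(reader: Any) -> Optional[Any]:
    """Return reader.pages[0], or None if the page tree cannot be resolved.

    `reader` is expected to behave like a pypdf.PdfReader (hypothetical helper).
    """
    try:
        return reader.pages[0]
    except (AttributeError, IndexError) as exc:
        # AttributeError: "/Pages" resolved to None, as in the traceback above.
        # IndexError: the document has no pages at all.
        print(f"Could not read pages: {exc}", file=sys.stderr)
        return None
```

With pypdf installed, it could be called as `first_page_or_none(pypdf.PdfReader(path, strict=False))`, skipping the file when the result is `None`.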

GitHubRulesOK commented 1 year ago

The file is web enhanced, and the pages data is inaccessible to basic parsing. Here is the decompressed version, which should be more easily parsed. I see you did something similar with Pike (a qpdf runner); you could use pdftk, mutool, qpdf or cpdf, which would be faster and more direct for decompressing, if that's all it needs (untested by me, as I don't have any Python to run it): error_file_without_data2.pdf
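
The decompression step suggested above could be scripted roughly as follows. This is a sketch under assumptions: the `qpdf` binary must be installed and on the PATH, the function names are hypothetical, and whether the rewritten file then parses with pypdf 3.5.1 was not verified in this thread. `--qdf` and `--object-streams=disable` are qpdf options that write an uncompressed, editor-friendly copy.

```python
import subprocess
from typing import List


def build_qpdf_command(src: str, dst: str) -> List[str]:
    """Build a qpdf command line that writes an uncompressed copy of src to dst."""
    return ["qpdf", "--qdf", "--object-streams=disable", src, dst]


def decompress_pdf(src: str, dst: str) -> None:
    """Run qpdf; raises CalledProcessError if qpdf exits with a non-zero status."""
    subprocess.run(build_qpdf_command(src, dst), check=True)
```

For example, `decompress_pdf("error_file_without_data.pdf", "decompressed.pdf")` would produce a copy to try with the reader. Note that qpdf can exit with a non-zero status for warnings on damaged files, so a caller may prefer to inspect the return code instead of using `check=True`.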

pubpub-zz commented 1 year ago

adding the file in the thread for testing: error_file_without_data.pdf

The issue has been identified and solved: I had to extend the range in which the xref is searched for. This case occurs with linearized files. PR in progress.

MartinThoma commented 1 year ago

@pubpub-zz Thank you for fixing this :pray:

The fix is merged to main and will be in pypdf>3.5.1 (which I will release today)
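
To check whether an installed pypdf is newer than 3.5.1, and so should contain the fix according to the comment above, a small version comparison is enough. A minimal sketch: `release_tuple` and `includes_fix` are hypothetical helper names, and only the numeric major.minor.patch part of the version string is compared.

```python
import re
from typing import Tuple

LAST_AFFECTED = (3, 5, 1)  # version reported in this issue


def release_tuple(version: str) -> Tuple[int, ...]:
    """Turn '3.5.2' (or '4.0.0rc1') into a tuple of ints such as (3, 5, 2)."""
    parts = []
    for piece in version.split(".")[:3]:
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def includes_fix(version: str) -> bool:
    """True if the version is strictly newer than the last affected release."""
    return release_tuple(version) > LAST_AFFECTED
```

The installed version string can be obtained with `importlib.metadata.version("pypdf")` and passed to `includes_fix`.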

Brich40 commented 1 year ago

@pubpub-zz, @MartinThoma,

Thank you for your work !