py-pdf / pypdf

A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files
https://pypdf.readthedocs.io/en/latest/

PdfWriter().append throwing 'NullObject' object is not subscriptable for a specific PDF file #2958

Closed eth-wa closed 2 days ago

eth-wa commented 6 days ago

I am using pypdf to merge PDF files. I hadn't had an issue until I hit one specific PDF file. Every other PDF file merges fine, even ones from the same "bundle" (plan set). I have no idea why this PDF file causes a problem in the writer, or whether there is any workaround or adjustment I can make to the PDF file beforehand.

Environment

$ python -m platform
Windows-11-10.0.22631-SP0

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==5.1.0, crypt_provider=('local_crypt_fallback', '0.0.0'), PIL=none

Code + PDF

This is a minimal, complete example that shows the issue:

import glob
import pypdf

fromp = '\\\\UNC\\Path\\*.pdf' #actual path exists, other PDF files merge fine

merger = pypdf.PdfWriter()

paths = glob.glob(fromp)
paths.sort()

for pdf in paths:
    merger.append(pdf) #Get error here on the problem PDF file

File link below: C0.00 - COVER SHEET.pdf

Traceback

This is the complete traceback I see:

Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "C:\Users\ethanwade\AppData\Roaming\Python\Python312\site-packages\pypdf\_writer.py", line 2638, in append
    self.merge(
  File "C:\Users\ethanwade\AppData\Roaming\Python\Python312\site-packages\pypdf\_writer.py", line 2796, in merge
    lst = self._insert_filtered_annotations(
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\ethanwade\AppData\Roaming\Python\Python312\site-packages\pypdf\_writer.py", line 2994, in _insert_filtered_annotations
    p = self._get_cloned_page(d[0], pages, reader)
                              ~^^^
TypeError: 'NullObject' object is not subscriptable

stefan6419846 commented 6 days ago

Thanks for the report. This specific PDF declares a destination of null, which does not look right:

58 0 obj
<<
/Border [0 0 0]
/A <<
/Type /Action
/S /GoTo
/D null
>>
/NM (AJNXWZZHGGPOHNQA)
/Rect [1995 2679 2007 2706]
/Subtype /Link
>>
endobj
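
To find which files in a bundle contain an object like the one above, a raw byte scan can serve as a rough first check. This is only a sketch using the standard library, not part of pypdf: the names `count_null_destinations` and `find_suspect_files` are made up for illustration, and the scan only sees objects stored uncompressed in the file. Objects packed into compressed object streams will be missed, and a `/D null` entry in some unrelated dictionary would also be counted.

```python
import re
from pathlib import Path

# Matches the name "/D" followed by whitespace and the keyword "null"
# as a whole token (so "/D nullable" or "/DA null" do not match).
NULL_DEST = re.compile(rb"/D\s+null(?![A-Za-z0-9])")


def count_null_destinations(data: bytes) -> int:
    """Count raw '/D null' entries in uncompressed PDF bytes."""
    return len(NULL_DEST.findall(data))


def find_suspect_files(paths):
    """Return the paths whose raw bytes contain at least one '/D null' entry."""
    suspects = []
    for path in paths:
        data = Path(path).read_bytes()
        if count_null_destinations(data) > 0:
            suspects.append(path)
    return suspects
```

Calling `find_suspect_files(glob.glob(fromp))` with the pattern from the report would list candidate files to inspect before merging. A clean result does not prove a file is safe, for the reasons given above.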

Inserting

                if isinstance(d, NullObject):
                    continue

after https://github.com/py-pdf/pypdf/blob/bd5b962e81c7cd9ad29aa7ff7dedb4197326ebbb/pypdf/_writer.py#L3002 seems to fix this.
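
To illustrate why the guard helps, here is a simplified model of the annotation loop. It is not pypdf's actual code: `NullObject` below is a stand-in class, the annotations are plain dicts, and `keep_resolvable_annotations` is a hypothetical name. The sketch only shows the control flow, where a null destination is skipped before anything tries to index into it.

```python
class NullObject:
    """Stand-in for pypdf's NullObject (assumption: simplified for illustration)."""


def keep_resolvable_annotations(annotations):
    """Return annotations, skipping links whose action destination is null."""
    kept = []
    for annot in annotations:
        action = annot.get("/A")
        if annot.get("/Subtype") == "/Link" and action is not None and "/D" in action:
            d = action["/D"]
            # The guard proposed above: a null destination cannot be resolved.
            if isinstance(d, NullObject):
                continue
            # Without the guard, indexing a NullObject here raises
            # TypeError: 'NullObject' object is not subscriptable.
            _target_page = d[0]
        kept.append(annot)
    return kept


annotations = [
    {"/Subtype": "/Link", "/A": {"/S": "/GoTo", "/D": NullObject()}},
    {"/Subtype": "/Link", "/A": {"/S": "/GoTo", "/D": [0, "/Fit"]}},
    {"/Subtype": "/Text"},
]
kept = keep_resolvable_annotations(annotations)
```

In this model the first annotation is dropped and the other two are kept, which mirrors the effect described for the one-line fix: the broken link is not carried over, and the rest of the page merges normally.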