py-pdf / pypdf

A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files
https://pypdf.readthedocs.io/en/latest/

#7 Using PdfReader causes a crash #2886

Open Avgor46 opened 1 month ago

Avgor46 commented 1 month ago

Hi!

I've found an IndexError that occurs when the PDF file is relatively large. The necessary information is provided below.

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
Linux-5.15.0-56-generic-x86_64-with-glibc2.31

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==5.0.1, crypt_provider=('cryptography', '3.1'), PIL=none

commit 8e1799e7836e77a637291dea9cb766b6a873f43b

Code + PDF

This is a minimal, complete example that shows the issue:

#! /usr/bin/env python3

import pypdf
from pypdf.errors import EmptyFileError, PdfReadError, PdfStreamError
import sys

def TestOneInput(fname):
    try:
        pdf_reader = pypdf.PdfReader(fname)
        for page_number, page in enumerate(pdf_reader.pages):
            page.extract_text()
    except (EmptyFileError, PdfReadError, PdfStreamError):
        pass

if __name__ == "__main__":
    if len(sys.argv) < 2:
        sys.exit(1)
    TestOneInput(sys.argv[1])

PoC

crash-e8a85d82de01cab5eb44e7993304d8b9d1544970.pdf

Traceback

This is the complete stderr I see:

entry <entry> in Xref table invalid but object found
...
entry <entry> in Xref table invalid; object not found
Traceback (most recent call last):
  File "/fuzz/./poc.py", line 18, in <module>
    TestOneInput(sys.argv[1])
  File "/fuzz/./poc.py", line 9, in TestOneInput
    pdf_reader = pypdf.PdfReader(fname)
  File "/usr/local/lib/python3.9/dist-packages/pypdf/_reader.py", line 132, in __init__
    self._initialize_stream(stream)
  File "/usr/local/lib/python3.9/dist-packages/pypdf/_reader.py", line 154, in _initialize_stream
    self.read(stream)
  File "/usr/local/lib/python3.9/dist-packages/pypdf/_reader.py", line 615, in read
    self._read_xref_tables_and_trailers(stream, startxref, xref_issue_nr)
  File "/usr/local/lib/python3.9/dist-packages/pypdf/_reader.py", line 871, in _read_xref_tables_and_trailers
    startxref = self._read_xref(stream)
  File "/usr/local/lib/python3.9/dist-packages/pypdf/_reader.py", line 910, in _read_xref
    self._read_standard_xref_table(stream)
  File "/usr/local/lib/python3.9/dist-packages/pypdf/_reader.py", line 781, in _read_standard_xref_table
    while line[0] in b"\x0D\x0A":
IndexError: index out of range

sgonsal commented 5 days ago

@stefan6419846 Maybe we can add a check after this line to see whether line is empty (it is a bytes object, so reading at the end of the stream gives b"")? https://github.com/py-pdf/pypdf/blob/98aa9742e757ec428e1953ba7f47c6d7c44b331a/pypdf/_reader.py#L779
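
To make the idea concrete, here is a rough, self-contained sketch of such a check. It is not pypdf's actual code: read_xref_entry, TruncatedXrefError and ENTRY_SIZE are hypothetical names invented for illustration, and the 20-byte entry width is an assumption based on the classic xref table layout. The point is only that slicing with line[:1] never raises on empty bytes, whereas line[0] does, so a truncated stream can be turned into a clean, catchable error.

```python
import io

ENTRY_SIZE = 20  # assumed fixed width of a classic xref table entry


class TruncatedXrefError(Exception):
    """Hypothetical stand-in for a library-specific read error."""


def read_xref_entry(stream):
    """Read one xref entry, skipping leading CR/LF bytes.

    Raises TruncatedXrefError instead of IndexError when the stream
    ends before an entry could be read.
    """
    line = stream.read(ENTRY_SIZE)
    # line[:1] is b"" at end of stream, so this comparison cannot raise.
    while line[:1] in (b"\r", b"\n"):
        # Step back to one byte past the previous start and re-read.
        stream.seek(-len(line) + 1, io.SEEK_CUR)
        line = stream.read(ENTRY_SIZE)
    if not line:
        raise TruncatedXrefError("unexpected end of stream in xref table")
    return line


if __name__ == "__main__":
    # An entry preceded by a stray CRLF is still read correctly.
    print(read_xref_entry(io.BytesIO(b"\r\n0000000000 65535 f \n")))
    # A truncated (empty) stream now fails with a dedicated error.
    try:
        read_xref_entry(io.BytesIO(b""))
    except TruncatedXrefError as exc:
        print("caught:", exc)
```

In the real reader the error type would presumably be one of the existing pypdf exceptions so that callers catching PdfReadError keep working, but that choice is up to the maintainers.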

stefan6419846 commented 4 days ago

Feel free to submit a corresponding PR. As long as all tests (including a new one) keep passing, I do not see an issue with merging it.