py-pdf / pypdf

A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files
https://pypdf.readthedocs.io/en/latest/

Invalid image lookup tables not handled correctly #2110

Closed stefan6419846 closed 1 year ago

stefan6419846 commented 1 year ago

Invalid image lookup tables do not seem to be handled correctly: the decoding code can end up iterating over None. See https://github.com/py-pdf/pypdf/blob/89eb626a7a7e22937b9216e817f5882431196b24/pypdf/filters.py#L900-L926

Here you can see that in line 905 the lookup table will be set to None, but both line 908 and lines 915-916 try to iterate over a possibly None value. The condition in line 924 is too late to prevent issues.
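
To illustrate the ordering problem, here is a minimal sketch of an early guard. It is not the actual pypdf code: `expand_lookup` is a hypothetical helper, and the 4-byte grouping only mirrors the `range(0, 4 * (len(lookup) // 4), 4)` expression that shows up in the traceback. The assumption is that an invalid table is represented as None, as described above.

```python
from typing import List, Optional, Tuple


def expand_lookup(lookup: Optional[bytes]) -> Optional[List[Tuple[int, ...]]]:
    """Group a raw lookup table into 4-byte entries.

    Hypothetical helper for illustration only; not part of pypdf.
    """
    # Guard first: calling len() on None or iterating over it raises
    # TypeError, so the check has to happen before any use of the table.
    if lookup is None:
        return None
    # Trailing bytes that do not fill a complete 4-byte entry are dropped.
    return [
        tuple(lookup[n : n + 4])
        for n in range(0, 4 * (len(lookup) // 4), 4)
    ]
```

With the check placed before the first use, a None table is passed through instead of raising, and the caller can decide how to handle the image.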

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
Linux-5.14.21-150400.24.81-default-x86_64-with-glibc2.31

$ python -c "import pypdf;print(pypdf.__version__)"
3.15.2

Code + PDF

This is a minimal, complete example that shows the issue:

from pypdf import PdfReader

for page in PdfReader('file.pdf').pages:
    for key in page.images.keys():
        print(key)
        page.images[key].image.convert('RGB').save(key[1:] + '.png')

Traceback

This is the complete traceback I see:

Invalid Lookup Table in {'/BitsPerComponent': 8, '/ColorSpace': IndirectObject(37, 0, 140090665353664), '/Filter': '/FlateDecode', '/Height': 77, '/Subtype': '/Image', '/Type': '/XObject', '/Width': 106}
Traceback (most recent call last):
  File "/home/stefan/temp/run.py", line 9, in <module>
    print(page.images[key].indirect_reference)
  File "/home/stefan/temp/venv/lib/python3.9/site-packages/pypdf/_page.py", line 2636, in __getitem__
    return self.get_function(index)
  File "/home/stefan/temp/venv/lib/python3.9/site-packages/pypdf/_page.py", line 544, in _get_image
    imgd = _xobj_to_image(cast(DictionaryObject, xobjs[id]))
  File "/home/stefan/temp/venv/lib/python3.9/site-packages/pypdf/filters.py", line 1026, in _xobj_to_image
    img, image_format, extension, invert_color = _handle_flate(
  File "/home/stefan/temp/venv/lib/python3.9/site-packages/pypdf/filters.py", line 916, in _handle_flate
    for n in range(0, 4 * (len(lookup) // 4), 4)
TypeError: object of type 'NoneType' has no len()
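
As a possible caller-side workaround, and only a sketch: since the traceback shows the error being raised while the image is looked up through `__getitem__`, the lookup can be wrapped so that one undecodable image does not abort the whole loop. `iter_decodable_images` is a hypothetical helper, not a pypdf API; it only assumes that the object passed in offers `keys()` and item access, as `page.images` does in the example above.

```python
from typing import Any, Iterator, Tuple


def iter_decodable_images(images: Any) -> Iterator[Tuple[str, Any]]:
    """Yield (key, image) pairs, skipping images whose lookup fails.

    Hypothetical helper for illustration only; not part of pypdf.
    """
    for key in images.keys():
        try:
            # Decoding happens on item access, so this is where the
            # TypeError from the traceback would surface.
            image = images[key]
        except TypeError as exc:
            print(f"Skipping {key}: {exc}")
            continue
        yield key, image
```

It could then be used as `for key, img in iter_decodable_images(page.images): ...` in place of the inner loop of the example. This hides the symptom only; the invalid lookup table is still not handled by the decoder.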

pubpub-zz commented 1 year ago

@stefan6419846 Can you provide the PDF file?

stefan6419846 commented 1 year ago

A reproducing file has been sent to Martin directly for privacy reasons.