py-pdf / pypdf

A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files
https://pypdf.readthedocs.io/en/latest/

KeyError: '/Width' when extracting an image #2070

Closed MartinThoma closed 1 year ago

MartinThoma commented 1 year ago

I was trying to extract an image and got an exception.

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
Linux-5.4.0-155-generic-x86_64-with-glibc2.31

$ python -c "import pypdf;print(pypdf.__version__)"
3.15.0

Code + PDF

This is a minimal, complete example that shows the issue:

from pypdf import PdfReader

reader = PdfReader("sample-files/009-pdflatex-geotopo/GeoTopo.pdf")
page_index = 30
image_key = "/Fm22"
actual_image = reader.pages[page_index].images[image_key]

Traceback

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/moose/Github/py-pdf/pypdf/pypdf/_page.py", line 2621, in __getitem__
    return self.get_function(index)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/moose/Github/py-pdf/pypdf/pypdf/_page.py", line 534, in _get_image
    imgd = _xobj_to_image(cast(DictionaryObject, xobjs[id]))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/moose/Github/py-pdf/pypdf/pypdf/filters.py", line 965, in _xobj_to_image
    size = (x_object_obj[IA.WIDTH], x_object_obj[IA.HEIGHT])
            ~~~~~~~~~~~~^^^^^^^^^^
  File "/home/moose/Github/py-pdf/pypdf/pypdf/generic/_data_structures.py", line 320, in __getitem__
    return dict.__getitem__(self, key).get_object()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
KeyError: '/Width'

pubpub-zz commented 1 year ago

@MartinThoma When I look at the object I get:

{'/Type': '/XObject', '/Subtype': '/Form', '/BBox': [0, 0, 14.173, 80.463], '/FormType': 1, '/Matrix': [1, 0, 0, 1, 0, 0], '/Resources': IndirectObject(767, 0, 1777394597648), '/Filter': '/FlateDecode'}

It is a /Form XObject (i.e. a self-contained content stream), not an image; for an image, the /Subtype would have to be /Image.

If you look at the page's image list, it is not in there:

reader.pages[30].images.keys()
  # -> ['/Im10']
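The distinction above can be checked before indexing into the image list. The following is a minimal sketch, not pypdf's own logic: `group_xobjects_by_subtype` is a hypothetical helper, and the sample data is a plain-dict stand-in modelled on the object printed above (the `/Im10` width and height are made-up values). With pypdf, the mapping to pass in would presumably be `reader.pages[30]["/Resources"]["/XObject"]`, but that is an assumption and is not exercised here.

```python
from typing import Any, Dict, List, Mapping


def group_xobjects_by_subtype(
    xobjects: Mapping[str, Mapping[str, Any]],
) -> Dict[str, List[str]]:
    """Group XObject names by their /Subtype value (e.g. /Image, /Form)."""
    groups: Dict[str, List[str]] = {}
    for name in xobjects:
        # Subscript access (rather than .items()) so that mapping types which
        # resolve references in __getitem__ hand back the actual dictionary.
        xobj = xobjects[name]
        subtype = str(xobj["/Subtype"]) if "/Subtype" in xobj else "(missing)"
        groups.setdefault(subtype, []).append(name)
    return groups


# Hypothetical stand-in for the page's /XObject resources.
# "/Fm22" mirrors the dictionary shown in the comment above;
# the "/Im10" entry uses invented /Width and /Height values.
sample_xobjects = {
    "/Fm22": {
        "/Type": "/XObject",
        "/Subtype": "/Form",
        "/BBox": [0, 0, 14.173, 80.463],
        "/FormType": 1,
    },
    "/Im10": {
        "/Type": "/XObject",
        "/Subtype": "/Image",
        "/Width": 100,
        "/Height": 50,
    },
}

groups = group_xobjects_by_subtype(sample_xobjects)
image_names = groups.get("/Image", [])

print(groups)
# Only names listed under /Image carry /Width and /Height,
# so only those are safe to treat as images.
print("/Fm22" in image_names)
```

Checking the key against the names grouped under `/Image` (or against `images.keys()` as shown above) before indexing avoids asking a /Form XObject for a `/Width` entry it does not have.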

pubpub-zz commented 1 year ago

I'm closing this issue as operator error 😉

MartinThoma commented 1 year ago

Uh, very interesting. I don't know what happened there 😅