jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License

`KeyError` raised when `laparams` set #383

Closed alexreg closed 3 years ago

alexreg commented 3 years ago

Describe the bug

I get the following exception when calling `page.objects["char"]`. This only occurs when I open the PDF with `laparams` set.

  File "foo.py", line 30
    chars = left_column.objects["char"]
  File "/usr/local/lib/python3.9/site-packages/pdfplumber/page.py", line 358, in objects
    self._objects = self.crop_fn(self.parent_page.objects, self.bbox)
  File "/usr/local/lib/python3.9/site-packages/pdfplumber/page.py", line 358, in objects
    self._objects = self.crop_fn(self.parent_page.objects, self.bbox)
  File "/usr/local/lib/python3.9/site-packages/pdfplumber/utils.py", line 477, in crop_to_bbox
    return dict((k, crop_to_bbox(v, bbox)) for k, v in objs.items())
  File "/usr/local/lib/python3.9/site-packages/pdfplumber/utils.py", line 477, in <genexpr>
    return dict((k, crop_to_bbox(v, bbox)) for k, v in objs.items())
  File "/usr/local/lib/python3.9/site-packages/pdfplumber/utils.py", line 481, in crop_to_bbox
    cropped = list(filter(None, (clip_obj(obj, bbox) for obj in objs)))
  File "/usr/local/lib/python3.9/site-packages/pdfplumber/utils.py", line 481, in <genexpr>
    cropped = list(filter(None, (clip_obj(obj, bbox) for obj in objs)))
  File "/usr/local/lib/python3.9/site-packages/pdfplumber/utils.py", line 422, in clip_obj
    overlap = get_bbox_overlap(obj_to_bbox(obj), bbox)
KeyError: 'x0'
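
The last frame of the traceback suggests that the bounding-box lookup on some object failed because that object had no `x0` key. The sketch below is a toy illustration of that failure mode only; `obj_to_bbox` and `get_bbox_overlap` here are simplified stand-ins written for this example (not pdfplumber's actual implementation), and the shape of `no_coords` is an assumption.

```python
from operator import itemgetter

# Simplified stand-in for a bounding-box accessor (an assumption for
# illustration; not copied from pdfplumber).
obj_to_bbox = itemgetter("x0", "top", "x1", "bottom")


def get_bbox_overlap(a, b):
    """Return the overlapping box of a and b, or None if they do not overlap."""
    x0, top = max(a[0], b[0]), max(a[1], b[1])
    x1, bottom = min(a[2], b[2]), min(a[3], b[3])
    if x1 > x0 and bottom > top:
        return (x0, top, x1, bottom)
    return None


crop_box = (0, 0, 15, 15)

# An object that carries coordinates clips without trouble.
char = {"x0": 10, "top": 10, "x1": 20, "bottom": 20}
overlap = get_bbox_overlap(obj_to_bbox(char), crop_box)

# A hypothetical object with no coordinate keys makes the lookup itself fail,
# before any overlap is computed.
no_coords = {"object_type": "example"}
missing_key = None
try:
    get_bbox_overlap(obj_to_bbox(no_coords), crop_box)
except KeyError as exc:
    missing_key = exc.args[0]

print(overlap, missing_key)
```

Because the accessor is a C-level callable in this sketch, the exception surfaces on the calling line, which matches the shape of the traceback above.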

Code to reproduce the problem

from decimal import Decimal

import pdfplumber

with pdfplumber.open("serials.pdf", laparams = {}) as pdf:
    for page in pdf.pages:
        contents = page.crop(
            (
                Decimal(100),
                # Parses as (70 + 200) if page 1 else 0, so the top offset is 0 on later pages.
                Decimal(70 + 200 if page.page_number == 1 else 0),
                page.width - Decimal(100),
                page.height - Decimal(70),
            ),
        )
        left_column = contents.crop(
            (
                Decimal(0),
                Decimal(0),
                contents.width * Decimal(0.5),
                contents.height,
            ),
            relative = True,
        )
        right_column = contents.crop(
            (
                contents.width * Decimal(0.5),
                Decimal(0),
                contents.width,
                contents.height,
            ),
            relative = True,
        )

        chars = left_column.objects["char"]

PDF file

https://mathscinet.ams.org/msnhtml/serials.pdf

Expected behavior

No error (exception) should be raised.

Actual behavior

The above exception (KeyError) is raised.

Screenshots

left_column: [screenshot: left_column]

right_column: [screenshot: right_column]

Environment

Additional context

None

jsvine commented 3 years ago

Thanks for flagging. I'll take a look.

jsvine commented 3 years ago

Thanks for flagging this, @alexreg. Commit/PR above should handle this. I'll close this issue when/if the PR is merged.

jsvine commented 3 years ago

This was fixed by the PR above; belatedly closing this issue.
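
For readers on an affected version, one defensive pattern is to skip objects that lack bounding-box keys before clipping. The following is a minimal sketch under that assumption; `has_bbox`, `clip_obj`, and `crop_objects` are hypothetical helpers written for this example, and nothing in this thread confirms that the merged fix works this way.

```python
BBOX_KEYS = ("x0", "top", "x1", "bottom")


def has_bbox(obj):
    """True if the object dict carries all four bounding-box keys."""
    return all(key in obj for key in BBOX_KEYS)


def clip_obj(obj, bbox):
    """Return a copy of obj clipped to bbox, or None if it lies outside."""
    x0, top = max(obj["x0"], bbox[0]), max(obj["top"], bbox[1])
    x1, bottom = min(obj["x1"], bbox[2]), min(obj["bottom"], bbox[3])
    if x1 <= x0 or bottom <= top:
        return None
    clipped = dict(obj)
    clipped.update(x0=x0, top=top, x1=x1, bottom=bottom)
    return clipped


def crop_objects(objs, bbox):
    """Clip every object that has a bounding box; skip those that do not."""
    clipped = (clip_obj(obj, bbox) for obj in objs if has_bbox(obj))
    return [obj for obj in clipped if obj is not None]


objs = [
    {"x0": 10, "top": 10, "x1": 20, "bottom": 20},  # partly inside the crop box
    {"object_type": "example"},                     # no coordinates (assumed shape)
    {"x0": 50, "top": 50, "x1": 60, "bottom": 60},  # entirely outside
]
result = crop_objects(objs, (0, 0, 15, 15))
print(result)
```

Silently dropping coordinate-less objects is a trade-off: it avoids the exception but can hide data, so logging the skipped objects may be preferable in practice.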