jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License

crop + extract_text() raises KeyError when laparams is not set to None in pdfplumber.open #390

Closed: LiutongZhou closed this issue 2 years ago

LiutongZhou commented 3 years ago

How to reproduce the error

import pdfplumber

assert pdfplumber.__version__ == "0.5.27"

# Jupyter/IPython shell escape; outside a notebook, run the same wget command in a terminal.
!wget https://s22.q4cdn.com/407748750/files/doc_financials/2020/ar/2020-Proxy-Card.pdf -O Some.pdf --no-check-certificate

# Open the PDF with laparams != None
with pdfplumber.open("Some.pdf", laparams={}) as pdf:
    page = pdf.pages[0]
    # Crop to a box that covers the whole page
    box_coordinates = (0, 0, float(page.width), float(page.height))
    crop = page.crop(box_coordinates)
    text = crop.extract_text()  # raises KeyError: 'x0'

python3.8/site-packages/pdfplumber/utils.py in clip_obj(obj, bbox)
    420     bbox = decimalize(bbox)
    421 
--> 422     overlap = get_bbox_overlap(obj_to_bbox(obj), bbox)
    423     if overlap is None:
    424         return None

KeyError: 'x0'

Expected Behavior

Return the text of the page

Working Fix


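# Line pdfplumber/utils.py#L423
def clip_obj(obj, bbox):
    bbox = decimalize(bbox)
    # Only objects that carry coordinates (an "x0" key) can be clipped.
    if "x0" in obj:
        overlap = get_bbox_overlap(obj_to_bbox(obj), bbox)
    else:
        # No bounding box: treat the object as lying outside the crop.
        overlap = None
    if overlap is None:
        return None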

https://github.com/jsvine/pdfplumber/blob/694f9193cc13c3757dbe21af9c817dca32d9d5fc/pdfplumber/utils.py#L423
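To show the idea behind the fix in isolation, here is a minimal, self-contained sketch that works on plain dicts and does not import pdfplumber. The names `BBOX_KEYS`, `bbox_overlap`, and `clip_obj_safe` are hypothetical stand-ins, and the overlap logic is a simplified assumption rather than the library's actual implementation.

```python
# Illustrative sketch only: these helpers are NOT pdfplumber's API.
BBOX_KEYS = ("x0", "top", "x1", "bottom")


def bbox_overlap(a, b):
    """Return the intersection of two (x0, top, x1, bottom) boxes, or None."""
    x0, top = max(a[0], b[0]), max(a[1], b[1])
    x1, bottom = min(a[2], b[2]), min(a[3], b[3])
    if x1 <= x0 or bottom <= top:
        return None  # the boxes do not overlap
    return (x0, top, x1, bottom)


def clip_obj_safe(obj, bbox):
    """Clip a dict-like object to bbox; return None if it has no coordinates."""
    if not all(key in obj for key in BBOX_KEYS):
        return None  # e.g. an object with text but no bounding box
    overlap = bbox_overlap(tuple(obj[key] for key in BBOX_KEYS), bbox)
    if overlap is None:
        return None
    clipped = dict(obj)
    clipped.update(zip(BBOX_KEYS, overlap))
    return clipped


char = {"text": "A", "x0": 10, "top": 10, "x1": 20, "bottom": 22}
anno = {"text": "\n"}  # stand-in for an object without coordinates
crop_box = (0, 0, 15, 100)

clipped_char = clip_obj_safe(char, crop_box)  # right edge trimmed to x1 == 15
clipped_anno = clip_obj_safe(anno, crop_box)  # None instead of a KeyError
```

Checking all four coordinate keys rather than only "x0" is a design choice of this sketch; the fix proposed above tests "x0" alone.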

samkit-jain commented 3 years ago

Hi @LiutongZhou, thanks for raising this issue. I think the issue you are facing is a duplicate of https://github.com/jsvine/pdfplumber/issues/383. @jsvine, should the PDF and the code example shared in this issue be added as a test case in #388?

jsvine commented 3 years ago

Thanks @samkit-jain. In this case, I don't think we need to add a new PDF or test case beyond what's already in #388. The issue isn't really specific to any particular PDF, but just stems from the fact that pdfminer.six's LTAnno objects (extracted when users pass laparams to pdfplumber.open(...)) do not have bounding boxes.
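The failure mode described here can be sketched without any PDF at all. The snippet below is an assumption-laden illustration: plain dicts stand in for extracted objects, and `bbox_of` and `has_bbox` are hypothetical helpers, not pdfplumber or pdfminer.six functions. It shows why direct indexing fails on an object without coordinates, and how such objects could be filtered out before any bounding-box work.

```python
# Illustrative sketch: plain dicts stand in for extracted page objects.
COORD_KEYS = ("x0", "top", "x1", "bottom")


def bbox_of(obj):
    """Read the bounding box by direct indexing, which fails without coordinates."""
    return tuple(obj[key] for key in COORD_KEYS)


def has_bbox(obj):
    """True if the object carries all four coordinate keys."""
    return all(key in obj for key in COORD_KEYS)


objects = [
    {"text": "H", "x0": 72.0, "top": 90.0, "x1": 80.0, "bottom": 102.0},
    {"text": " "},  # annotation-like object: text only, no bounding box
]

try:
    bbox_of(objects[1])
    missing_key = None
except KeyError as exc:
    missing_key = exc.args[0]  # the first coordinate key that was looked up

# Keep only objects that can take part in cropping.
positioned = [obj for obj in objects if has_bbox(obj)]
```

Because the lookup starts with "x0", that is the key reported in the KeyError, which matches the traceback in the original report.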