jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License
6.57k stars, 659 forks

text extraction fails when cropping page to page height and width for certain PDFs #421

Closed sreeni5493 closed 2 years ago

sreeni5493 commented 3 years ago

v1.pdf

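import pdfplumber

pdf_path = "v1.pdf"
with pdfplumber.open(pdf_path) as pdf:
    pages = pdf.pages
    for page in pages:
        # Crop to the nominal page size; this raises ValueError when the page origin is shifted
        page = page.crop((0, 0, page.width, page.height))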

ValueError: Bounding box (Decimal('0'), Decimal('0'), Decimal('1437.123'), Decimal('1483.917')) is not fully within parent page bounding box (Decimal('-37.996'), Decimal('-169.832'), Decimal('1399.127'), Decimal('1314.085'))

With this PDF, the code above raises the error shown. Is there a simple way to fix this? Also, if the origin is shifted by a MediaBox, CropBox, or something similar, won't the library fail to work? I need to crop because some PDFs contain text at negative coordinates that I do not want, so I need to keep only the text in the visible area of the page.

jsvine commented 3 years ago

Thanks for flagging this. I'll look into ways of handling this. In the meantime, you should be able to handle your particular use-case this way:

page = page.crop((0, 0, page.bbox[2], page.bbox[3]))
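
The workaround above works because the requested box stays inside the page's own bbox. A more general version of the same idea is to intersect the requested box with the parent box before cropping. The following is only a sketch: `clamp_bbox` is a hypothetical helper, not part of pdfplumber, and it assumes boxes are `(x0, top, x1, bottom)` tuples like the ones in the error message.

```python
def clamp_bbox(requested, parent):
    """Return the intersection of two (x0, top, x1, bottom) boxes.

    Hypothetical helper for illustration; not a pdfplumber function.
    """
    rx0, rtop, rx1, rbottom = requested
    px0, ptop, px1, pbottom = parent
    # Keep the larger of the two lower bounds and the smaller of the upper bounds
    x0, top = max(rx0, px0), max(rtop, ptop)
    x1, bottom = min(rx1, px1), min(rbottom, pbottom)
    if x0 >= x1 or top >= bottom:
        raise ValueError("boxes do not overlap")
    return (x0, top, x1, bottom)


# Values taken from the ValueError in the report, as plain floats
requested = (0, 0, 1437.123, 1483.917)
parent = (-37.996, -169.832, 1399.127, 1314.085)
clamped = clamp_bbox(requested, parent)
```

With the numbers from the report, `clamped` equals `(0, 0, page.bbox[2], page.bbox[3])`, which is the box used in the one-line workaround. It could then be passed to `page.crop(clamped)`.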

mlecauchois commented 3 years ago

Hello! I am also affected by this issue. Are there any updates on how to deal with shifted origins?

sreeni5493 commented 3 years ago

@mlecauchois @samkit-jain @jsvine

I found that the solution for this was removing the descent in the pdfminer.six layout.py code:

https://github.com/pdfminer/pdfminer.six/blob/develop/pdfminer/layout.py

On lines 306 and 307, remove descent.

Previously:

bbox_lower_left = (0, descent + rise)

bbox_upper_right = (self.adv, descent + rise + fontsize)

Now:

bbox_lower_left = (0, rise)

bbox_upper_right = (self.adv, rise + fontsize)

I am not sure why this works, though. My best guess is that the descent is somehow not calculated properly.

Also wondering how Ghostscript fixes this.

I am testing across all my use cases to see whether the descent is needed at all.
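
To make the effect of the change above concrete, here is a small arithmetic sketch of the two bounding-box formulas quoted in this comment. `glyph_vertical_extent` is a hypothetical function written for illustration, not pdfminer.six code, and the numbers are made up. It assumes the descent is negative, which is common for fonts but may not hold for the font in this PDF.

```python
def glyph_vertical_extent(fontsize, rise=0.0, descent=0.0):
    """Return (lower, upper) y-values of a glyph box before any transform.

    Mirrors the quoted expressions: lower = descent + rise,
    upper = descent + rise + fontsize. Hypothetical helper.
    """
    lower = descent + rise
    upper = descent + rise + fontsize
    return (lower, upper)


# Made-up values: a 10pt font with an assumed descent of -2.5
with_descent = glyph_vertical_extent(10.0, descent=-2.5)
# Passing descent=0.0 corresponds to the modified formulas
without_descent = glyph_vertical_extent(10.0)
```

Under these assumptions the box height is the same in both cases, but with a negative descent the whole box sits lower. That may move characters near an edge across a crop boundary, though this sketch does not establish that it is the cause here.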

jsvine commented 2 years ago

You can now pass strict=False to Page.crop(...) to allow passing a bounding box that extends beyond the page's bbox: https://github.com/jsvine/pdfplumber/releases/tag/v0.7.4
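
Applied to the original report, the `strict=False` option might be used as in the sketch below. It assumes pdfplumber 0.7.4 or later and a local file named `v1.pdf`; `crop_to_nominal_size` and `extract_visible_text` are hypothetical helper names, not part of the library.

```python
def crop_to_nominal_size(page, strict=False):
    """Crop `page` to (0, 0, page.width, page.height).

    With strict=False, pdfplumber (>= 0.7.4) accepts a box that
    extends beyond the page's bbox. Hypothetical helper.
    """
    bbox = (0, 0, page.width, page.height)
    return page.crop(bbox, strict=strict)


def extract_visible_text(pdf_path):
    """Return the extracted text of each cropped page in the PDF."""
    import pdfplumber  # imported here so the helper above has no hard dependency

    texts = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            cropped = crop_to_nominal_size(page)
            texts.append(cropped.extract_text())
    return texts
```

Calling `extract_visible_text("v1.pdf")` would then run the same loop as the original report with the non-strict crop.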