jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License

Wrong extraction of nested cropped page with relative flag #914

Closed SS-035 closed 1 year ago

SS-035 commented 1 year ago

Describe the bug

When extracting text from a page that has been cropped more than once, with the nested crop using relative=True, the returned string is always empty. In the following snippet, which tries to extract Column 2, text1 returns the correct result, but text2, produced with relative cropping, is empty.

Code to reproduce the problem

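import pdfplumber

pdf = pdfplumber.open('Lorem.pdf')
page = pdf.pages[0]
# Crop to the right half of the page (Column 2)
cropped = page.crop((page.width / 2, 0, page.width, page.height))
# Absolute coordinates: the bbox is measured from the original page's origin
crop1 = cropped.crop((page.width / 2, 0, page.width, cropped.height), relative=False)
text1 = crop1.extract_text()      # returns correct result
# Relative coordinates: the bbox is measured from the cropped page's top-left corner
crop2 = cropped.crop((0, 0, cropped.width, cropped.height), relative=True)
text2 = crop2.extract_text()      # returns ''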

PDF file

Lorem.pdf image

Expected behavior

text2 should contain the same correctly extracted Column 2 text as text1.
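
To spell out why the two crops are expected to match, here is a minimal sketch of the coordinate arithmetic, written in plain Python without pdfplumber. It is not pdfplumber's actual implementation: the helper `relative_to_absolute` and the 600 x 800 page size are hypothetical, chosen only for illustration. It assumes a bbox is `(x0, top, x1, bottom)` and that a relative bbox is measured from the top-left corner of the already cropped page.

```python
from typing import Tuple

BBox = Tuple[float, float, float, float]  # (x0, top, x1, bottom)


def relative_to_absolute(bbox: BBox, parent_bbox: BBox) -> BBox:
    """Shift a bbox measured from the parent's top-left corner
    into the coordinate system of the original page.

    Hypothetical helper for illustration; not part of pdfplumber.
    """
    x0, top, x1, bottom = bbox
    parent_x0, parent_top, _, _ = parent_bbox
    return (x0 + parent_x0, top + parent_top, x1 + parent_x0, bottom + parent_top)


# Assumed page size, for illustration only.
page_width, page_height = 600.0, 800.0

# First crop: the right half of the page (Column 2).
cropped_bbox = (page_width / 2, 0.0, page_width, page_height)
cropped_width = cropped_bbox[2] - cropped_bbox[0]
cropped_height = cropped_bbox[3] - cropped_bbox[1]

# The two nested crop requests from the report.
absolute_request = (page_width / 2, 0.0, page_width, cropped_height)
relative_request = (0.0, 0.0, cropped_width, cropped_height)

# Once the relative request is shifted by the parent's origin,
# it covers the same region as the absolute request.
resolved = relative_to_absolute(relative_request, cropped_bbox)
print(resolved == absolute_request)
```

Under these assumptions both requests describe the same region, so crop1 and crop2 should yield the same text.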

Actual behavior

text2 is an empty string.

Environment

Additional context

Thanks for continuously maintaining this wonderful package.

jsvine commented 1 year ago

Thanks for flagging this, @SS-035. I agree, this seems to be a bug. I’ll look into it and will report back.

jsvine commented 1 year ago

Fix now available in v0.10.0. Feel free to reopen this issue if the problem persists. Thanks again for flagging, @SS-035 👍