Wrong extraction of nested cropped page with relative flag

SS-035 commented 1 year ago

Describe the bug

When extracting text from a page that has been cropped more than once using the relative parameter, the returned string is always empty. In the following snippet, which tries to extract Column 2, text1 returns the correct result, but text2, produced with relative cropping, is empty.

Code to reproduce the problem

import pdfplumber

pdf = pdfplumber.open('Lorem.pdf')
page = pdf.pages[0]
# First crop: right half of the page (Column 2)
cropped = page.crop((page.width / 2, 0, page.width, page.height))
# Second crop, given in absolute page coordinates
crop1 = cropped.crop((page.width / 2, 0, page.width, cropped.height), relative=False)
text1 = crop1.extract_text()      # returns correct result
# Second crop, given in coordinates relative to the first crop
crop2 = cropped.crop((0, 0, cropped.width, cropped.height), relative=True)
text2 = crop2.extract_text()      # returns ''

PDF file

Lorem.pdf

Expected behavior

text2 should contain the correctly extracted Column 2 text, just like text1.

Actual behavior

text2 is an empty string.

Environment

pdfplumber version: 0.9.0
Python version: 3.10.10
OS: Ubuntu 22.04

Additional context

Both crop1 and crop2 have the exact same bounding box.
This used to work as intended in an older pdfplumber version (tested on v0.5.28).
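
To illustrate why the two bounding boxes should match, here is a minimal sketch of how a relative box maps to page coordinates. It assumes that relative coordinates are offsets from the parent crop's top-left corner. `to_absolute_bbox` is a hypothetical helper written for this illustration, not part of pdfplumber, and the 612 x 792 page size is only a stand-in for Lorem.pdf.

```python
from typing import Tuple

BBox = Tuple[float, float, float, float]  # (x0, top, x1, bottom)


def to_absolute_bbox(relative_bbox: BBox, parent_bbox: BBox) -> BBox:
    """Shift a bbox expressed relative to the parent's top-left corner
    into page coordinates (hypothetical helper, not a pdfplumber API)."""
    parent_x0, parent_top, _, _ = parent_bbox
    x0, top, x1, bottom = relative_bbox
    return (x0 + parent_x0, top + parent_top, x1 + parent_x0, bottom + parent_top)


# Assumed page size (US Letter, in points); Lorem.pdf may differ.
page_width, page_height = 612.0, 792.0

# First crop: right half of the page.
cropped_bbox = (page_width / 2, 0.0, page_width, page_height)
cropped_width = cropped_bbox[2] - cropped_bbox[0]
cropped_height = cropped_bbox[3] - cropped_bbox[1]

# Box passed with relative=False (already in page coordinates).
crop1_bbox = (page_width / 2, 0.0, page_width, cropped_height)

# Box passed with relative=True, converted to page coordinates.
crop2_bbox = to_absolute_bbox((0.0, 0.0, cropped_width, cropped_height), cropped_bbox)

print(crop1_bbox == crop2_bbox)
```

Under these assumptions both boxes come out as the same tuple, so the two crops should cover the same region. On affected versions, passing the absolute box with relative=False (as crop1 does) avoids the problem.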

Thanks for continuously maintaining this wonderful package.

jsvine commented 1 year ago

Thanks for flagging this, @SS-035. I agree, this seems to be a bug. I’ll look into it and will report back.

jsvine commented 1 year ago

Fix now available in v0.10.0. Feel free to reopen this issue if the problem persists. Thanks again for flagging, @SS-035 👍

jsvine / pdfplumber