jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License

Inconsistent results when cropping an already cropped page #245

Closed samkit-jain closed 4 years ago

samkit-jain commented 4 years ago

Describe the bug

When cropping an already cropped page, the objects are not preserved.

Code to reproduce the problem

import pdfplumber

# Make sure the PDF has been downloaded and saved as file.pdf
pdf = pdfplumber.open("file.pdf")
page = pdf.pages[0]

# Crop the page to keep only the bottom 20%, then save it as an image.
bottom = page.crop((0, 0.8 * float(page.height), page.width, page.height))
im = bottom.to_image(resolution=150)
im.save("bottom.png", format="PNG")

# Now, crop and save the left half of the cropped page.
bottom_left = bottom.crop((0, 0, 0.5 * float(bottom.width), bottom.height))
im = bottom_left.to_image(resolution=150)
im.save("bottom_left.png", format="PNG")

# Now, crop and save the right half of the cropped page.
bottom_right = bottom.crop((0.5 * float(bottom.width), 0, bottom.width, bottom.height))
im = bottom_right.to_image(resolution=150)
im.save("bottom_right.png", format="PNG")

PDF file

examples/pdfs/ag-energy-round-up-2017-02-24.pdf

Expected behavior

Actual behavior

Screenshots

bottom.png

bottom_left.png

bottom_right.png

Environment

Additional context

The issue was found when working on #244

jsvine commented 4 years ago

Thanks for flagging this, @samkit-jain! I think this is what's happening:

I can see a handful of potential solutions:

Thoughts?

samkit-jain commented 4 years ago

> pdfplumber does not adjust coordinates after a crop (this is intentional, but open to discussion).

What is the reasoning behind this?

> (d) Make it so that pdfplumber automatically adjusts all coordinates (not just of the page's bbox, but of all extracted objects as well) when cropping.

If by this you mean that the cropped page would be treated as a "real" page, so that operations like extract_text, extract_words, etc. can be run on it as if it were the parent page, then yes, I'd prefer option (d).

jsvine commented 4 years ago

> > pdfplumber does not adjust coordinates after a crop (this is intentional, but open to discussion).

> What is the reasoning behind this?

Great question; I should have elaborated earlier. The answer: Mostly for simplicity. A page's bounding box should, I think, exist in the same coordinate system as each object on the page. So changing the coordinates of the bounding box would mean changing the coordinates of all objects in the resulting cropped page. To me, ensuring that all coordinates relevant to all objects were moved post-crop seemed a tricky task.

To take an oversimplified example: Let's say a page's original bounding box was 0, 0, 20, 20, and we have a single point 10, 10. If we crop the page to 10, 10, 20, 20 ...
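As an illustration only (not code from pdfplumber), the two conventions under discussion can be sketched with the numbers from the example above. The function names `keep_page_coords` and `shift_to_crop_origin` are hypothetical and exist only for this sketch; the second one corresponds to what option (d) would imply.

```python
from typing import Tuple

BBox = Tuple[float, float, float, float]  # (x0, top, x1, bottom)
Point = Tuple[float, float]               # (x, y)


def keep_page_coords(point: Point, crop_bbox: BBox) -> Point:
    """Hypothetical: cropping leaves object coordinates in the original page's system."""
    return point


def shift_to_crop_origin(point: Point, crop_bbox: BBox) -> Point:
    """Hypothetical: re-express the point relative to the crop's top-left corner."""
    x0, top, _, _ = crop_bbox
    return (point[0] - x0, point[1] - top)


crop = (10, 10, 20, 20)   # crop box from the example above
point = (10, 10)          # the single point on the page

print(keep_page_coords(point, crop))      # (10, 10)
print(shift_to_crop_origin(point, crop))  # (0, 0)
```

Under the first convention the crop's bounding box and the point stay in one shared coordinate system; under the second, every object on the page would need the same shift applied.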

The more I think about this, the more I prefer the following combination of improvements:

samkit-jain commented 4 years ago

Thanks for the explanation. +1 for the warning. Adding a new relative parameter would also be the most backwards-compatible option.
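A minimal sketch of the idea behind a relative option plus a warning, written without pdfplumber so it can run on its own. `to_absolute_bbox` is a hypothetical helper, not part of pdfplumber's API, and the page size (600 x 800) is an assumption chosen for easy arithmetic. It translates a box given relative to a cropped page into the original page's coordinates, and warns when the result falls outside the parent box.

```python
import warnings
from typing import Tuple

BBox = Tuple[float, float, float, float]  # (x0, top, x1, bottom)


def to_absolute_bbox(relative_bbox: BBox, parent_bbox: BBox) -> BBox:
    """Hypothetical helper: offset a crop-relative box by the parent's top-left corner."""
    rx0, rtop, rx1, rbottom = relative_bbox
    px0, ptop, px1, pbottom = parent_bbox
    absolute = (px0 + rx0, ptop + rtop, px0 + rx1, ptop + rbottom)
    ax0, atop, ax1, abottom = absolute
    # Warn if the translated box is not fully inside the parent box.
    if ax0 < px0 or atop < ptop or ax1 > px1 or abottom > pbottom:
        warnings.warn(f"bbox {absolute} is not fully within parent bbox {parent_bbox}")
    return absolute


# Assumed numbers: a 600 x 800 page cropped to its bottom 20%.
bottom_bbox = (0, 640, 600, 800)

# Left and right halves, expressed relative to the cropped page (160 tall).
left_half = to_absolute_bbox((0, 0, 300, 160), bottom_bbox)
right_half = to_absolute_bbox((300, 0, 600, 160), bottom_bbox)

print(left_half)   # (0, 640, 300, 800)
print(right_half)  # (300, 640, 600, 800)
```

With pdfplumber, the resulting tuple could then be passed to `bottom.crop(...)`, assuming `bottom.bbox` holds the cropped page's box in the original page's coordinates.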

jsvine commented 4 years ago

Closing now, as resolved in two commits referenced above and available in v0.5.23. Thanks again for raising the issue!