Inconsistent results when cropping an already cropped page

samkit-jain commented 4 years ago

Describe the bug

When cropping an already cropped page, the objects are not preserved.

Code to reproduce the problem

import pdfplumber

# Make sure the file is downloaded at file.pdf
pdf = pdfplumber.open("file.pdf")
page = pdf.pages[0]

# Crop the page, keeping only the bottom 20%, and save it.
bottom = page.crop((0, 0.8 * float(page.height), page.width, page.height))
im = bottom.to_image(resolution=150)
im.save("bottom.png", format="PNG")

# Now, crop and save the left half of the cropped page.
bottom_left = bottom.crop((0, 0, 0.5 * float(bottom.width), bottom.height))
im = bottom_left.to_image(resolution=150)
im.save("bottom_left.png", format="PNG")

# Now, crop and save the right half of the cropped page.
bottom_right = bottom.crop((0.5 * float(bottom.width), 0, bottom.width, bottom.height))
im = bottom_right.to_image(resolution=150)
im.save("bottom_right.png", format="PNG")

PDF file

examples/pdfs/ag-energy-round-up-2017-02-24.pdf

Expected behavior

bottom.png - The bottom portion of the page is saved.
bottom_left.png - The left half of the bottom portion of the page is saved.
bottom_right.png - The right half of the bottom portion of the page is saved.

Actual behavior

bottom.png - The bottom portion of the page is saved.
bottom_left.png - The left half of the top portion of the page is saved.
bottom_right.png - The right half of the top portion of the page is saved.

Screenshots

bottom.png

bottom_left.png

bottom_right.png

Environment

pdfplumber version: 0.5.22
Python version: 3.8.2
OS: Ubuntu 18.04 LTS

Additional context

The issue was found when working on #244

jsvine commented 4 years ago

Thanks for flagging this, @samkit-jain! I think this is what's happening:

pdfplumber does not adjust coordinates after a crop (this is intentional, but open to discussion).
The second and third crop commands in the example assume (understandably) that the coordinates have been adjusted after the initial crop.
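As a workaround under the current behavior, the left and right halves can be computed directly in the original page's coordinate system. Below is a minimal sketch in plain Python; `split_left_right` is a hypothetical helper (not part of pdfplumber), and the 600 x 800 page size is made up for illustration.

```python
from typing import Tuple

# (x0, top, x1, bottom), all expressed in the ORIGINAL page's coordinates.
BBox = Tuple[float, float, float, float]


def split_left_right(bbox: BBox) -> Tuple[BBox, BBox]:
    """Split a bbox into left and right halves without changing coordinate system."""
    x0, top, x1, bottom = bbox
    mid = x0 + (x1 - x0) / 2
    return (x0, top, mid, bottom), (mid, top, x1, bottom)


# Hypothetical numbers: a 600 x 800 page cropped to its bottom 20%.
bottom_bbox = (0, 640, 600, 800)
left_bbox, right_bbox = split_left_right(bottom_bbox)
print(left_bbox)
print(right_bbox)
```

Assuming the cropped page exposes its bounding box as `bottom.bbox`, the two results could then be passed to `bottom.crop(...)` in place of the zero-based boxes used in the original example.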

I can see a handful of potential solutions:

(a) Do nothing, but communicate better to users that .crop's bbox should be in terms of the original PDF, not the crop.
(b) Add a parameter to .crop, such as relative = True, that would let users indicate that they're providing relative coordinates, not absolute ones.
(c) Change the default, so that .crop assumes a relative-position bbox, but provide a parameter (e.g., relative = False) that reverts to the original approach.
(d) Make it so that pdfplumber automatically adjusts all coordinates (not just of the page's bbox, but of all extracted objects as well) when cropping.

Thoughts?

samkit-jain commented 4 years ago

> pdfplumber does not adjust coordinates after a crop (this is intentional, but open to discussion).

What is the reasoning behind this?

> (d) Make it so that pdfplumber automatically adjusts all coordinates (not just of the page's bbox, but of all extracted objects as well) when cropping.

If by this you mean that the cropped page would be treated as a "real" page and all the operations like extract_text, extract_words, etc can be run on it as if it was the parent page, then yes, I'd prefer option D.

jsvine commented 4 years ago

> > pdfplumber does not adjust coordinates after a crop (this is intentional, but open to discussion).
>
> What is the reasoning behind this?

Great question; I should have elaborated earlier. The answer: Mostly for simplicity. A page's bounding box should, I think, exist in the same coordinate system as each object on the page. So changing the coordinates of the bounding box would mean changing the coordinates of all objects in the resulting cropped page. To me, ensuring that all coordinates relevant to all objects were moved post-crop seemed a tricky task.

To take an oversimplified example: Let's say a page's original bounding box was 0, 0, 20, 20, and we have a single point 10, 10. If we crop the page to 10, 10, 20, 20 ...

Currently: the cropped page's bounding box would be 10, 10, 20, 20 and the point would remain at 10, 10
If we readjusted all coordinates post-crop: the cropped page's new bounding box would be 0, 0, 10, 10 and the point would need to be moved to 0, 0.
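The two behaviors in this toy example can be sketched in a few lines of plain Python. This is only an illustration of the coordinate bookkeeping, not pdfplumber's actual cropping logic; both function names are hypothetical, and treating the bbox edges as inclusive is an assumption made so that the point at 10, 10 survives the crop.

```python
def crop_keep_coords(points, crop_bbox):
    """Current approach: filter points, leave every coordinate untouched."""
    x0, top, x1, bottom = crop_bbox
    kept = [(x, y) for (x, y) in points if x0 <= x <= x1 and top <= y <= bottom]
    return crop_bbox, kept


def crop_shift_coords(points, crop_bbox):
    """Alternative: filter points, then move the origin to the crop's top-left."""
    bbox, kept = crop_keep_coords(points, crop_bbox)
    x0, top, x1, bottom = bbox
    new_bbox = (0, 0, x1 - x0, bottom - top)
    shifted = [(x - x0, y - top) for (x, y) in kept]
    return new_bbox, shifted


points = [(10, 10)]
crop_bbox = (10, 10, 20, 20)

print(crop_keep_coords(points, crop_bbox))   # bbox and point unchanged
print(crop_shift_coords(points, crop_bbox))  # bbox and point re-based at the origin
```

The second function has to touch every coordinate of every kept object, which is the extra work (and risk of missing an attribute) described above.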

The more I think about this, the more I prefer the following combination of improvements:

Keeping the current cropping system as-is — i.e., it does not alter the coordinate system
Throwing an error (or warning) if a user tries to crop part of a cropped page (or any page) that is not fully within the page's bounding box
Adding a relative parameter to both .crop and .within_bbox that allows users to pass a relative bounding box instead of using the absolute-coordinate system. E.g., to crop the example I mentioned above, to get the bottom half of the cropped page, they could call either cropped_page.crop((10, 15, 20, 20)) or cropped_page.crop((0, 5, 10, 10), relative = True) — the two would be equivalent.
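For concreteness, here is a rough sketch of how the last two ideas could fit together: translating a relative bbox into absolute coordinates, and rejecting a bbox that is not fully inside the parent. Both helpers are hypothetical and written in plain Python; they are not the code that ended up in pdfplumber.

```python
def relative_to_absolute(parent_bbox, rel_bbox):
    """Translate a bbox given relative to the parent's top-left corner."""
    px0, ptop, _, _ = parent_bbox
    rx0, rtop, rx1, rbottom = rel_bbox
    return (px0 + rx0, ptop + rtop, px0 + rx1, ptop + rbottom)


def ensure_within(parent_bbox, bbox):
    """Raise ValueError unless bbox lies fully inside parent_bbox."""
    px0, ptop, px1, pbottom = parent_bbox
    x0, top, x1, bottom = bbox
    if not (px0 <= x0 and ptop <= top and x1 <= px1 and bottom <= pbottom):
        raise ValueError(f"bbox {bbox} is not fully within parent bbox {parent_bbox}")
    return bbox


cropped_bbox = (10, 10, 20, 20)

# Bottom half of the cropped page, given in relative coordinates.
absolute = relative_to_absolute(cropped_bbox, (0, 5, 10, 10))
print(ensure_within(cropped_bbox, absolute))

# The mistake from the original report: a zero-based bbox used as if absolute.
try:
    ensure_within(cropped_bbox, (0, 0, 5, 10))
except ValueError as err:
    print("rejected:", err)
```

With a check like this, the second and third crops in the original report would fail loudly instead of silently returning the wrong region.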

samkit-jain commented 4 years ago

Thanks for the explanation. +1 for the warning. Adding a new relative parameter would also be the most backwards-compatible option.

jsvine commented 4 years ago

Closing now, as resolved in two commits referenced above and available in v0.5.23. Thanks again for raising the issue!
