Closed samkit-jain closed 4 years ago
Thanks for flagging this, @samkit-jain! I think this is what's happening:
pdfplumber does not adjust coordinates after a crop (this is intentional, but open to discussion).
The second and third crop commands in the example assume (understandably) that the coordinates have been adjusted after the initial crop.
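To make the mismatch concrete, here is a minimal sketch of the boxes the second and third crops would need in absolute (original-page) coordinates. It assumes the page's bounding box starts at 0, 0 and that boxes are (x0, top, x1, bottom) tuples measured from the top-left corner; the helper name bottom_half_boxes and the 600 x 800 page size are made up for illustration and are not part of pdfplumber.

```python
def bottom_half_boxes(width, height):
    """Return absolute (x0, top, x1, bottom) boxes for the bottom half
    of a page and for the left and right halves of that bottom half.

    Assumes the page's own bounding box is (0, 0, width, height).
    """
    mid_x = width / 2
    mid_y = height / 2
    bottom = (0, mid_y, width, height)
    # Both sub-boxes keep top = mid_y: they are expressed in the
    # original page's coordinates, not relative to the first crop.
    bottom_left = (0, mid_y, mid_x, height)
    bottom_right = (mid_x, mid_y, width, height)
    return bottom, bottom_left, bottom_right


# Hypothetical 600 x 800 page, purely for illustration.
bottom, bottom_left, bottom_right = bottom_half_boxes(600, 800)
print(bottom_left)  # (0, 400.0, 300.0, 800)
```

A box such as (0, 0, 300, 400), which would be the bottom-left quarter if coordinates restarted at the crop's corner, instead describes the top-left quarter of the original page.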
I can see a handful of potential solutions:
(a) Do nothing, but communicate better to users that .crop's bbox should be in terms of the original PDF, not the crop.
(b) Add a parameter to .crop, such as relative = True, that would let users indicate that they're providing relative coordinates, not absolute ones.
(c) Change the default, so that .crop assumes a relative-position bbox, but provide a parameter (e.g., relative = False) that reverts to the original approach.
(d) Make it so that pdfplumber automatically adjusts all coordinates (not just of the page's bbox, but of all extracted objects as well) when cropping.
Thoughts?
pdfplumber does not adjust coordinates after a crop (this is intentional, but open to discussion).
What is the reasoning behind this?
(d) Make it so that pdfplumber automatically adjusts all coordinates (not just of the page's bbox, but of all extracted objects as well) when cropping.
If by this you mean that the cropped page would be treated as a "real" page and all the operations like extract_text, extract_words, etc. can be run on it as if it were the parent page, then yes, I'd prefer option D.
pdfplumber does not adjust coordinates after a crop (this is intentional, but open to discussion).
What is the reasoning behind this?
Great question; I should have elaborated earlier. The answer: Mostly for simplicity. A page's bounding box should, I think, exist in the same coordinate system as each object on the page. So changing the coordinates of the bounding box would mean changing the coordinates of all objects in the resulting cropped page. To me, ensuring that all coordinates relevant to all objects were moved post-crop seemed a tricky task.
To take an oversimplified example: Let's say a page's original bounding box was 0, 0, 20, 20, and we have a single point 10, 10. If we crop the page to 10, 10, 20, 20...
- Currently: the cropped page's bounding box would be 10, 10, 20, 20 and the point would remain at 10, 10.
- If we readjusted all coordinates post-crop: the cropped page's new bounding box would be 0, 0, 10, 10 and the point would need to be moved to 0, 0.
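The arithmetic in that example can be sketched in a few lines. This is only an illustration of the readjusted alternative, not how pdfplumber behaves; the function name shift_to_crop_origin is hypothetical.

```python
def shift_to_crop_origin(crop_bbox, point):
    """Re-express a crop box and a point relative to the crop's top-left corner.

    crop_bbox is (x0, top, x1, bottom); point is (x, y).
    Hypothetical helper, shown only to illustrate the readjusted option.
    """
    x0, top, x1, bottom = crop_bbox
    new_bbox = (0, 0, x1 - x0, bottom - top)
    new_point = (point[0] - x0, point[1] - top)
    return new_bbox, new_point


crop_bbox = (10, 10, 20, 20)
point = (10, 10)

# Current behaviour: nothing moves, so the box and point stay as given.
print(crop_bbox, point)  # (10, 10, 20, 20) (10, 10)

# Readjusted behaviour: everything shifts by the crop's top-left corner.
print(shift_to_crop_origin(crop_bbox, point))  # ((0, 0, 10, 10), (0, 0))
```

With a single point the shift is trivial; the difficulty described above is that the same shift would have to be applied consistently to every coordinate of every object on the cropped page.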
The more I think about this, the more I prefer the following combination of improvements:
- Keeping the current cropping system as-is, i.e., it does not alter the coordinate system.
- Throwing an error (or warning) if a user tries to crop part of a cropped page (or any page) that is not fully within the page's bounding box.
- Adding a relative parameter to both .crop and .within_bbox that allows users to pass a relative bounding box instead of using the absolute-coordinate system. E.g., to crop the example I mentioned above, to get the bottom half of the cropped page, they could call either cropped_page.crop((10, 15, 20, 20)) or cropped_page.crop((0, 5, 10, 10), relative = True); the two would be equivalent.
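A rough sketch of how the relative-to-absolute conversion and the bounds check could fit together is below. The helper names to_absolute and check_within are hypothetical and are not pdfplumber's implementation; boxes are assumed to be (x0, top, x1, bottom) tuples.

```python
def to_absolute(relative_bbox, parent_bbox):
    """Translate a bbox given relative to parent_bbox's top-left corner
    into the absolute coordinate system."""
    parent_x0, parent_top, _, _ = parent_bbox
    x0, top, x1, bottom = relative_bbox
    return (x0 + parent_x0, top + parent_top, x1 + parent_x0, bottom + parent_top)


def check_within(bbox, parent_bbox):
    """Raise ValueError if bbox is not fully inside parent_bbox."""
    x0, top, x1, bottom = bbox
    parent_x0, parent_top, parent_x1, parent_bottom = parent_bbox
    if x0 < parent_x0 or top < parent_top or x1 > parent_x1 or bottom > parent_bottom:
        raise ValueError(
            f"Bounding box {bbox} is not fully within parent bounding box {parent_bbox}"
        )
    return bbox


cropped_page_bbox = (10, 10, 20, 20)

# Bottom half of the cropped page, given in relative coordinates.
absolute = to_absolute((0, 5, 10, 10), cropped_page_bbox)
print(absolute)  # (10, 15, 20, 20)

# The converted box lies inside the cropped page, so this passes.
check_within(absolute, cropped_page_bbox)
```

Under this sketch, a box such as (0, 0, 5, 5) passed as absolute coordinates to a page cropped to (10, 10, 20, 20) would fail the check, which is the situation the proposed error or warning is meant to surface.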
Thanks for the explanation. +1 for the warning. Also, adding a new relative parameter would be the most backwards-compatible option.
Closing now, as resolved in two commits referenced above and available in v0.5.23. Thanks again for raising the issue!
Describe the bug
When cropping an already cropped page, the objects are not preserved.
Code to reproduce the problem
PDF file
examples/pdfs/ag-energy-round-up-2017-02-24.pdf
Expected behavior
bottom.png - The bottom portion of the page is saved.
bottom_left.png - The left half of the bottom portion of the page is saved.
bottom_right.png - The right half of the bottom portion of the page is saved.
Actual behavior
bottom.png - The bottom portion of the page is saved.
bottom_left.png - The left half of the top portion of the page is saved.
bottom_right.png - The right half of the top portion of the page is saved.
Screenshots
bottom.png
bottom_left.png
bottom_right.png
Environment
Additional context
The issue was found when working on #244