Closed dhdaines closed 4 months ago
All modified and coverable lines are covered by tests :white_check_mark:
Comparison is base (3e74fb1) 100.00% compared to head (efeb080) 100.00%. Report is 1 commits behind head on develop.
I haven't played around with this yet, but it seems like a reasonable idea and doesn't interfere with core pdfplumber
functionality, so I'm inclined to merge. Is it ready to merge?
Is there a general method to properly transform PDF BBoxes into pdfplumber ones for a page?
If I'm understanding correctly, this question pertains to flipping the vertical coordinates, so that (x0, y0, x1, y1) (i.e., bbox with origin at the bottom-left) becomes (x0, top, x1, bottom) (origin at top-left). Is that right? If so: We just calculate the top and bottom attributes once, on parsing:
... and then use (x0, top, x1, bottom) as the standard bbox throughout.
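The vertical flip described above can be sketched roughly as follows. This is only an illustration, not pdfplumber's actual parsing code: `flip_bbox` and `page_height` are hypothetical names, and the sketch assumes the page's coordinate space starts at (0, 0) in the bottom-left corner.

```python
from typing import Tuple

BBox = Tuple[float, float, float, float]


def flip_bbox(pdf_bbox: BBox, page_height: float) -> BBox:
    """Convert (x0, y0, x1, y1), origin bottom-left, to (x0, top, x1, bottom), origin top-left.

    Hypothetical helper: assumes the page origin is at (0, 0).
    """
    x0, y0, x1, y1 = pdf_bbox
    top = page_height - y1     # upper edge, measured down from the top of the page
    bottom = page_height - y0  # lower edge, measured down from the top of the page
    return (x0, top, x1, bottom)


# A box 100 pt tall, sitting 72 pt above the bottom of a 792 pt high page.
print(flip_bbox((72.0, 72.0, 300.0, 172.0), page_height=792.0))
# → (72.0, 620.0, 300.0, 720.0)
```

Note that the horizontal coordinates pass through unchanged; only the vertical axis is inverted.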
I think maybe I'll add a companion / convenience method find to get just the first instance of an element, and at least minimally handle the issue below (it's a bit more complicated; basically, I can use the example code above as a unit test).
The issue is a bit more complicated because when you crop a page, all of the object coordinates go through the crop_fn, which adjusts them. So far, so good, but structure tree elements can have a BBox attribute specified on them which is not attached to any particular object. In element_bbox we will prefer this if it exists, but then we will also have to transform it, not only flipping the vertical coordinates, but also applying crop_fn.
I can probably hack something up so this minimally works but we might want to refactor it at some point.
Should be ready to merge now!
I didn't realize that cropping a page doesn't actually translate the coordinates of the objects, it just clips them to the new bounding box - nonetheless, this didn't work right for structure elements with BBox attributes, and now it does.
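To illustrate the clipping behaviour mentioned here, a minimal sketch, assuming both boxes are already in (x0, top, x1, bottom) form with the origin at the top-left. `clip_bbox` is a hypothetical helper for illustration, not the crop_fn used in pdfplumber.

```python
from typing import Optional, Tuple

BBox = Tuple[float, float, float, float]


def clip_bbox(bbox: BBox, crop_box: BBox) -> Optional[BBox]:
    """Clip bbox to crop_box without translating it.

    Both boxes are (x0, top, x1, bottom) in the original page's coordinates.
    Returns None when the two boxes do not overlap.
    """
    x0, top, x1, bottom = bbox
    cx0, ctop, cx1, cbottom = crop_box
    clipped = (max(x0, cx0), max(top, ctop), min(x1, cx1), min(bottom, cbottom))
    if clipped[0] >= clipped[2] or clipped[1] >= clipped[3]:
        return None  # nothing left inside the crop region
    return clipped


# The element extends past the right and bottom edges of the crop region.
print(clip_bbox((50.0, 100.0, 400.0, 300.0), (0.0, 0.0, 200.0, 200.0)))
# → (50.0, 100.0, 200.0, 200.0)
```

The result stays in the original page's coordinate system: the crop region's origin is not subtracted, which matches the "clip, don't translate" behaviour described above.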
Thanks, now merged! (And correct re. the non-translation of coordinates.)
It's useful to be able to search in the structure tree - this has to be done from the PDFStructTree object itself, since we return a dictionary from structure_tree in keeping with the general way of pdfplumber.
It's also useful to get a BBox from an element for visual debugging - note the FIXME: if you play games with cropped pages, this will fail, but in general that's unlikely; you would have to do something like:
and then try to get the BBox of an element where it is explicitly specified in the attributes of that element (usually this is only the case for Figure and Table).
Is there a general method to properly transform PDF BBoxes into pdfplumber ones for a page?
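For readers unfamiliar with this kind of search, here is a rough sketch of a depth-first "find all" / "find first" over a nested structure. It is not the implementation in this PR (which searches from the PDFStructTree object): it walks a hand-written list of dictionaries, and the key names ("type", "children", "attributes", "mcids") and the sample values are assumptions made for illustration.

```python
from typing import Any, Dict, Iterator, List, Optional

Element = Dict[str, Any]


def find_all(elements: List[Element], element_type: str) -> Iterator[Element]:
    """Yield every element of the given type, depth-first, in document order."""
    for el in elements:
        if el.get("type") == element_type:
            yield el
        # Recurse into children; leaf elements simply have none.
        yield from find_all(el.get("children", []), element_type)


def find(elements: List[Element], element_type: str) -> Optional[Element]:
    """Return the first matching element, or None if there is no match."""
    return next(find_all(elements, element_type), None)


# Hypothetical tree, written by hand; not output from a real PDF.
tree = [
    {"type": "Document", "children": [
        {"type": "H1", "mcids": [0]},
        {"type": "Table", "attributes": {"BBox": [72.0, 400.0, 540.0, 700.0]}, "children": [
            {"type": "TR", "children": [{"type": "TD", "mcids": [1]}]},
        ]},
        {"type": "Figure", "attributes": {"BBox": [72.0, 100.0, 300.0, 350.0]}},
    ]},
]

table = find(tree, "Table")
print(table["attributes"]["BBox"])
# → [72.0, 400.0, 540.0, 700.0]
```

A BBox found this way is still in PDF coordinates (origin at the bottom-left), so it would need the vertical flip discussed in this thread before being drawn on a page image.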