jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License

Some utility methods for logical structure #1095

Closed dhdaines closed 4 months ago

dhdaines commented 4 months ago

It's useful to be able to search in the structure tree - this has to be done from the PDFStructTree object itself since we return a dictionary from structure_tree in keeping with the general way of pdfplumber.
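To make the idea concrete, here is a minimal sketch of searching the dictionary form of a structure tree. It is not the implementation in this PR: the `find_all` helper and the sample `tree` are hypothetical, and the sketch assumes each element is a dict with a `type` key and an optional `children` list.

```python
from typing import Any, Dict, Iterator, List

Element = Dict[str, Any]


def find_all(elements: List[Element], type_name: str) -> Iterator[Element]:
    """Yield every element of the given type, depth-first, in document order."""
    for el in elements:
        if el.get("type") == type_name:
            yield el
        # Recurse into the children, if the element has any.
        yield from find_all(el.get("children", []), type_name)


# Hypothetical tree for illustration only (assumed shape, not real output).
tree = [
    {"type": "Document", "children": [
        {"type": "H1", "mcids": [0]},
        {"type": "Table",
         "attributes": {"BBox": [50.0, 100.0, 300.0, 400.0]},
         "children": [{"type": "TR", "children": [{"type": "TD", "mcids": [1]}]}]},
        {"type": "Figure", "attributes": {"BBox": [60.0, 420.0, 200.0, 500.0]}},
    ]},
]

tables = list(find_all(tree, "Table"))
```

A "first match" variant would simply be `next(find_all(tree, "Table"), None)`.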

It's also useful to get a BBox from an element for visual debugging. Note the FIXME: if you play games with cropped pages, this will fail. In general that's unlikely, since you would have to do something like:

import pdfplumber
from pdfplumber.structure import PDFStructTree

pdf = pdfplumber.open(pdffile)
page = pdf.pages[0].crop(some_bbox)
stree = PDFStructTree(pdf, page)  # NO! Don't do this! Why would you do this?

and then try to get the BBox of an element where it is explicitly specified in the attributes of that element (usually this is only the case for Figure and Table).

Is there a general method to properly transform PDF BBoxes into pdfplumber ones for a page?

codecov[bot] commented 4 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Comparison is base (3e74fb1) 100.00% compared to head (efeb080) 100.00%. Report is 1 commit behind head on develop.

Additional details and impacted files

```diff
@@            Coverage Diff            @@
##           develop     #1095   +/-   ##
=========================================
  Coverage   100.00%   100.00%
=========================================
  Files           19        19
  Lines         1928      1996   +68
=========================================
+ Hits          1928      1996   +68
```


jsvine commented 4 months ago

I haven't played around with this yet, but it seems like a reasonable idea and doesn't interfere with core pdfplumber functionality, so I'm inclined to merge. Is it ready to merge?

> Is there a general method to properly transform PDF BBoxes into pdfplumber ones for a page?

If I'm understanding correctly, this question pertains to flipping the vertical coordinates, so that (x0, y0, x1, y1) (i.e., a bbox with origin at the bottom-left) becomes (x0, top, x1, bottom) (origin at top-left). Is that right? If so: we just calculate the top and bottom attributes once, on parsing:

https://github.com/jsvine/pdfplumber/blob/1ad3905612d6c9ac9b285332850da23a0a96f0ba/pdfplumber/page.py#L399-L402

... and then use (x0, top, x1, bottom) as the standard bbox throughout.
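As a rough sketch of that flip (not the code at the link above): the `flip_bbox` helper below is hypothetical, and it assumes the page's coordinate origin is at (0, 0) and that the page height is known. The 792-point height in the example is just an assumed US Letter page.

```python
from typing import Tuple

BBox = Tuple[float, float, float, float]


def flip_bbox(pdf_bbox: BBox, page_height: float) -> BBox:
    """Convert (x0, y0, x1, y1), origin bottom-left, to
    (x0, top, x1, bottom), origin top-left."""
    x0, y0, x1, y1 = pdf_bbox
    top = page_height - y1      # the upper edge (y1) becomes the distance from the page top
    bottom = page_height - y0   # the lower edge (y0) becomes the larger value
    return (x0, top, x1, bottom)


# Assumed example: a box on a 792-point-tall page.
flipped = flip_bbox((50.0, 100.0, 300.0, 400.0), page_height=792.0)
```

The horizontal coordinates pass through unchanged; only the vertical pair is mirrored and swapped.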

dhdaines commented 4 months ago

> I haven't played around with this yet, but it seems like a reasonable idea and doesn't interfere with core pdfplumber functionality, so I'm inclined to merge. Is it ready to merge?

I think maybe I'll add a companion convenience method, find, to get just the first instance of an element, and at least minimally handle the issue below. (It's a bit more complicated; basically I can use the example code above as a unit test.)

> Is there a general method to properly transform PDF BBoxes into pdfplumber ones for a page?

> If I'm understanding correctly, this question pertains to flipping the vertical coordinates, so that (x0, y0, x1, y1) (i.e., a bbox with origin at the bottom-left) becomes (x0, top, x1, bottom) (origin at top-left). Is that right? If so: we just calculate the top and bottom attributes once, on parsing:

The issue is a bit more complicated because when you crop a page, all of the object coordinates go through the crop_fn which adjusts them. So far, so good, but structure tree elements can have a BBox attribute specified on them which is not attached to any particular object. In element_bbox we will prefer this if it exists, but then we will also have to transform it, not only flipping the vertical coordinates, but also applying crop_fn.

I can probably hack something up so this minimally works but we might want to refactor it at some point.

dhdaines commented 4 months ago

Should be ready to merge now!

I didn't realize that cropping a page doesn't actually translate the coordinates of the objects, it just clips them to the new bounding box - nonetheless, this didn't work right for structure elements with BBox attributes, and now it does.
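To illustrate the difference between clipping and translating, here is a small sketch. It is not pdfplumber's crop_fn: `clip_bbox` is a hypothetical helper, and it assumes both boxes are already in (x0, top, x1, bottom) form with the origin at the top-left.

```python
from typing import Optional, Tuple

BBox = Tuple[float, float, float, float]


def clip_bbox(bbox: BBox, crop: BBox) -> Optional[BBox]:
    """Clip bbox to the crop box without shifting its coordinates.

    Returns None when the two boxes do not overlap.
    """
    x0, top, x1, bottom = bbox
    cx0, ctop, cx1, cbottom = crop
    clipped = (max(x0, cx0), max(top, ctop), min(x1, cx1), min(bottom, cbottom))
    # An empty or inverted box means there is no overlap at all.
    if clipped[0] >= clipped[2] or clipped[1] >= clipped[3]:
        return None
    return clipped


# Assumed example values: an element BBox and two crop boxes.
inside = clip_bbox((50.0, 392.0, 300.0, 692.0), crop=(0.0, 0.0, 200.0, 500.0))
outside = clip_bbox((50.0, 392.0, 300.0, 692.0), crop=(400.0, 0.0, 600.0, 300.0))
```

Note that the surviving edges keep their original page coordinates; nothing is re-based to the crop box's corner.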

jsvine commented 4 months ago

Thanks, now merged! (And correct re. the non-translation of coordinates.)