Closed Pk13055 closed 7 months ago
Thanks for raising this, @Pk13055. I agree: adding documentation for .annots would be helpful. Is this something you'd be interested in contributing?
To answer your immediate question, Rect comes from the pdfminer.six representation of the annotation and provides the x0, y0, x1, y1 coordinates. y0 and y1 are measured from the bottom-left of the page; pdfplumber gives you the measured-from-top-left versions as top and bottom.
You can see the implementation here: https://github.com/jsvine/pdfplumber/blob/d9561d15ccc7858446a925b8040f7c69e0bdf5ec/pdfplumber/page.py#L240-L279
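For anyone landing here later, the relationship between the two coordinate systems can be sketched in a few lines. This is only an illustration, not pdfplumber's actual code: the helper name `to_top_left` is hypothetical, and it assumes the page's bounding box starts at (0, 0) and that `page_height` is the page height in PDF points.

```python
def to_top_left(rect, page_height):
    """Convert a PDF-style rect (origin at bottom-left) to top-left-based keys.

    Hypothetical helper for illustration; assumes the page origin is (0, 0).
    """
    x0, y0, x1, y1 = rect
    return {
        "x0": x0,
        "x1": x1,
        # y1 is the upper edge when measured from the bottom,
        # so its distance from the top of the page is height - y1.
        "top": page_height - y1,
        # y0 is the lower edge, so it becomes the larger "bottom" value.
        "bottom": page_height - y0,
    }


converted = to_top_left((10, 700, 110, 750), page_height=792)
print(converted)
```

If the page's bounding box does not start at (0, 0), the simple subtraction above no longer holds as written, which may be one source of the offsets discussed in this thread.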
@jsvine Unfortunately, I don't have the time to tend to this, but my colleague @noelcj9 will be more than happy to take it up, given he's been spending the most time putting it through its paces. However, he is new to open source, so if you can give him a jumpstart on what info to add and where, that'll be great!
EDIT: After hours of playing around with the various props, I found out that the offset is caused by the origin offset of the page's bounding box itself.
    import pdfplumber

    pdf = pdfplumber.open('/path/to/pdf')
    table_counter = {}  # number of annotations seen so far, keyed by page number
    # box_offset = 2
    for annot in pdf.annots:
        pg_number = annot['page_number']
        if pg_number in table_counter:
            table_counter[pg_number] += 1
        else:
            table_counter[pg_number] = 1
        page = pdf.pages[pg_number - 1]  # page_number is 1-based
        px0, py0, px1, py1 = page.bbox
        # Compensate for the page's vertical origin offset (found by trial and error)
        bbox = [annot['x0'], annot['top'] + 2 * py0, annot['x1'], annot['bottom'] + 2 * py0]
        xt, yt, xb, yb = bbox
        # bbox = [xt + box_offset, yt + box_offset, xb - box_offset, yb - box_offset]
        roi = page.crop(bbox, relative=False, strict=False)
        roi.to_image(resolution=500, antialias=True).save(f"p{pg_number}_t{table_counter[pg_number]}.png")
In some PDFs, when I try to crop using bbox, i.e. [annot['x0'], annot['top'], annot['x1'], annot['bottom']], it returns a slightly skewed result (when viewing with roi.to_image(resolution=500, antialias=True).save('temp.png')). The EXACT same process, i.e.:
roi = page.crop(bbox)
# OR
roi = page.within_bbox(bbox)
results in perfect cropping in another PDF. I'm trying to understand what relative=True and/or strict=True do, so as to remove the offset.
For example:
(PS - Might be related to #1049)
@Pk13055 Did you intend to close this?
To answer your other question: relative=True means that the supplied bbox will be interpreted as an offset from the current page origin-point (this has no effect when cropping a standard page, but does when cropping a CroppedPage object). And strict=True throws an error if the user specifies a bbox outside of the page object's bbox.
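To make the two flags concrete, here is a rough sketch of the semantics described above. It is not pdfplumber's implementation: `resolve_crop_bbox` is a hypothetical helper, bboxes are assumed to be `(x0, top, x1, bottom)` tuples, and what happens to an out-of-bounds bbox when `strict=False` is deliberately left as "returned unchanged", since the comment above only says that `strict=True` raises.

```python
def resolve_crop_bbox(bbox, parent_bbox, relative=False, strict=True):
    """Illustrative only: turn a requested crop bbox into absolute coordinates.

    bbox, parent_bbox: (x0, top, x1, bottom) tuples (assumed layout).
    relative: treat bbox as an offset from the parent's top-left corner.
    strict: raise if the resulting bbox is not fully inside parent_bbox.
    """
    x0, top, x1, bottom = bbox
    px0, ptop, px1, pbottom = parent_bbox

    if relative:
        # Shift by the parent's origin; a no-op when the parent starts at (0, 0).
        x0, x1 = x0 + px0, x1 + px0
        top, bottom = top + ptop, bottom + ptop

    inside = px0 <= x0 and ptop <= top and x1 <= px1 and bottom <= pbottom
    if strict and not inside:
        raise ValueError(
            f"bbox {(x0, top, x1, bottom)} is not fully within parent bbox {parent_bbox}"
        )
    return (x0, top, x1, bottom)


# A cropped region whose origin is at (100, 200) in page coordinates.
cropped_region = (100, 200, 400, 600)
print(resolve_crop_bbox((10, 10, 50, 50), cropped_region, relative=True))
```

With a parent that starts at (0, 0), as a standard page usually does, `relative=True` and `relative=False` give the same result in this sketch, which matches the "no effect for a standard page" remark.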
Hi, yes, I closed it because the documentation for the other objects, viz. .rects and .lines, is essentially the same. The rest I figured out from pdfminer.six.
In the documentation, the .annots property is missing. Specifically, what do the 4 float values of Rect represent?