jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License

[Documentation] `annot` properties #1044

Closed Pk13055 closed 7 months ago

Pk13055 commented 7 months ago

In the documentation, the .annots property is missing. Specifically, what do the 4 float values of Rect represent?

jsvine commented 7 months ago

Thanks for raising this, @Pk13055. I agree: Adding documentation for .annots would be helpful. Is this something you'd be interested in contributing?

To answer your immediate question, Rect comes from the pdfminer.six representation of the annotation and provides the x0, y0, x1, y1 coordinates. y0 and y1 are measured from the bottom-left of the page; pdfplumber gives you the measured-from-top-left versions as top and bottom.

You can see the implementation here: https://github.com/jsvine/pdfplumber/blob/d9561d15ccc7858446a925b8040f7c69e0bdf5ec/pdfplumber/page.py#L240-L279
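
As a rough illustration of the coordinate flip described above, here is a minimal sketch. It assumes a page whose origin sits at (0, 0), and the helper name to_top_based is hypothetical; this is not pdfplumber's actual code, which is at the link above.

```python
def to_top_based(rect, page_height):
    """Convert a PDF-style (x0, y0, x1, y1) box, measured from the bottom-left,
    into (x0, top, x1, bottom), measured from the top-left.

    Hypothetical helper for illustration; assumes the page origin is (0, 0).
    """
    x0, y0, x1, y1 = rect
    top = page_height - y1      # upper edge, as a distance down from the page top
    bottom = page_height - y0   # lower edge, as a distance down from the page top
    return (x0, top, x1, bottom)


# Made-up annotation Rect on a 612 x 792 point page.
print(to_top_based((72.0, 700.0, 200.0, 720.0), page_height=792.0))
# (72.0, 72.0, 200.0, 92.0)
```

Note that x0 and x1 are unchanged; only the vertical axis is flipped, so the larger y value (y1) becomes the smaller top value.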

Pk13055 commented 7 months ago

@jsvine Unfortunately, I don't have the time to tend to this, but my colleague @noelcj9 will be more than happy to take it up, given he's been spending the most time putting it through its paces. However, he is new to open source, so if you can give him a jumpstart on what/where to add the info, that'll be great!

Pk13055 commented 7 months ago

EDIT: After hours of playing around with the various props, I found out that the offset is caused by the original offset of the page's bounding box itself.

import pdfplumber

pdf = pdfplumber.open('/path/to/pdf')
table_counter = {}  # number of annotations seen so far on each page
# box_offset = 2
for annot in pdf.annots:
    pg_number = annot['page_number']
    if pg_number in table_counter:
        table_counter[pg_number] += 1
    else:
        table_counter[pg_number] = 1
    page = pdf.pages[pg_number - 1]  # page numbers are 1-based
    # py0 is the top edge of the page's own bounding box
    px0, py0, px1, py1 = page.bbox
    bbox = [annot['x0'], annot['top'] + 2 * py0, annot['x1'], annot['bottom'] + 2 * py0]
    xt, yt, xb, yb = bbox
    # bbox = [xt + box_offset, yt + box_offset, xb - box_offset, yb - box_offset]
    roi = page.crop(bbox, relative=False, strict=False)
    roi.to_image(resolution=500, antialias=True).save(f"p{pg_number}_t{table_counter[pg_number]}.png")

In some PDFs, when I try to crop using bbox, i.e. [annot['x0'], annot['top'], annot['x1'], annot['bottom']], it returns a slightly skewed result (when viewing with roi.to_image(resolution=500, antialias=True).save('temp.png')). The EXACT same process, i.e.:

roi = page.crop(bbox)
# OR
roi = page.within_bbox(bbox)

results in perfect cropping in another PDF. I'm trying to understand what relative=True and/or strict=True do, so that I can remove the offset.

For example: p7_t1

(PS - Might be related to #1049)

jsvine commented 7 months ago

@Pk13055 Did you intend to close this?

To answer your other question: relative=True means that the supplied bbox will be interpreted as an offset from the current page's origin point (this has no effect when cropping a standard page, but does when cropping a CroppedPage object). And strict=True throws an error if the user specifies a bbox outside of the page object's bbox.
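
To make those two flags concrete, here is a small pure-Python sketch of the semantics as described above. The helper resolve_crop_bbox is hypothetical and is not part of pdfplumber; in particular, whatever pdfplumber does with an out-of-bounds box when strict=False is not modeled here.

```python
def resolve_crop_bbox(bbox, parent_bbox, relative=False, strict=True):
    """Hypothetical helper: return the absolute (x0, top, x1, bottom) a crop would use."""
    x0, top, x1, bottom = bbox
    p_x0, p_top, p_x1, p_bottom = parent_bbox

    if relative:
        # Treat the supplied values as offsets from the parent's top-left corner.
        x0, x1 = x0 + p_x0, x1 + p_x0
        top, bottom = top + p_top, bottom + p_top

    inside = p_x0 <= x0 and p_top <= top and x1 <= p_x1 and bottom <= p_bottom
    if strict and not inside:
        raise ValueError(
            f"bbox {(x0, top, x1, bottom)} is not fully within {parent_bbox}"
        )

    # Assumption: with strict=False the box is simply returned unchanged.
    return (x0, top, x1, bottom)


full_page = (0, 0, 612, 792)      # a standard page whose origin is (0, 0)
cropped = (100, 100, 400, 500)    # bbox of an already-cropped region

print(resolve_crop_bbox((10, 10, 50, 50), full_page, relative=True))
# (10, 10, 50, 50)
print(resolve_crop_bbox((10, 10, 50, 50), cropped, relative=True))
# (110, 110, 150, 150)
```

On the full page the relative box is unchanged because the parent origin is (0, 0); on the cropped region it is shifted by the region's top-left corner.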

Pk13055 commented 7 months ago

Hi, yes, I closed it because the documentation for the other objects, viz. .rects and .lines, is essentially the same. The rest I figured out from pdfminer.six.