jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License

Page crop bounding box should be relative and not absolute #1199

Closed johnathanchiu closed 1 week ago

johnathanchiu commented 1 week ago

Describe the bug

Currently, the bounding box on a CroppedPage object is not relative to the original page but absolute to the crop itself. This doesn't make sense, especially since the object coordinates on the cropped page are still relative to the parent page. These should be consistent: either make both the CroppedPage bounding box and the object coordinates relative to the parent page, or make both absolute to the crop.

Code to reproduce the problem

import pdfplumber

with pdfplumber.open("background-checks.pdf") as pdf:
    page = pdf.pages[0]
    cropped_page = page.within_bbox((0, 37.0, page.width, 72.0))

    cropped_page.to_image().show()

    print("Crop bounding box:", cropped_page.bbox)

    for obj in cropped_page.objects["char"]:
        print("Object Values:", obj)
        break

PDF file

https://github.com/jsvine/pdfplumber/blob/stable/examples/pdfs/background-checks.pdf

Actual behavior

In the printed output, the object's y0 and y1 values fall far outside the crop's 37.0 to 72.0 range.

Crop bounding box: (0, 37.0, 1008, 72.0)

Object Values: {'matrix': (15.12, 0.0, 0.0, 15.12, 465.94, 555.12), 'fontname': 'WEVZII+ArialMT', 'adv': 0.722, 'upright': True, 'x0': 465.94, 'y0': 551.91456, 'x1': 476.85663999999997, 'y1': 567.03456, 'width': 10.916639999999973, 'height': 15.120000000000005, 'size': 15.120000000000005, 'mcid': None, 'tag': None, 'object_type': 'char', 'page_number': 1, 'ncs': 'ICCBased', 'text': 'N', 'stroking_color': None, 'stroking_pattern': None, 'non_stroking_color': (1, 0, 0), 'non_stroking_pattern': None, 'top': 44.965439999999944, 'bottom': 60.08543999999995, 'doctop': 44.965439999999944}

This change of coordinates doesn't make sense if I want to use the crop's bounding box without the parent page.
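One possible reading of the output above, sketched with only the numbers printed in this issue: `y0`/`y1` appear to be measured from the bottom edge of the page, while `top`/`bottom` (and the crop bounding box) are measured from the top edge. The page height used here is not printed anywhere in the issue; it is an assumption inferred from `y1 + top`, and the helpers `flip_y` and `inside` are hypothetical names, not pdfplumber functions.

```python
# Values copied from the char object printed in this issue.
char = {
    "x0": 465.94,
    "x1": 476.85663999999997,
    "y0": 551.91456,
    "y1": 567.03456,
    "top": 44.965439999999944,
    "bottom": 60.08543999999995,
}
crop_bbox = (0, 37.0, 1008, 72.0)  # (x0, top, x1, bottom)

# Assumption: the page height is inferred from the data, not read from the PDF.
page_height = char["y1"] + char["top"]


def flip_y(y, height):
    """Convert a distance from the bottom edge into a distance from the top edge."""
    return height - y


def inside(obj, bbox):
    """Check containment using the top-origin keys (top/bottom), not y0/y1."""
    x0, top, x1, bottom = bbox
    return (
        x0 <= obj["x0"]
        and obj["x1"] <= x1
        and top <= obj["top"]
        and obj["bottom"] <= bottom
    )


# Flipping y0 should reproduce the printed "bottom" value.
flipped_bottom = flip_y(char["y0"], page_height)
print(page_height, flipped_bottom, inside(char, crop_bbox))
```

Under that reading, the character does sit inside the crop once `top`/`bottom` are compared against the bounding box instead of `y0`/`y1`.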

Environment

johnathanchiu commented 1 week ago

I think the coordinate system here is really confusing. I tried cropping a piece of the page and got this error:

ValueError: (0, 755.0, 612, 720.0) has a negative width or height.

The documentation says: "Returns a version of the page cropped to the bounding box, which should be expressed as 4-tuple with the values (x0, top, x1, bottom)." Is top not actually top?

As per the documentation, this should be fine.
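A minimal sketch of why that tuple may be rejected, assuming `top` and `bottom` are both distances from the top edge of the page: with `top = 755.0` and `bottom = 720.0`, the box has negative height. If the values were meant as distances from the bottom edge, they could be flipped first. The helper name `to_top_based_bbox` is hypothetical, and the page height of 792 is an assumption (the issue only shows a width of 612).

```python
def to_top_based_bbox(bbox, page_height):
    """Convert (x0, y_upper, x1, y_lower), measured from the bottom edge,
    into (x0, top, x1, bottom), measured from the top edge."""
    x0, y_upper, x1, y_lower = bbox
    return (x0, page_height - y_upper, x1, page_height - y_lower)


# Assumption: a 612 x 792 page; the height is not stated in the issue.
PAGE_HEIGHT = 792.0

original = (0, 755.0, 612, 720.0)
converted = to_top_based_bbox(original, PAGE_HEIGHT)

# Height as a top-origin box would compute it: bottom - top.
original_height = original[3] - original[1]    # negative, so the box is invalid
converted_height = converted[3] - converted[1]  # positive after flipping
print(converted, original_height, converted_height)
```

In real code the height would come from `page.height` instead of a constant, and the converted tuple would be the one passed to the crop call.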

johnathanchiu commented 1 week ago

I have another example of how weird this coordinate system is. Run the following:

import pdfplumber

with pdfplumber.open("background-checks.pdf") as pdf:
    page = pdf.pages[0]
    im = page.to_image()
    im2 = page.to_image()
    for obj in page.objects["char"]:
        # correct orientation: flip y0/y1 so they are measured from the top of the page
        im.draw_rect([obj["x0"], page.height - obj["y1"], obj["x1"], page.height - obj["y0"]])
        # incorrect orientation: y0/y1 passed through unchanged
        im2.draw_rect([obj["x0"], obj["y0"], obj["x1"], obj["y1"]])

    im.show()
    im2.show()

im shows the rectangles in the correct orientation, whereas im2 doesn't.

johnathanchiu commented 1 week ago

Upon further experimentation, I found that the objects in the PDF actually have top and bottom keys. This means the right way of doing this is:

import pdfplumber

with pdfplumber.open("background-checks.pdf") as pdf:
    page = pdf.pages[0]
    im = page.to_image()
    for obj in page.objects["char"]:
        im.draw_rect([obj["x0"], obj["top"], obj["x1"], obj["bottom"]])

    im.show()