jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License

Offset in bounding boxes for every pdf passed #1049

Closed: yashsandansing closed this issue 9 months ago

yashsandansing commented 10 months ago

Describe the bug

I passed several PDFs to im.draw_rects(first_page.extract_text_lines()), and the drawn boxes are offset from the text, in a different direction for each PDF.

Have you tried repairing the PDF?

Yes, I've tried repairing the PDFs with gs as well as with pdfplumber, but it made little to no difference.

Code to reproduce the problem

import pdfplumber

# Write a repaired copy of the PDF to outfile, then open that copy
pdfplumber.repair("/content/iebe102.pdf", outfile="/content/repaired2.pdf")
pdf = pdfplumber.open("/content/repaired2.pdf")
first_page = pdf.pages[0]
im = first_page.to_image()

# Run either one of the following
# im.outline_words()
im.draw_rects(first_page.extract_text_lines())

PDF file

iebe102.pdf iemh101.pdf

Expected behavior

The boxes should be drawn around the text they belong to.

Actual behavior

The boxes are offset in multiple directions, like this:

[Screenshot 2023-11-28 172104]
[Screenshot 2023-11-28 171441]

Additional context

I've tried resizing the PDF and setting the resolution, height, width, etc., but nothing seems to work.

Pk13055 commented 10 months ago

Dropping this here in case it helps you. I noticed that in certain PDFs the page's actual bbox is offset, so adjusting the coordinates as follows helps crop/extract the right region:

import pdfplumber

pdf = pdfplumber.open('/path/to/pdf')
table_counter = {}  # number of annotations seen so far on each page
# box_offset = 2
for annot in pdf.annots:
    pg_number = annot['page_number']
    table_counter[pg_number] = table_counter.get(pg_number, 0) + 1
    page = pdf.pages[pg_number - 1]
    px0, py0, px1, py1 = page.bbox
    # Shift the annotation vertically by twice the page's top offset
    bbox = [annot['x0'], annot['top'] + 2 * py0, annot['x1'], annot['bottom'] + 2 * py0]
    xt, yt, xb, yb = bbox
    # bbox = [xt + box_offset, yt + box_offset, xb - box_offset, yb - box_offset]
    roi = page.crop(bbox, relative=False, strict=False)
    roi.to_image(resolution=500, antialias=True).save(f"p{pg_number}_t{table_counter[pg_number]}.png")
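
For anyone who wants to check whether a page is affected before applying a correction like the one above, here is a minimal sketch of the arithmetic. It only assumes that a page bbox is an (x0, top, x1, bottom) tuple, as page.bbox is unpacked above. The helper names has_offset_origin and to_page_relative are hypothetical and not part of pdfplumber, and the sample numbers are made up rather than taken from the attached PDFs. The direction and size of the shift that a given PDF actually needs may differ.

```python
from typing import Dict, Tuple

BBox = Tuple[float, float, float, float]  # (x0, top, x1, bottom)


def has_offset_origin(page_bbox: BBox, tol: float = 1e-6) -> bool:
    """Return True if the page's bounding box does not start at (0, 0)."""
    x0, top, _, _ = page_bbox
    return abs(x0) > tol or abs(top) > tol


def to_page_relative(obj: Dict[str, float], page_bbox: BBox) -> Dict[str, float]:
    """Re-express an object's box relative to the page's top-left corner."""
    px0, ptop, _, _ = page_bbox
    return {
        "x0": obj["x0"] - px0,
        "top": obj["top"] - ptop,
        "x1": obj["x1"] - px0,
        "bottom": obj["bottom"] - ptop,
    }


# Hypothetical values, not measured from iebe102.pdf or iemh101.pdf.
page_bbox = (10.0, 20.0, 605.0, 862.0)
word = {"x0": 82.0, "top": 120.0, "x1": 140.0, "bottom": 132.0}

print(has_offset_origin(page_bbox))
print(to_page_relative(word, page_bbox))
```

With pdfplumber, has_offset_origin(page.bbox) would flag pages whose bbox does not start at the origin, which seems to be the kind of page where the drawn boxes drift.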

jsvine commented 9 months ago

Hi @yashsandansing, and thanks for flagging. With https://github.com/jsvine/pdfplumber/commit/07d9997ee587723c32e2178be65eea584102bf58 (now available in develop), this should be fixed. The images below show page.to_image().outline_words():

image

image