Closed yashsandansing closed 9 months ago
Dropping this here in case it helps you. I noticed that certain PDFs have an offset relative to the page's actual bbox, so adjusting the coordinates as follows helps crop/extract the right region:
import pdfplumber

pdf = pdfplumber.open('/path/to/pdf')
table_counter = {}  # number of annotations seen so far on each page
# box_offset = 2
for annot in pdf.annots:
    pg_number = annot['page_number']
    if pg_number in table_counter:
        table_counter[pg_number] += 1
    else:
        table_counter[pg_number] = 1
    page = pdf.pages[pg_number - 1]  # page_number is 1-based
    px0, py0, px1, py1 = page.bbox
    # Shift the vertical coordinates by twice the page bbox's y-origin
    bbox = [annot['x0'], annot['top'] + 2 * py0, annot['x1'], annot['bottom'] + 2 * py0]
    xt, yt, xb, yb = bbox
    # bbox = [xt + box_offset, yt + box_offset, xb - box_offset, yb - box_offset]
    roi = page.crop(bbox, relative=False, strict=False)
    # One image per annotation, named by page number and per-page counter
    roi.to_image(resolution=500, antialias=True).save(f"p{pg_number}_t{table_counter[pg_number]}.png")
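For reuse, the same adjustment can be pulled into a small helper. This is only a sketch: adjust_annot_bbox and its shrink parameter are names invented here, not part of pdfplumber, and the 2 * py0 shift is the empirical workaround from the snippet above rather than a documented rule. The example numbers are made up.

```python
def adjust_annot_bbox(annot, page_bbox, shrink=0):
    """Return (x0, top, x1, bottom) for cropping, shifted by the page's y-origin.

    Mirrors the arithmetic in the snippet above: both vertical coordinates are
    moved by 2 * py0, where py0 is the second value of page.bbox.
    `shrink` is a hypothetical inset, like the commented-out box_offset above.
    """
    _px0, py0, _px1, _py1 = page_bbox
    x0 = annot["x0"] + shrink
    top = annot["top"] + 2 * py0 + shrink
    x1 = annot["x1"] - shrink
    bottom = annot["bottom"] + 2 * py0 - shrink
    return (x0, top, x1, bottom)


# Made-up numbers, for illustration only: a page whose bbox starts at y = 10.
example_annot = {"x0": 50, "top": 100, "x1": 200, "bottom": 150}
example_page_bbox = (0, 10, 612, 802)
print(adjust_annot_bbox(example_annot, example_page_bbox))
# (50, 120, 200, 170)
```

The returned tuple can be passed to page.crop(..., relative=False, strict=False) in place of the inline list. When the page bbox starts at y = 0, the helper leaves the annotation coordinates unchanged.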
Hi @yashsandansing, and thanks for flagging. With https://github.com/jsvine/pdfplumber/commit/07d9997ee587723c32e2178be65eea584102bf58 (now available in develop), this should now be fixed (showing page.to_image().outline_words()).
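A rough way to check a given file against the fix is to draw the boxes again and inspect the resulting images. The sketch below assumes a pdfplumber build that includes the commit above; the function name render_box_checks, its parameters, and the output file names are made up for illustration.

```python
def render_box_checks(pdf_path, out_prefix="check", resolution=150):
    """Save two debug images of page 1: word outlines and text-line rectangles."""
    # Imported here so the helper can be defined even where pdfplumber is absent.
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        first_page = pdf.pages[0]

        # Outline every extracted word, as in the comment above.
        words_im = first_page.to_image(resolution=resolution)
        words_im.outline_words()
        words_im.save(f"{out_prefix}_words.png")

        # Rectangles around text lines, the call used in the bug report.
        lines_im = first_page.to_image(resolution=resolution)
        lines_im.draw_rects(first_page.extract_text_lines())
        lines_im.save(f"{out_prefix}_lines.png")

    return [f"{out_prefix}_words.png", f"{out_prefix}_lines.png"]
```

Calling render_box_checks("iebe102.pdf") would write check_words.png and check_lines.png to the working directory; if the boxes sit on top of the text in both images, the offset is gone for that file.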
Describe the bug
I passed PDFs to the im.draw_rects(first_page.extract_text_lines()) function, and I'm getting an offset in different directions for each PDF I have passed.
Have you tried repairing the PDF?
Yes, I've tried repairing the PDFs with gs as well as with pdfplumber, but it made little to no difference.
Code to reproduce the problem
PDF file
iebe102.pdf iemh101.pdf
Expected behavior
The boxes should have been drawn around their corresponding text.
Actual behavior
Offset in multiple directions like this:
Screenshots
Environment
Additional context
I've tried resizing the PDF and setting the resolution, height, width, etc., but nothing seems to work.