jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License

debug_tablefinder is weirdly offset #1078

px-xp closed this issue 8 months ago

px-xp commented 8 months ago

Describe the bug

When I run debug_tablefinder on a PDF, the debug overlay is drawn at an offset from the page content.

Have you tried repairing the PDF?

Yes

Code to reproduce the problem

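import pdfplumber

# Open the PDF that shows the problem
pdf: pdfplumber.PDF = pdfplumber.open("./buggy.pdf")

p0 = pdf.pages[0]
im = p0.to_image(resolution=150)

# Overlay the table finder's detected lines and cells on the page image
im.reset().debug_tablefinder({
    "vertical_strategy": "lines",
    "horizontal_strategy": "lines"
})
print(len(p0.find_tables()))  # number of tables found on the page

# Outline every character so the overlay can be compared with the text
im.draw_rects(p0.chars)
im.show()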

PDF file

buggy.pdf

Expected behavior

Debug view should line up correctly.

Actual behavior

Debug view is offset.

Screenshots

image

I would expect the overlay to line up with the page, as it does for other PDFs, but it doesn't.

jsvine commented 8 months ago

Hi @px-xp, thank you for the clear bug report. Luckily, a fix for this has already been made (https://github.com/jsvine/pdfplumber/commit/07d9997ee587723c32e2178be65eea584102bf58) and is available on the develop branch. To install that version before it's part of the next release, you can run:

pip install -U git+https://github.com/jsvine/pdfplumber@develop
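
After installing from the develop branch, it can be useful to confirm which pdfplumber version Python actually picks up. The following is a minimal sketch that uses only the standard library's importlib.metadata; installed_version is a hypothetical helper name chosen for illustration, not part of pdfplumber, and the version string reported for a git install depends on what the branch declares.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string of `package`, or None if it is absent.

    Hypothetical helper for illustration; it only queries package metadata
    and does not import the package itself.
    """
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Prints the pdfplumber version visible to this interpreter,
    # or None if pdfplumber is not installed in this environment.
    print(installed_version("pdfplumber"))
```

If this prints None, the pip command probably ran against a different Python environment than the one running the script.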

Here is the result of im.reset().debug_tablefinder():

image

px-xp commented 8 months ago

Thank you @jsvine. This is a really cool tool!