jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License

Too many edges detected #779

Closed shangrilar closed 1 year ago

shangrilar commented 1 year ago

Describe the bug

Hi, thanks for your amazing project.

I tried to extract tables from PDF files, but for some files far too many edges are detected on a page (more than 100,000).

Code to reproduce the problem

def curves_to_edges(cs):
    edges = []
    for c in cs:
        edges += pdfplumber.utils.rect_to_edges(c)
    return edges

lines = curves_to_edges(page.curves + page.edges)

PDF file

arti_error-3.pdf

Expected behavior

One thick line should be detected as a single line.

Actual behavior

One thick line is detected as many separate lines.

jsvine commented 1 year ago

Thanks for your interest in this library, @shangrilar. The detection of lines and rects (the source of .edges) is performed by pdfminer.six, pdfplumber's main dependency. Either way, both libraries are just reflecting how the PDF itself is constructed. For some reason, the file really does encode thousands of rectangles, as can be seen when inspecting the raw PDF commands:

[Screenshot of the raw PDF commands, 2022-12-19]

[...]

[Screenshot of the raw PDF commands, 2022-12-19]

[...]

Given that situation, I don't think this is a bug, so I'm closing this issue. But feel free to continue the discussion here.
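
For anyone who lands here with a similar PDF: since the file itself encodes one thick line as thousands of thin rectangles, one possible workaround is to collapse near-duplicate edges yourself before using them. The snippet below is only a minimal sketch of that idea, not part of pdfplumber or pdfminer.six. The function name collapse_horizontal_edges and its two tolerance parameters are made up for illustration, and it assumes each edge is a dict with "x0", "x1" and "top" keys, similar in shape to the dicts in page.edges. It handles horizontal edges only; vertical edges would need the same logic with the axes swapped.

```python
def collapse_horizontal_edges(edges, snap_tolerance=3.0, join_tolerance=3.0):
    """Collapse near-duplicate horizontal edges into a few long ones.

    Hypothetical helper: each edge is assumed to be a dict with
    "x0", "x1" and "top" keys.
    """
    # Step 1 (snap): walk the edges from top to bottom and group each one
    # with the previous edge when their "top" values are close. Neighbours
    # are chained, so a dense stack of thin slices ends up in one cluster.
    clusters = []
    for edge in sorted(edges, key=lambda e: e["top"]):
        if clusters and edge["top"] - clusters[-1][-1]["top"] <= snap_tolerance:
            clusters[-1].append(edge)
        else:
            clusters.append([edge])

    merged = []
    for cluster in clusters:
        # Represent the whole cluster by the mean of its "top" values.
        top = sum(e["top"] for e in cluster) / len(cluster)

        # Step 2 (join): merge horizontal spans that overlap or nearly touch.
        spans = sorted((e["x0"], e["x1"]) for e in cluster)
        cur_x0, cur_x1 = spans[0]
        for x0, x1 in spans[1:]:
            if x0 - cur_x1 <= join_tolerance:
                cur_x1 = max(cur_x1, x1)
            else:
                merged.append({"x0": cur_x0, "x1": cur_x1, "top": top})
                cur_x0, cur_x1 = x0, x1
        merged.append({"x0": cur_x0, "x1": cur_x1, "top": top})
    return merged
```

With the default tolerances, a thousand slices stacked 0.001 points apart come back as a single edge. Because neighbours are chained, lines that really are distinct but lie closer together than snap_tolerance would also be merged, so the tolerances need to be tuned to the document.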