Closed shangrilar closed 1 year ago
Thanks for your interest in this library, @shangrilar. The detection of lines and rects (the source of .edges) is performed by pdfminer.six, pdfplumber's main dependency. In either case, both libraries are just reflecting how the PDF itself is constructed. For some reason, it really does encode thousands of rectangles, as can be seen when inspecting the raw PDF commands:
[...]
[...]
Given that situation, I don't think this is a bug, so I'm closing this issue. But feel free to continue the discussion here.
Describe the bug
Hi, thanks for your amazing project.
I tried to extract tables from PDF files, but for some files far too many edges are detected on a single page (more than 100,000).
Code to reproduce the problem
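import pdfplumber

def curves_to_edges(cs):
    # Convert each object's bounding box into its four edges
    edges = []
    for c in cs:
        edges += pdfplumber.utils.rect_to_edges(c)
    return edges

# `page` is a pdfplumber page opened from the attached PDF
lines = curves_to_edges(page.curves + page.edges)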
PDF file
arti_error-3.pdf
Expected behavior
One thick line should be detected as a single line.
Actual behavior
One thick line is detected as many separate lines.
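For readers hitting the same situation, here is a minimal sketch of one possible workaround: collapsing near-coincident horizontal edges into a single edge before using them. It is not part of pdfplumber. The function name `collapse_horizontal_edges` and the `tolerance` parameter are hypothetical, and the sketch only assumes that each edge is a dict with `x0`, `x1`, `top`, `bottom`, and `orientation` keys, as pdfplumber edges have.

```python
def collapse_horizontal_edges(edges, tolerance=1.0):
    """Hypothetical helper: merge horizontal edges that lie within
    `tolerance` of each other vertically and overlap (or nearly touch)
    horizontally. Vertical edges are ignored in this sketch."""
    horizontal = sorted(
        (e for e in edges if e.get("orientation") == "h"),
        key=lambda e: e["top"],
    )

    # Step 1: group edges whose `top` is within `tolerance` of the
    # previous edge in sorted order (chained clustering).
    clusters = []
    for e in horizontal:
        if clusters and e["top"] - clusters[-1][-1]["top"] <= tolerance:
            clusters[-1].append(e)
        else:
            clusters.append([e])

    # Step 2: inside each group, join x-spans that overlap or nearly touch.
    merged = []
    for cluster in clusters:
        top = min(e["top"] for e in cluster)
        bottom = max(e["bottom"] for e in cluster)
        spans = sorted((e["x0"], e["x1"]) for e in cluster)
        cur_x0, cur_x1 = spans[0]
        for x0, x1 in spans[1:]:
            if x0 <= cur_x1 + tolerance:
                cur_x1 = max(cur_x1, x1)
            else:
                merged.append({"x0": cur_x0, "x1": cur_x1, "top": top,
                               "bottom": bottom, "orientation": "h"})
                cur_x0, cur_x1 = x0, x1
        merged.append({"x0": cur_x0, "x1": cur_x1, "top": top,
                       "bottom": bottom, "orientation": "h"})
    return merged


# Three hairline edges stacked within 0.4 pt, plus one separate edge.
sample = [
    {"x0": 10, "x1": 100, "top": 50.0, "bottom": 50.0, "orientation": "h"},
    {"x0": 10, "x1": 100, "top": 50.2, "bottom": 50.2, "orientation": "h"},
    {"x0": 10, "x1": 100, "top": 50.4, "bottom": 50.4, "orientation": "h"},
    {"x0": 10, "x1": 100, "top": 80.0, "bottom": 80.0, "orientation": "h"},
    {"x0": 10, "x1": 10, "top": 50.0, "bottom": 80.0, "orientation": "v"},
]
collapsed = collapse_horizontal_edges(sample)
print(len(collapsed))
```

Because the grouping is chained, a long run of closely spaced edges ends up in one group even if its first and last members are further apart than `tolerance`. That is usually what is wanted for a thick line drawn as many thin rectangles, but the tolerance may need tuning per document. pdfplumber's table settings also include `snap_tolerance` and `join_tolerance`, which may be worth trying first.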