Open jsvine opened 6 months ago
Describe the bug
Via https://github.com/jsvine/pdfplumber/discussions/1087#discussioncomment-8564694, it seems that there's a bug in how pdfplumber joins lines.
Have you tried repairing the PDF?
Yes.
Code to reproduce the problem
Download the PDF in the linked comment. Then:

```python
import pdfplumber

pdf = pdfplumber.open("2022.Sustainability.Report_NYSE_WM_2022.pdf")
page = pdf.pages[41]
im = page.to_image()
im.reset().debug_tablefinder({"join_x_tolerance": 0})
```

And compare to:

```python
(
    im.reset()
    .draw_lines(
        pdfplumber.table.merge_edges(
            pdfplumber.utils.filter_edges(page.edges, "h"),
            snap_x_tolerance=0,
            snap_y_tolerance=0,
            join_x_tolerance=-1,
            join_y_tolerance=0,
        )
    )
)
```

PDF file
See linked issue.
Expected behavior
pdfplumber's table-finding approach should merge all the sub-lines in each visual line into a single line.
Actual behavior
The method appears to do something strange with the lines, "finding" only certain portions of them.
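
For contrast with the behavior reported here, the expected result can be sketched without pdfplumber. This is a minimal illustration, not pdfplumber's actual implementation: the function `join_horizontal_segments`, its `join_tolerance` parameter, and the sample coordinates are all hypothetical, and the dicts only borrow the `x0`, `x1`, and `top` keys that pdfplumber edges carry.

```python
from itertools import groupby
from operator import itemgetter


def join_horizontal_segments(segments, join_tolerance=0):
    """Merge segments that share a `top` value and touch, overlap,
    or sit within `join_tolerance` of each other (hypothetical helper)."""
    merged = []
    by_row = sorted(segments, key=itemgetter("top", "x0"))
    for top, row in groupby(by_row, key=itemgetter("top")):
        current = None  # running merged segment for this row
        for seg in row:
            if current is not None and seg["x0"] <= current["x1"] + join_tolerance:
                # Close enough to the running segment: extend it.
                current["x1"] = max(current["x1"], seg["x1"])
            else:
                # Gap is too large (or first segment in the row): start anew.
                current = {"x0": seg["x0"], "x1": seg["x1"], "top": top}
                merged.append(current)
    return merged


# Three end-to-end sub-lines on one visual line, plus one separate line.
sub_lines = [
    {"x0": 0, "x1": 10, "top": 50},
    {"x0": 10, "x1": 25, "top": 50},
    {"x0": 25, "x1": 40, "top": 50},
    {"x0": 0, "x1": 40, "top": 80},
]
print(join_horizontal_segments(sub_lines))
```

With a tolerance of 0, the three abutting sub-lines at `top=50` collapse into one segment spanning `x0=0` to `x1=40`, which is the kind of output the report expects for each visual line.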
Screenshots
See above
Environment
0.11.0