jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License

Table extraction bug when lines are just barely end-to-end #1110

Open jsvine opened 6 months ago

jsvine commented 6 months ago

Describe the bug

Via https://github.com/jsvine/pdfplumber/discussions/1087#discussioncomment-8564694, it seems that there's a bug in how pdfplumber joins lines.

Have you tried repairing the PDF?

Yes.

Code to reproduce the problem

Download the PDF in the linked comment. Then:

import pdfplumber
pdf = pdfplumber.open("2022.Sustainability.Report_NYSE_WM_2022.pdf")
page = pdf.pages[41]
im = page.to_image()
im.reset().debug_tablefinder({
    "join_x_tolerance": 0
})

[Screenshot: output of the debug_tablefinder call above]

And compare to:

(
    im.reset()
    .draw_lines(
        pdfplumber.table.merge_edges(
            pdfplumber.utils.filter_edges(page.edges, "h"),
            snap_x_tolerance=0,
            snap_y_tolerance=0,
            join_x_tolerance=-1,
            join_y_tolerance=0,
        )
    )
)

[Screenshot: output of the draw_lines call above]

PDF file

See linked issue.

Expected behavior

pdfplumber's table-finding approach should merge all the sub-lines in each visual line into a single line.
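
To make the expected result concrete, here is a minimal, self-contained sketch of joining horizontal sub-lines that sit end-to-end. It is not pdfplumber's implementation: the function name `join_horizontal_segments`, the `(x0, x1, y)` tuple layout, and the sample coordinates are all hypothetical, chosen only to illustrate what "merge into a single line" means and how a join tolerance affects it.

```python
from itertools import groupby
from operator import itemgetter


def join_horizontal_segments(segments, join_tolerance=0.0):
    """Merge (x0, x1, y) segments that share a y and touch or overlap in x.

    Hypothetical helper for illustration only; not part of pdfplumber.
    """
    merged = []
    # Sort by y, then by left edge, so each row is scanned left to right.
    by_row = sorted(segments, key=lambda s: (s[2], s[0]))
    for y, row in groupby(by_row, key=itemgetter(2)):
        row = list(row)
        cur_x0, cur_x1, _ = row[0]
        for x0, x1, _ in row[1:]:
            if x0 <= cur_x1 + join_tolerance:
                # Close enough to the current run: extend it.
                cur_x1 = max(cur_x1, x1)
            else:
                # Gap is too large: close the current run and start a new one.
                merged.append((cur_x0, cur_x1, y))
                cur_x0, cur_x1 = x0, x1
        merged.append((cur_x0, cur_x1, y))
    return merged


# Three made-up sub-lines that meet exactly end-to-end on one row.
sub_lines = [(0.0, 100.0, 50.0), (100.0, 200.0, 50.0), (200.0, 300.0, 50.0)]

print(join_horizontal_segments(sub_lines))
# [(0.0, 300.0, 50.0)]
print(join_horizontal_segments(sub_lines, join_tolerance=-1))
# [(0.0, 100.0, 50.0), (100.0, 200.0, 50.0), (200.0, 300.0, 50.0)]
```

In this sketch, a tolerance of 0 is enough to join segments whose endpoints coincide exactly, while a negative tolerance leaves them separate.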

Actual behavior

The table finder appears to mishandle these lines, "finding" only certain portions of each visual line instead of the whole line.

Screenshots

See above

Environment