jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License

Add `.extract_table(...)` logic to avoid assigning characters to multiple cells #1013

Closed by jsvine 9 months ago

jsvine commented 9 months ago

(See https://github.com/jsvine/pdfplumber/discussions/1006#discussioncomment-7256789 for context.)

Currently, Page.extract_table(...) uses .crop(...) internally for each cell, capturing every character that overlaps the cell's bounding box, even partially. So if a character straddles multiple cells in a table, its text appears in each of those cells in the .extract_table(...) output. This seems undesirable. Perhaps the method should either:

(a) "keep track" of characters it has already assigned to a cell (and then not use them again in another cell), or

(b) only assign a character to a cell if more than 50% of it lies inside that cell. (But then what should happen if a character is split exactly 50/50 between two cells? Or 30/30/40 across three cells?)
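To make the two options concrete, here is a rough sketch that combines them: each character goes to at most one cell, namely the cell that contains the largest share of its area, with ties going to the earliest cell. It uses plain dicts with pdfplumber-style keys (x0, top, x1, bottom, text). The helper names overlap_fraction and assign_once are hypothetical and are not part of pdfplumber.

```python
def overlap_fraction(char, bbox):
    """Fraction of the char's area that lies inside bbox = (x0, top, x1, bottom)."""
    x0, top, x1, bottom = bbox
    width = max(0.0, min(char["x1"], x1) - max(char["x0"], x0))
    height = max(0.0, min(char["bottom"], bottom) - max(char["top"], top))
    area = (char["x1"] - char["x0"]) * (char["bottom"] - char["top"])
    return (width * height) / area if area > 0 else 0.0


def assign_once(chars, cell_bboxes):
    """Assign each char to at most one cell: the one holding most of its area.

    Hypothetical helper, not pdfplumber's API. Ties (e.g. a 50/50 split)
    go to the earliest cell in cell_bboxes, because max() returns the
    first maximal element.
    """
    cells = [[] for _ in cell_bboxes]
    if not cell_bboxes:
        return []
    for char in chars:
        fractions = [overlap_fraction(char, bbox) for bbox in cell_bboxes]
        best = max(range(len(cell_bboxes)), key=lambda i: fractions[i])
        if fractions[best] > 0:  # skip chars that touch no cell at all
            cells[best].append(char["text"])
    return ["".join(texts) for texts in cells]
```

Picking the largest share sidesteps the 30/30/40 case, since a fixed 50% threshold would drop that character entirely. The tie-break rule is arbitrary, which is one reason a simpler criterion may be preferable.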

Thoughts / suggestions / other approaches?

jsvine commented 9 months ago

Whoops, my assessment was incorrect here, and based on an assumption I failed to double-check in the code. The good news is we already handle this in a reasonable-seeming way:

https://github.com/jsvine/pdfplumber/blob/94da66c1b32954d02ef03a5a9b30d0177d27af84/pdfplumber/table.py#L399-L410
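For readers who do not want to follow the link, one common way to guarantee that a character lands in at most one cell is a center-point test: a character belongs to a cell only if its midpoint falls inside the cell's bounding box. The sketch below illustrates that idea with hypothetical helper names; it is not a copy of the linked lines and should not be read as their exact logic.

```python
def char_center_in_bbox(char, bbox):
    """True if the char's center point lies inside bbox = (x0, top, x1, bottom).

    The intervals are half-open, so a center that sits exactly on a
    shared border belongs to only one of the two neighboring cells.
    """
    x0, top, x1, bottom = bbox
    h_mid = (char["x0"] + char["x1"]) / 2
    v_mid = (char["top"] + char["bottom"]) / 2
    return (x0 <= h_mid < x1) and (top <= v_mid < bottom)


def cell_text(chars, bbox):
    """Join the text of every char whose center falls inside the cell."""
    return "".join(c["text"] for c in chars if char_center_in_bbox(c, bbox))
```

Because a character has exactly one center, non-overlapping cells can never both claim it, and no tie-breaking rule for 50/50 or 30/30/40 splits is needed.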

Thanks to @cmdlineluser for pointing this out in https://github.com/jsvine/pdfplumber/discussions/1006#discussioncomment-7280701