Closed by jsvine 9 months ago
Whoops, my assessment was incorrect here, and based on an assumption I failed to double-check in the code. The good news is we already handle this in a reasonable-seeming way:
Thanks to @cmdlineluser in https://github.com/jsvine/pdfplumber/discussions/1006#discussioncomment-7280701
(See https://github.com/jsvine/pdfplumber/discussions/1006#discussioncomment-7256789 for context.)
Currently, Page.extract_table(...) uses .crop(...) internally for each cell, capturing all characters that overlap at all with the cell's bounding box. So if a character straddles multiple cells in a table, its text would appear in multiple cells of the .extract_table(...) output. This seems undesirable. Perhaps the method should either:
(a) "keep track" of characters it has already assigned to a cell (and then not use them again in another cell), or
(b) only assign characters to cells if they are more than 50% inside? (And then what to do if a character is perfectly divided 50/50 between two cells? Or 30/30/40 across three cells?)
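To make the two options concrete, here is a rough sketch that combines them: each character goes to exactly one cell, the one it overlaps most, and ties go to the first cell in list order. This is only an illustration, not pdfplumber's actual implementation. The helper names overlap_area and assign_chars_to_cells are hypothetical, and I'm assuming chars are dicts with x0, top, x1, bottom keys and that cell bounding boxes are (x0, top, x1, bottom) tuples.

```python
def overlap_area(char, bbox):
    """Area of intersection between a char's box and a cell bbox (0 if disjoint)."""
    x0, top, x1, bottom = bbox
    width = min(char["x1"], x1) - max(char["x0"], x0)
    height = min(char["bottom"], bottom) - max(char["top"], top)
    return max(width, 0) * max(height, 0)


def assign_chars_to_cells(chars, cell_bboxes):
    """Assign each char to the single cell it overlaps most.

    Hypothetical helper: ties (e.g. a 50/50 split) go to the first cell in
    `cell_bboxes`; chars that overlap no cell are dropped.
    """
    assigned = {i: [] for i in range(len(cell_bboxes))}
    if not cell_bboxes:
        return assigned
    for char in chars:
        areas = [overlap_area(char, bbox) for bbox in cell_bboxes]
        # max() returns the first index with the largest area, which is the tie-break.
        best = max(range(len(areas)), key=areas.__getitem__)
        if areas[best] > 0:
            assigned[best].append(char)
    return assigned


# Two side-by-side cells sharing an edge at x = 10.
cells = [(0, 0, 10, 10), (10, 0, 20, 10)]
chars = [
    {"text": "A", "x0": 2, "top": 2, "x1": 6, "bottom": 8},   # fully inside cell 0
    {"text": "B", "x0": 8, "top": 2, "x1": 14, "bottom": 8},  # mostly inside cell 1
    {"text": "C", "x0": 7, "top": 2, "x1": 13, "bottom": 8},  # exact 50/50 split
]
result = assign_chars_to_cells(chars, cells)
cell_text = ["".join(c["text"] for c in result[i]) for i in range(len(cells))]
# cell_text == ["AC", "B"]
```

Picking the largest overlap sidesteps the 30/30/40 question (the 40% cell wins), but the exact 50/50 case still needs an arbitrary rule; "first cell wins" is just one choice among several.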
Thoughts / suggestions / other approaches?