Add `.extract_table(...)` logic to avoid assigning characters to multiple cells

jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.

MIT License


(See https://github.com/jsvine/pdfplumber/discussions/1006#discussioncomment-7256789 for context.)

Currently, `Page.extract_table(...)` uses `.crop(...)` internally for each cell, capturing every character that overlaps the cell's bounding box at all. So if a character straddles multiple cells in a table, its text appears in each of those cells in the `.extract_table(...)` output. This seems undesirable. Perhaps the method should either:

(a) "keep track" of characters it has already assigned to a cell (and then not use them again in another cell), or

(b) only assign characters to cells if they are more than 50% inside? (And then what to do if a character is perfectly divided 50/50 between two cells? Or 30/30/40 across three cells?)
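
To make the trade-off concrete, here is a minimal sketch in plain Python, not pdfplumber's actual implementation. It assumes characters are dicts with `text`, `x0`, `top`, `x1`, and `bottom` keys (as pdfplumber's char objects have) and cells are `(x0, top, x1, bottom)` tuples. The function names are hypothetical. The first function models the current "any overlap" behavior; the second blends (a) and (b) by giving each character to the single cell that covers most of it, with ties going to the earliest cell.

```python
from typing import Dict, List, Sequence, Tuple

BBox = Tuple[float, float, float, float]  # (x0, top, x1, bottom)


def overlap_area(char: Dict, cell: BBox) -> float:
    """Area of the intersection between a char's box and a cell's box."""
    x0, top, x1, bottom = cell
    width = min(char["x1"], x1) - max(char["x0"], x0)
    height = min(char["bottom"], bottom) - max(char["top"], top)
    return max(width, 0.0) * max(height, 0.0)


def cells_any_overlap(chars: Sequence[Dict], cells: Sequence[BBox]) -> List[str]:
    """Model of the behavior described above: any overlap puts the char in the cell."""
    return [
        "".join(c["text"] for c in chars if overlap_area(c, cell) > 0)
        for cell in cells
    ]


def cells_best_overlap(chars: Sequence[Dict], cells: Sequence[BBox]) -> List[str]:
    """Assign each char to exactly one cell: the one with the largest overlap.

    Ties (50/50, or equal thirds) go to the earliest cell in `cells`,
    because list.index returns the first position of the maximum.
    """
    texts: List[List[str]] = [[] for _ in cells]
    for c in chars:
        areas = [overlap_area(c, cell) for cell in cells]
        best = max(areas, default=0.0)
        if best > 0:
            texts[areas.index(best)].append(c["text"])
    return ["".join(t) for t in texts]


# Two side-by-side cells; "B" straddles the shared border at x = 10.
cells = [(0, 0, 10, 10), (10, 0, 20, 10)]
chars = [
    {"text": "A", "x0": 2, "top": 2, "x1": 8, "bottom": 8},
    {"text": "B", "x0": 8, "top": 2, "x1": 14, "bottom": 8},
    {"text": "C", "x0": 14, "top": 2, "x1": 18, "bottom": 8},
]

print(cells_any_overlap(chars, cells))   # ['AB', 'BC']
print(cells_best_overlap(chars, cells))  # ['A', 'BC']
```

Picking the largest overlap sidesteps the 50% threshold question: a character split 30/30/40 lands in the 40% cell, and an exact tie still resolves deterministically. Whether "earliest cell wins" is the right tie-break is an open design choice.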

Thoughts / suggestions / other approaches?

Add `.extract_table(...)` logic to avoid assigning characters to multiple cells #1013