jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License
6.73k stars 670 forks source link

Duplicate value for merged cell instead of `None` #420

Open tungph opened 3 years ago

tungph commented 3 years ago

Hi there, Firstly, I want to thank you for creating this awesome pdfplumber project. I have been using it a lot lately. Secondly, I'm facing a problem and want to propose change to the output of the table module regarding the utility extract_tables with merged cells. As it is now, only the first row of the merged cell has value, the remaining cells get None. image

Header 1, Header 2
Merged cell 1, Cell 2-1
None, Cell 2-2

It would be more semantically correct to have the value in merged cell duplicated like this

Header 1, Header 2
Merged cell 1, Cell 2-1
Merged cell 1, Cell 2-2

I'm preparing a PR, please consider it. Thank you!

Update 1

Here's the PR: #422

Approach: finding the centroid of each "cell", and then get the cells from the centroids for each row. image

Update 2:

I've encountered this case: image Expected output is:

1-1, 2-1
None, 2-2
None, 2-3

None means not a cell there, and it makes the number of elements is even between rows which is convenient for further processing. Note: the PR is updated

jsvine commented 3 years ago

Hi @tungph, thanks for your interest in this library and thank you for the PR. I agree that pdfplumber's current approach to complex tables is not optimal, but it is the result of not wanting to over-impose assumptions about any given table. The handling of merged cells in tables is a tricky topic, and one with a lot of edge cases — what may be a good solution for one set of tables may be a poor one for others. So I'll need to consider this suggestion and PR somewhat carefully, and think about (and test) other PDFs.

Relatedly: I have an idea for providing a richer representation of complex tables — one that would make the structure of the tables much clearer — but it would be a substantial change and so it will probably have to wait for v0.6.0.

tungph commented 3 years ago

Thank you for the consideration, @jsvine. Please let me know if you found bug with my PR. I'm happy to provide fix.

Safrone commented 7 months ago

I would love to have this as an option when extracting tables even if it is not the default behavior.