Open tungph opened 3 years ago
Hi @tungph, thanks for your interest in this library and thank you for the PR. I agree that pdfplumber
's current approach to complex tables is not optimal, but it is the result of not wanting to over-impose assumptions about any given table. The handling of merged cells in tables is a tricky topic, and one with a lot of edge cases — what may be a good solution for one set of tables may be a poor one for others. So I'll need to consider this suggestion and PR somewhat carefully, and think about (and test) other PDFs.
Relatedly: I have an idea for providing a richer representation of complex tables — one that would make the structure of the tables much clearer — but it would be a substantial change and so it will probably have to wait for v0.6.0
.
Thank you for the consideration, @jsvine. Please let me know if you found bug with my PR. I'm happy to provide fix.
I would love to have this as an option when extracting tables even if it is not the default behavior.
Hi there, Firstly, I want to thank you for creating this awesome
pdfplumber
project. I have been using it a lot lately. Secondly, I'm facing a problem and want to propose change to the output of the table module regarding the utilityextract_tables
with merged cells. As it is now, only the first row of the merged cell has value, the remaining cells getNone
.It would be more semantically correct to have the value in merged cell duplicated like this
I'm preparing a PR, please consider it. Thank you!
Update 1
Here's the PR: #422
Approach: finding the centroid of each "cell", and then get the cells from the centroids for each row.
Update 2:
I've encountered this case: Expected output is:
None
means not a cell there, and it makes the number of elements is even between rows which is convenient for further processing. Note: the PR is updated