jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License

Duplicate value for merged cell instead of `None` #422

Open tungph opened 3 years ago

tungph commented 3 years ago

See #420

codecov[bot] commented 3 years ago

Codecov Report

Merging #422 (c9073fa) into develop (4407362) will decrease coverage by 0.01%. The diff coverage is 100.00%.

:exclamation: Current head c9073fa differs from pull request most recent head 828799b. Consider uploading reports for the commit 828799b to get more accurate results.

@@             Coverage Diff             @@
##           develop     #422      +/-   ##
===========================================
- Coverage    98.28%   98.26%   -0.02%     
===========================================
  Files           10       10              
  Lines         1226     1213      -13     
===========================================
- Hits          1205     1192      -13     
  Misses          21       21              

| Impacted Files | Coverage Δ |
|---|---|
| pdfplumber/table.py | 100.00% <100.00%> (ø) |
| pdfplumber/cli.py | 100.00% <0.00%> (ø) |
| pdfplumber/convert.py | 100.00% <0.00%> (ø) |
| pdfplumber/container.py | 100.00% <0.00%> (ø) |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 4407362...828799b.

tungph commented 3 years ago

I saw #387:

jsvine commented 3 years ago

Hi @tungph, thanks for your interest in this library and thank you for the PR. I agree that pdfplumber's current approach to complex tables is not optimal, but it is the result of not wanting to over-impose assumptions about any given table. The handling of merged cells in tables is a tricky topic, and one with a lot of edge cases — what may be a good solution for one set of tables may be a poor one for others. So I'll need to consider this suggestion and PR somewhat carefully, and think about (and test) other PDFs.
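
To make the trade-off concrete, here is a minimal sketch of the behavior the PR title describes, written as a post-processing step on an already-extracted table rather than as a change to pdfplumber itself. It assumes the table is a list of rows in which a `None` cell marks the continuation of a merged cell, and that the caller knows the direction of the merge. The function name `fill_merged_cells` and its `direction` parameter are hypothetical and not part of pdfplumber.

```python
from typing import List, Optional

Row = List[Optional[str]]


def fill_merged_cells(table: List[Row], direction: str = "right") -> List[Row]:
    """Return a copy of `table` with each None cell replaced by a neighbor's value.

    direction="right": copy from the cell to the left (horizontal merge).
    direction="down":  copy from the cell above (vertical merge).
    Hypothetical helper; assumes None marks a merged-cell continuation.
    """
    if direction not in ("right", "down"):
        raise ValueError("direction must be 'right' or 'down'")

    # Copy each row so the caller's table is left untouched.
    filled = [list(row) for row in table]
    for r, row in enumerate(filled):
        for c, cell in enumerate(row):
            if cell is not None:
                continue
            if direction == "right" and c > 0:
                # The left neighbor has already been filled, so values propagate.
                row[c] = row[c - 1]
            elif direction == "down" and r > 0 and c < len(filled[r - 1]):
                row[c] = filled[r - 1][c]
    return filled


table = [["A", None, "B"], ["1", "2", None]]
filled_right = fill_merged_cells(table, "right")
filled_down = fill_merged_cells(table, "down")
```

The sketch also shows one of the edge cases: if a cell that is simply empty is reported as `None` as well, this fill cannot tell it apart from a merged cell and will duplicate a value that does not belong there.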

Relatedly: I have an idea for providing a richer representation of complex tables — one that would make the structure of the tables much clearer — but it would be a substantial change and so it will probably have to wait for v0.6.0.

tungph commented 3 years ago

Thank you for the consideration, @jsvine. Please let me know if you find a bug in my PR; I'm happy to provide a fix.