jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License

Add experimental .extract_text(layout=True) #532

jsvine closed this 2 years ago

jsvine commented 2 years ago

See this method and docstring for details regarding the implementation: https://github.com/jsvine/pdfplumber/blob/d235d4bbc4ad0c1b290916cae67060badf75198d/pdfplumber/utils.py#L344-L381

The approach is similar to @jsfenfen's helpful suggestion in https://github.com/jsvine/pdfplumber/issues/10#issuecomment-197071287 (thanks!), but with some tweaks for fidelity and parameterization, as well as integration with the rest of the library.
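For readers who want a feel for the general idea, here is a minimal sketch of position-based text layout: each character is snapped to a cell in a monospace grid according to its coordinates. This is not the code in `utils.py`; the function name `render_layout`, the parameters `x_density` and `y_density`, and their default values are assumptions made for illustration, and the input is assumed to be a list of dicts with `text`, `x0`, and `top` keys.

```python
def render_layout(chars, x_density=7.25, y_density=13.0):
    """Illustrative sketch only (hypothetical helper, not pdfplumber's code).

    Snap each character to a (row, col) cell on a monospace grid, where
    x_density / y_density are the assumed number of points per column / row.
    """
    grid = {}  # row index -> {column index -> character}

    # Process in reading order so collisions are resolved left to right.
    for ch in sorted(chars, key=lambda c: (c["top"], c["x0"])):
        row = int(ch["top"] / y_density)
        col = int(ch["x0"] / x_density)
        cells = grid.setdefault(row, {})
        # If the target cell is already taken, shift right to the next free one.
        while col in cells:
            col += 1
        cells[col] = ch["text"]

    if not grid:
        return ""

    lines = []
    for row in range(max(grid) + 1):
        cells = grid.get(row, {})
        width = max(cells) + 1 if cells else 0
        # Unoccupied cells become spaces; rows with no characters become blank lines.
        lines.append("".join(cells.get(col, " ") for col in range(width)))
    return "\n".join(lines)


if __name__ == "__main__":
    sample = [
        {"text": "A", "x0": 0.0, "top": 0.0},
        {"text": "B", "x0": 14.5, "top": 0.0},
        {"text": "C", "x0": 7.25, "top": 26.0},
    ]
    print(render_layout(sample))
```

Horizontal gaps become runs of spaces and vertical gaps become blank lines, which is what makes the output resemble the page. With the library itself, the entry point named in this PR's title is `page.extract_text(layout=True)`.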

Notes:

Other changes:

codecov[bot] commented 2 years ago

Codecov Report

Merging #532 (d9ae456) into develop (c915a00) will increase coverage by 0.02%. The diff coverage is 100.00%.

Current head d9ae456 differs from pull request most recent head d235d4b. Consider uploading reports for the commit d235d4b to get more accurate results.

@@             Coverage Diff             @@
##           develop     #532      +/-   ##
===========================================
+ Coverage    98.77%   98.79%   +0.02%     
===========================================
  Files           10       10              
  Lines         1220     1243      +23     
===========================================
+ Hits          1205     1228      +23     
  Misses          15       15              
Impacted Files         Coverage Δ
pdfplumber/page.py     100.00% <100.00%> (ø)
pdfplumber/utils.py    100.00% <100.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update c915a00...d235d4b.

jsvine commented 2 years ago

Thanks for the approval!