jsvine commented 2 years ago

See this method and docstring for details regarding the implementation: https://github.com/jsvine/pdfplumber/blob/d235d4bbc4ad0c1b290916cae67060badf75198d/pdfplumber/utils.py#L344-L381

The approach is similar to @jsfenfen's helpful suggestion in https://github.com/jsvine/pdfplumber/issues/10#issuecomment-197071287 (thanks!), but with some tweaks for fidelity and parameterization, as well as integration with the rest of the library.

Notes:

The "density" defaults, by the way, are somewhat arbitrary and simply reflect an intuition after experimenting on a few PDFs.
This PR adds two tests (for basic and cropped pages), but I'm open to adding more.
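
As a rough illustration of what the "density" parameters control, here is a simplified sketch of the general idea: character positions (in PDF points) are divided by a density to pick a column and row on a monospace grid. This is not the implementation linked above. The function name `layout_text` is hypothetical, the density values are picked for illustration, and the input is assumed to be a list of dicts with `text`, `x0`, and `top` keys like pdfplumber's char objects.

```python
def layout_text(chars, x_density=7.25, y_density=13.0):
    """Place characters on a text grid (hypothetical helper, not pdfplumber's code).

    x_density / y_density: assumed number of PDF points per output
    column / row. The values here are illustrative only.
    """
    if not chars:
        # Return an empty string rather than None, for return-type consistency.
        return ""

    grid = {}  # row index -> {column index -> character}
    for char in chars:
        col = int(char["x0"] / x_density)
        row = int(char["top"] / y_density)
        # Simplification: a later character overwrites an earlier one
        # that lands in the same cell.
        grid.setdefault(row, {})[col] = char["text"]

    lines = []
    for row in range(max(grid) + 1):
        cells = grid.get(row, {})
        width = max(cells) + 1 if cells else 0
        lines.append("".join(cells.get(i, " ") for i in range(width)))
    return "\n".join(lines)


sample_chars = [
    {"text": "A", "x0": 0.0, "top": 0.0},
    {"text": "B", "x0": 30.0, "top": 0.0},
    {"text": "C", "x0": 15.0, "top": 27.0},
]
print(layout_text(sample_chars))
```

In this sketch, smaller density values spread the same characters across more columns and rows, which is why the choice of defaults affects how closely the output resembles the page.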

Other changes:

.extract_text(chars, ...) returns "" if passed no characters, for return-type consistency.
.extract_words(...) now includes doctop in its returned attributes.

codecov[bot] commented 2 years ago

Codecov Report

Merging #532 (d9ae456) into develop (c915a00) will increase coverage by 0.02%. The diff coverage is 100.00%.

Note: Current head d9ae456 differs from pull request most recent head d235d4b. Consider uploading reports for the commit d235d4b to get more accurate results.

@@             Coverage Diff             @@
##           develop     #532      +/-   ##
===========================================
+ Coverage    98.77%   98.79%   +0.02%     
===========================================
  Files           10       10              
  Lines         1220     1243      +23     
===========================================
+ Hits          1205     1228      +23     
  Misses          15       15

Impacted Files	Coverage Δ
pdfplumber/page.py	`100.00% <100.00%> (ø)`
pdfplumber/utils.py	`100.00% <100.00%> (ø)`

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update c915a00...d235d4b.

jsvine commented 2 years ago

Thanks for the approval!

jsvine / pdfplumber

Add experimental .extract_text(layout=True) #532
