jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License

Respect `use_text_flow` in `extract_text` #983

Closed dhdaines closed 10 months ago

dhdaines commented 10 months ago

As the title says (though the actual change is in `cluster_objects`)! Fixes #982

codecov[bot] commented 10 months ago

Codecov Report

Merging #983 (ed707a7) into develop (336f83f) will not change coverage. The diff coverage is 100.00%.

@@            Coverage Diff            @@
##           develop      #983   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           18        18           
  Lines         1613      1615    +2     
=========================================
+ Hits          1613      1615    +2     

Files Changed                     Coverage Δ
pdfplumber/utils/text.py          100.00% <ø> (ø)
pdfplumber/utils/clustering.py    100.00% <100.00%> (ø)

dhdaines commented 10 months ago

Improves "accuracy" on the benchmarks mentioned above from 75% to 93%, putting pdfplumber at the same level as poppler and ahead of pdfminer.six: https://github.com/dhdaines/benchmarks (The metric used is suboptimal: it should be Levenshtein distance on words rather than on characters, but whatever.)
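
To illustrate the point about the metric, here is a minimal sketch comparing character-level and word-level Levenshtein distance on the same pair of strings. It is not the benchmark's actual scoring code: the `accuracy` normalization (one minus distance divided by the longer length) and both function names are assumptions made for this example.

```python
from typing import Sequence


def levenshtein(a: Sequence, b: Sequence) -> int:
    """Edit distance between two sequences (strings or lists of tokens)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def accuracy(reference: Sequence, hypothesis: Sequence) -> float:
    """Assumed normalization: 1 - distance / longer length, floored at 0."""
    if not reference and not hypothesis:
        return 1.0
    dist = levenshtein(reference, hypothesis)
    return max(0.0, 1.0 - dist / max(len(reference), len(hypothesis)))


reference = "the quick brown fox"
hypothesis = "the brown quick fox"  # two words extracted in the wrong order

# Same texts, scored per character and per word.
char_acc = accuracy(reference, hypothesis)
word_acc = accuracy(reference.split(), hypothesis.split())
print(f"char-level: {char_acc:.2f}, word-level: {word_acc:.2f}")
```

Because `levenshtein` accepts any sequence, passing `str.split()` output switches the unit of comparison from characters to words without changing the algorithm.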

There is definitely room for improvement, particularly since we can take advantage of tagged PDFs when they exist!

jsvine commented 10 months ago

Thank you! This was a great catch, and the implementation makes sense to me. Nice, too, to have the `preserve_order` argument available in `cluster_objects(...)`, which I could see being useful in additional situations. Merging.
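
For readers following along, the idea behind an order-preserving clustering option can be sketched roughly as below. This is a simplified illustration, not pdfplumber's implementation: `cluster_values` and `cluster_sketch` are hypothetical names, and the behavior shown for `preserve_order` is an assumption based on the discussion in this thread.

```python
from itertools import groupby
from operator import itemgetter


def cluster_values(values, tolerance):
    """Map each distinct value to a cluster id. Walking the sorted values,
    a value joins the previous cluster if it is within `tolerance` of the
    previous value; otherwise it starts a new cluster."""
    cluster_of = {}
    cluster_id = -1
    last = None
    for v in sorted(set(values)):
        if last is None or v - last > tolerance:
            cluster_id += 1
        cluster_of[v] = cluster_id
        last = v
    return cluster_of


def cluster_sketch(objs, key, tolerance, preserve_order=False):
    """Group dict objects by the clustered value of objs[i][key].

    preserve_order=False: sort by cluster id first, so every member of a
    cluster ends up in one group regardless of input order.
    preserve_order=True: keep input order and only group consecutive
    objects that share a cluster (assumed behavior, see lead-in)."""
    cluster_of = cluster_values([o[key] for o in objs], tolerance)
    tagged = [(o, cluster_of[o[key]]) for o in objs]
    if not preserve_order:
        tagged.sort(key=itemgetter(1))  # stable sort by cluster id
    return [[o for o, _ in group]
            for _, group in groupby(tagged, key=itemgetter(1))]


# Two columns, given in text-flow order: column 1 (A, B), then column 2 (C, D).
chars = [
    {"text": "A", "top": 10},
    {"text": "B", "top": 20},
    {"text": "C", "top": 10},
    {"text": "D", "top": 20},
]

by_position = cluster_sketch(chars, "top", tolerance=3)
by_flow = cluster_sketch(chars, "top", tolerance=3, preserve_order=True)

print([[c["text"] for c in line] for line in by_position])
print([[c["text"] for c in line] for line in by_flow])
```

In this toy input, sorting by cluster merges characters from both columns into the same line, while keeping input order leaves the text flow intact, which is the kind of difference a `use_text_flow` option depends on.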