dhdaines closed this 10 months ago
Merging #983 (ed707a7) into develop (336f83f) will not change coverage. The diff coverage is 100.00%.

Coverage Diff | develop | #983 | +/- |
---|---|---|---|
Coverage | 100.00% | 100.00% | |
Files | 18 | 18 | |
Lines | 1613 | 1615 | +2 |
Hits | 1613 | 1615 | +2 |

Files Changed | Coverage Δ | |
---|---|---|
pdfplumber/utils/text.py | 100.00% <ø> (ø) | |
pdfplumber/utils/clustering.py | 100.00% <100.00%> (ø) | |
Improves "accuracy" (the metric used is suboptimal; it should be Levenshtein distance on words rather than characters, but whatever) on the benchmarks mentioned above from 75% to 93%, putting pdfplumber at the same level as poppler and better than pdfminer.six: https://github.com/dhdaines/benchmarks
There is definitely room for improvement, particularly since we can take advantage of tagged PDFs when they exist!
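
To make the remark about the metric concrete, here is a minimal sketch of a Levenshtein-based similarity that can be computed over characters or over words. The function names and the normalisation by the longer sequence are assumptions for illustration, not the scoring code used in the benchmark repository.

```python
from typing import Hashable, Sequence


def levenshtein(a: Sequence[Hashable], b: Sequence[Hashable]) -> int:
    """Edit distance between two sequences (insert/delete/substitute, cost 1 each)."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete item_a
                    current[j - 1] + 1,      # insert item_b
                    previous[j - 1] + cost,  # substitute or match
                )
            )
        previous = current
    return previous[-1]


def similarity(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Hypothetical score: 1 - distance / length of the longer sequence."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


reference = "the quick brown fox"
extracted = "quick the brown fox"  # two words out of reading order

char_score = similarity(reference, extracted)                  # characters as units
word_score = similarity(reference.split(), extracted.split())  # words as units
```

In this toy case the word-level score is 0.5 (two of four words are misplaced), while the character-level score is higher, so the character metric is more forgiving of reading-order mistakes under these assumptions.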
Thank you! This was a great catch, and the implementation makes sense to me. Nice, too, to have the `preserve_order` argument available in `cluster_objects(...)`, which I could see being useful in additional situations. Merging.
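
For illustration, the effect of an order-preserving option can be sketched as follows. This is a simplified, hypothetical stand-in (`cluster_sketch`), not pdfplumber's actual `cluster_objects` implementation; the gap-based grouping rule and the sample data are assumptions.

```python
from itertools import groupby
from operator import itemgetter


def cluster_sketch(items, key_fn, tolerance, preserve_order=False):
    """Group items whose key values lie within `tolerance` of a neighbour.

    Hypothetical sketch: with preserve_order=True the input order is kept,
    so items are only grouped with adjacent items of the same cluster.
    """
    # Assign a cluster id to each distinct key value; a gap larger than
    # `tolerance` between consecutive sorted values starts a new cluster.
    cluster_of = {}
    cluster_id = -1
    last = None
    for value in sorted({key_fn(x) for x in items}):
        if last is None or value - last > tolerance:
            cluster_id += 1
        cluster_of[value] = cluster_id
        last = value

    tagged = [(x, cluster_of[key_fn(x)]) for x in items]
    if not preserve_order:
        # Stable sort: brings each cluster together, ordered by cluster id.
        tagged.sort(key=itemgetter(1))

    # groupby only merges consecutive items that share a cluster id.
    return [[x for x, _ in group] for _, group in groupby(tagged, key=itemgetter(1))]


# (text, vertical position) pairs in the order they appear in the input
chars = [("a", 10), ("b", 10.5), ("c", 30), ("d", 10.2)]

sorted_clusters = cluster_sketch(chars, itemgetter(1), tolerance=1)
ordered_clusters = cluster_sketch(chars, itemgetter(1), tolerance=1, preserve_order=True)
```

Here `sorted_clusters` groups `a`, `b`, and `d` together ahead of `c`, whereas `ordered_clusters` keeps the input sequence and yields three runs: `a, b`, then `c`, then `d`.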
As the title says (though the actual change is in `cluster_objects`)! Fixes #982