Respect `use_text_flow` in `extract_text`

jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.

MIT License

6.02k stars 619 forks

Respect `use_text_flow` in `extract_text` #983

Closed dhdaines closed 10 months ago

dhdaines commented 10 months ago

As the title says (though the actual change is in cluster_objects)! Fixes #982

codecov[bot] commented 10 months ago

| Files Changed | Coverage Δ |
|---|---|
| pdfplumber/utils/text.py | `100.00% <ø> (ø)` |
| pdfplumber/utils/clustering.py | `100.00% <100.00%> (ø)` |

This improves "accuracy" on the benchmarks mentioned above from 75% to 93% (the metric used is suboptimal: it should be Levenshtein distance on words rather than characters, but whatever), putting pdfplumber at the same level as poppler and ahead of pdfminer.six: https://github.com/dhdaines/benchmarks
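To illustrate the point about the metric, here is a minimal sketch of an edit-distance accuracy computed over either characters or words. The function names (`levenshtein`, `accuracy`) and the normalization by the longer sequence are assumptions made for illustration; the benchmark repository may well define its metric differently.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (strings or lists of tokens)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (x != y),  # substitution (free if equal)
            ))
        prev = cur
    return prev[-1]


def accuracy(reference, hypothesis, unit="char"):
    """Hypothetical accuracy score: 1 - distance / length of the longer sequence."""
    if unit == "word":
        ref, hyp = reference.split(), hypothesis.split()
    else:
        ref, hyp = list(reference), list(hypothesis)
    longest = max(len(ref), len(hyp), 1)
    return 1 - levenshtein(ref, hyp) / longest


# One dropped letter barely moves the character score but costs a whole word.
char_score = accuracy("hello world", "hello word", unit="char")
word_score = accuracy("hello world", "hello word", unit="word")
```

Under these assumptions the word-level score penalizes a damaged word in full, while the character-level score only counts the individual character edits.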

There is definitely room for improvement, particularly since we can take advantage of tagged PDFs when they exist!

jsvine commented 10 months ago

Thank you! This was a great catch, and the implementation makes sense to me. Nice, too, to have the preserve_order argument available in cluster_objects(...), which I could see being useful in additional situations. Merging.
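For readers who have not looked at the diff, the effect of an order-preserving option in a clustering helper can be sketched as follows. This is a simplified, self-contained illustration and not pdfplumber's actual code: `cluster_ids` and `cluster_sketch` are hypothetical names, and only the `preserve_order` argument name comes from the discussion above.

```python
from itertools import groupby


def cluster_ids(values, tolerance):
    """Assign a cluster id to each distinct value; sorted neighbours
    no more than `tolerance` apart share an id."""
    ids = {}
    cluster = -1
    prev = None
    for v in sorted(set(values)):
        if prev is None or v - prev > tolerance:
            cluster += 1
        ids[v] = cluster
        prev = v
    return ids


def cluster_sketch(items, key, tolerance, preserve_order=False):
    """Group items by clustered key value (hypothetical helper).

    By default items are reordered so each cluster forms one group.
    With preserve_order=True the input order is kept and only runs of
    consecutive items in the same cluster are grouped together.
    """
    ids = cluster_ids([key(x) for x in items], tolerance)
    tagged = [(x, ids[key(x)]) for x in items]
    if not preserve_order:
        tagged.sort(key=lambda pair: pair[1])  # stable sort by cluster id
    return [
        [x for x, _ in group]
        for _, group in groupby(tagged, key=lambda pair: pair[1])
    ]


# Two columns listed in reading (text flow) order: left column, then right.
words = [("Left1", 10), ("Left2", 30), ("Right1", 10), ("Right2", 30)]


def top(word):
    return word[1]


by_line = cluster_sketch(words, top, tolerance=3)
in_flow = cluster_sketch(words, top, tolerance=3, preserve_order=True)
```

In this sketch the default call merges words that share a vertical position into one line, interleaving the two columns, while the order-preserving call keeps the left column ahead of the right one.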
