Closed dhdaines closed 10 months ago
Thanks! PR merged.
@jsvine Haha, this is just what I need; until now I used pymupdf to extract text in its underlying flow order. By the way, could you also revise the description of extract_text in the documentation on the home page? It would be more helpful for somebody new.
Describe the bug
With a multi-column PDF such as the ones used in the benchmarks at https://github.com/py-pdf/benchmarks/tree/main, one might expect use_text_flow=True to do the same thing with extract_text as it does with extract_words, namely, return the words in the order they appear in the PDF file. This is unfortunately not the case.
Code to reproduce the problem
PDF file
https://arxiv.org/pdf/1601.03642.pdf
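A minimal reproduction sketch, assuming the PDF above has been saved locally as 1601.03642.pdf. The local filename, the page index, and the helper name compare_flow are assumptions for illustration and are not taken from the original report:

```python
def compare_flow(pdf_path="1601.03642.pdf", page_number=0):
    """Return (text, joined_words) for one page, both extracted with use_text_flow=True.

    pdf_path and page_number are illustrative defaults, not values from the report.
    """
    # Third-party dependency, imported lazily so the helper can be defined without it.
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_number]
        # extract_text forwards use_text_flow to the word extraction step.
        text = page.extract_text(use_text_flow=True)
        # extract_words returns dicts; the "text" key holds each word.
        words = page.extract_words(use_text_flow=True)
    return text, " ".join(word["text"] for word in words)
```

Calling compare_flow() and printing the first few hundred characters of each returned string should make any difference in word order on a two-column page easy to spot.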
Expected behavior
The output of extract_text should follow the flow order of the document, e.g. should look something like it does with extract_words:
Actual behavior
It doesn't do that :)
Environment
Additional context
The explanation is pretty simple - while WordExtractor takes care to not sort the words when use_text_flow is True, and WordMap makes a good faith effort, it then proceeds to pass this not-sorted list to cluster_objects which ... sorts it unconditionally.
The fix seems to be equally simple. Just don't sort in cluster_objects if use_text_flow or presorted is True. PR to follow.
This should help pdfplumber achieve better scores on the benchmark above, since this is the main thing that pdfminer.six for example is doing differently to get a better score. But it's also an actual bug :)
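To illustrate the mechanism described above, here is a simplified stand-in written for this explanation. It is not pdfplumber's actual cluster_objects; the function names cluster_lines and render, the word dicts, and the tolerance value are all hypothetical. It shows how an unconditional sort on the vertical position interleaves two columns, while skipping the sort keeps the flow order:

```python
def cluster_lines(words, tolerance=3, presorted=False):
    """Group words into lines by their "top" coordinate.

    Simplified stand-in, not pdfplumber's implementation. When presorted is
    False the words are sorted by "top" first, which discards the flow order.
    """
    if not presorted:
        words = sorted(words, key=lambda w: w["top"])
    lines = []
    for word in words:
        # Extend the current line if this word sits at (nearly) the same height
        # as the previous word; otherwise start a new line.
        if lines and abs(word["top"] - lines[-1][-1]["top"]) <= tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    return lines


def render(lines):
    """Join each line's words with spaces and the lines with newlines."""
    return "\n".join(" ".join(w["text"] for w in line) for line in lines)


# Two columns in flow order: the whole left column, then the whole right column.
flow = [
    {"text": "left-1", "top": 10},
    {"text": "left-2", "top": 20},
    {"text": "right-1", "top": 10},
    {"text": "right-2", "top": 20},
]

sorted_text = render(cluster_lines(flow))                # columns interleaved
flow_text = render(cluster_lines(flow, presorted=True))  # flow order kept
```

In this toy input, the sorted variant merges words from both columns onto the same line, which is the behavior the report describes for extract_text on multi-column pages.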