jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License

`extract_text(use_text_flow=True)` apparently does nothing #982

Closed dhdaines closed 10 months ago

dhdaines commented 10 months ago

Describe the bug

With a multi-column PDF such as the ones used in the benchmarks at https://github.com/py-pdf/benchmarks/tree/main, one might expect use_text_flow=True to do the same thing with extract_text as it does with extract_words, namely, return the words in the order they appear in the PDF file. This is unfortunately not the case.

Code to reproduce the problem

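import pdfplumber

with pdfplumber.open("1601.03642.pdf") as pdf:
    page = pdf.pages[0]
    # Slice of the extract_text output, which should follow flow order
    print(page.extract_text(use_text_flow=True)[100:250])
    # For comparison: words 9-29 from extract_words, joined with spaces
    print(" ".join(x["text"] for x in page.extract_words(use_text_flow=True)[9:30]))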

PDF file

https://arxiv.org/pdf/1601.03642.pdf

Expected behavior

The output of extract_text should follow the flow order of the document, i.e. it should look something like the output of extract_words:

Abstract—Recent machine learning techniques can be modified to produce creative results. Those results did not exist before; it is not a

Actual behavior

It doesn't do that :)

de x w3
3 .
wn . .
xn
Abstract—Recent machine learning techniques can be modified (a) Exampleofanartificialneuronunit. (b) Avisualizationofasimplefeed

Environment

Additional context

The explanation is pretty simple - while WordExtractor takes care to not sort the words when use_text_flow is True, and WordMap makes a good faith effort, it then proceeds to pass this not-sorted list to cluster_objects which ... sorts it unconditionally.

The fix seems to be equally simple. Just don't sort in cluster_objects if use_text_flow or presorted is True. PR to follow.
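
To illustrate the idea, here is a minimal, self-contained sketch. It is not pdfplumber's actual code: `make_cluster_ids` and `cluster_items` are hypothetical stand-ins for the clustering step, and the `presorted` flag is only an assumption about how the fix might be exposed. The toy input mimics a two-column page whose words arrive in flow order.

```python
from itertools import groupby


def make_cluster_ids(values, tolerance):
    """Assign a cluster id to each distinct value; consecutive sorted
    values that differ by at most `tolerance` share an id."""
    ids = {}
    cluster = -1
    prev = None
    for v in sorted(set(values)):
        if prev is None or v - prev > tolerance:
            cluster += 1
        ids[v] = cluster
        prev = v
    return ids


def cluster_items(items, key, tolerance, presorted=False):
    """Group items into runs of equal cluster id (hypothetical helper).

    With presorted=False the items are first sorted by cluster id, which
    discards the incoming order. With presorted=True the incoming order
    is kept and only consecutive items in the same cluster are grouped.
    """
    ids = make_cluster_ids([key(item) for item in items], tolerance)
    tagged = [(ids[key(item)], item) for item in items]
    if not presorted:
        tagged.sort(key=lambda pair: pair[0])  # the unconditional sort
    return [
        [item for _, item in group]
        for _, group in groupby(tagged, key=lambda pair: pair[0])
    ]


def texts(clusters):
    """Reduce each clustered word dict to its text, for readability."""
    return [[word["text"] for word in cluster] for cluster in clusters]


# Toy words in flow order: the left column first, then the right column.
words = [
    {"text": "Abstract", "top": 10},
    {"text": "Recent", "top": 20},
    {"text": "(a)", "top": 10},
    {"text": "(b)", "top": 20},
]

by_top = lambda word: word["top"]
print(texts(cluster_items(words, by_top, tolerance=1)))
# [['Abstract', '(a)'], ['Recent', '(b)']]
print(texts(cluster_items(words, by_top, tolerance=1, presorted=True)))
# [['Abstract'], ['Recent'], ['(a)'], ['(b)']]
```

With the sort in place, words from both columns that share a vertical position end up on the same line, which is the interleaving shown under "Actual behavior". Skipping the sort keeps the left column ahead of the right one.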

This should help pdfplumber achieve better scores on the benchmark above, since this is the main thing that pdfminer.six for example is doing differently to get a better score. But it's also an actual bug :)

jsvine commented 10 months ago

Thanks! PR merged.

cobaltautomationdev commented 7 months ago

@jsvine Haha, this is just what I needed. I was using PyMuPDF to extract text in its underlying flow order before. By the way, could you also update the documentation for extract_text on the home page? That would be more helpful for anyone new.