jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License

extracted word is broken #964

Closed jnhyperion closed 10 months ago

jnhyperion commented 11 months ago

Code to reproduce the problem

page.extract_text()

PDF file

example.pdf

Expected behavior

extracted line:

VLHDU8SHRR Homeowner Discount .....

Actual behavior

VLHDU8SHRR H o m e o w ner Discount .....

Additional context

Passing use_text_flow=True avoids this bug, but it causes other problems with the order of the extracted text, for example:

expected: foo: bar
actual: bar foo:

jsvine commented 11 months ago

Hi @jnhyperion, and thanks for your interest in this library. Have you tried adjusting the x_tolerance parameter? (See this section of the README.md for more detail.)

jnhyperion commented 11 months ago

@jsvine I've already tried that (with values ranging from 0.001 to 3000), and it doesn't work in my case.

page.extract_text(x_tolerance=0.001)
page.extract_text(x_tolerance=3000)
jsvine commented 11 months ago

Thanks for clarifying. It appears that the issue stems from the PDF including extraneous whitespace characters, in particular a long string of them that overlap with the "Homeowner" text:

import pdfplumber

pdf = pdfplumber.open("./example.pdf")
page = pdf.pages[0]
im = page.to_image()

whitespace_chars = [ c for c in page.chars if c["text"] == " " ]
im.reset().draw_rects(whitespace_chars)

[image: the page with the whitespace characters outlined]
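To illustrate why no x_tolerance value helps here, the following is a minimal, simplified sketch of left-to-right word grouping. It is not pdfplumber's actual implementation; `group_chars_into_words`, `letters`, and `stray_space` are hypothetical names and made-up data, and only the `text`, `x0`, and `x1` keys mirror pdfplumber's char dictionaries. The assumption is that chars are ordered by `x0` and that any whitespace character ends the current word.

```python
def group_chars_into_words(chars, x_tolerance=3):
    """Simplified sketch: a word ends at a space or at a gap wider than x_tolerance."""
    words, current, last_x1 = [], "", None
    for c in sorted(chars, key=lambda c: c["x0"]):
        is_space = c["text"] == " "
        gap_too_wide = last_x1 is not None and c["x0"] > last_x1 + x_tolerance
        # Flush the word collected so far when a boundary is reached.
        if (is_space or gap_too_wide) and current:
            words.append(current)
            current = ""
        if not is_space:
            current += c["text"]
        last_x1 = c["x1"]
    if current:
        words.append(current)
    return words


# Made-up data: four adjacent letters, plus one space whose box lies on top of them.
letters = [
    {"text": "H", "x0": 0, "x1": 5},
    {"text": "o", "x0": 5, "x1": 10},
    {"text": "m", "x0": 10, "x1": 15},
    {"text": "e", "x0": 15, "x1": 20},
]
stray_space = {"text": " ", "x0": 7, "x1": 9}

clean = group_chars_into_words(letters)
broken_small = group_chars_into_words(letters + [stray_space], x_tolerance=0.001)
broken_large = group_chars_into_words(letters + [stray_space], x_tolerance=3000)
```

In this sketch the overlapping space is sorted in between "o" and "m", so the word is split no matter how small or large the tolerance is.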

To resolve this, you'll want to filter out those whitespace characters (something that pdfplumber does not do automatically, because many PDFs need them for correct extraction):

filtered = page.filter(lambda obj: obj.get("text") != " ")
print(filtered.extract_text(x_tolerance=1))

Returns what I think you want (although perhaps you also want layout=True?):

Policy Number: VLHDU8SHRR
primary driver
Page 3 of 3
Premium discounts
Policy
VLHDU8SHRR Homeowner Discount, PaperLess Discount, E-Signature Discount, Online Quote
Discount, Continuous Insurance Discount, Automatic Card Payments Discount,
Advance Quote Factor
Failure to pay renewal premium
If you do not pay the minimum amount due on or before the due date, your coverage will end on 02/09/2024.
However, if your payment is received or postmarked by 02/10/2024, we will renew your policy with a lapse in
coverage. Your coverage will be renewed the day after your payment is received or postmarked.
Form WI005 (02/22)
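Because some PDFs rely on their whitespace characters, a more selective variant of the filter above might drop only the spaces that sit on top of other glyphs. The following is a sketch under that assumption; `boxes_overlap`, `drop_overlapping_spaces`, and the sample data are hypothetical, and only the `text`, `x0`, `x1`, `top`, and `bottom` keys mirror pdfplumber's char dictionaries.

```python
def boxes_overlap(a, b):
    """True if the two bounding boxes share some area (touching edges do not count)."""
    return (
        a["x0"] < b["x1"] and b["x0"] < a["x1"]
        and a["top"] < b["bottom"] and b["top"] < a["bottom"]
    )


def drop_overlapping_spaces(chars):
    """Keep every char except spaces whose box overlaps a non-space char."""
    glyphs = [c for c in chars if c["text"] != " "]
    return [
        c for c in chars
        if c["text"] != " " or not any(boxes_overlap(c, g) for g in glyphs)
    ]


# Made-up data: "Ho", a stray space on top of it, a genuine space, then "D".
sample = [
    {"text": "H", "x0": 0, "x1": 5, "top": 0, "bottom": 10},
    {"text": "o", "x0": 5, "x1": 10, "top": 0, "bottom": 10},
    {"text": " ", "x0": 3, "x1": 8, "top": 0, "bottom": 10},    # overlaps "H" and "o"
    {"text": " ", "x0": 10, "x1": 13, "top": 0, "bottom": 10},  # genuine word gap
    {"text": "D", "x0": 13, "x1": 18, "top": 0, "bottom": 10},
]

kept_text = [c["text"] for c in drop_overlapping_spaces(sample)]
```

With real pages, the list could be built from page.chars. Real glyph boxes sometimes overlap genuine spaces by a small amount, so a minimum-overlap threshold may be needed; the check is also quadratic in the number of chars, which may matter on dense pages.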
jnhyperion commented 11 months ago

After using this solution on other PDF files, another issue occurs:

test code:

import difflib
import os

import pdfplumber

example_pdf = "example.pdf"
page_texts1 = []  # whitespace characters filtered out, x_tolerance=1
page_texts2 = []  # default extraction

with pdfplumber.open(example_pdf) as pdf:
    for page in pdf.pages:
        filtered = page.filter(lambda obj: obj.get("text") != " ")
        page_texts1.append(filtered.extract_text(x_tolerance=1))
        page_texts2.append(page.extract_text())

c1 = "\n".join(page_texts1)
c2 = "\n".join(page_texts2)

# Write a side-by-side HTML diff of the two extractions.
report = difflib.HtmlDiff().make_file(c1.splitlines(), c2.splitlines())
with open("report.html", "w") as f:
    f.write(report)

# Open the report only after the file has been closed; "open" is macOS-specific.
os.system("open report.html")

the output:

[two screenshots of the generated diff report]

So adjusting x_tolerance probably will not completely solve this issue, because the right x_tolerance can differ from one PDF file to another, and even from line to line within the same file.

jsvine commented 11 months ago

> So adjusting x_tolerance probably will not completely solve this issue, because the right x_tolerance can differ from one PDF file to another, and even from line to line within the same file.

Yes, this is certainly the case, as PDFs themselves are quite varied and designed in an enormous range of styles, layouts, etc.

The core functions of pdfplumber are not meant to provide a universal solution for all PDFs, but rather to give the user control over extraction of individual PDFs, and to keep the logic as simple as possible. That said, pdfplumber should hopefully enable the construction of more complex extraction logic. For instance, depending on your corpus of PDFs, you may first want to group or analyze text by font size, if particularly small text is a concern.
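As one possible reading of the font-size suggestion, the sketch below groups chars by their rounded `size` value and derives a tolerance from each size. The helper names, the sample data, and the 0.15 ratio are hypothetical illustrations, not pdfplumber defaults; only the `text` and `size` keys mirror pdfplumber's char dictionaries.

```python
from collections import defaultdict


def group_chars_by_size(chars, ndigits=1):
    """Bucket chars by font size, rounded so that tiny float differences merge."""
    groups = defaultdict(list)
    for c in chars:
        groups[round(c["size"], ndigits)].append(c)
    return dict(groups)


def tolerance_for_size(size, ratio=0.15):
    """Hypothetical heuristic: scale the horizontal tolerance with the font size."""
    return size * ratio


# Made-up data: two body-text chars and one small-print char.
sample_chars = [
    {"text": "A", "size": 10.0},
    {"text": "b", "size": 10.02},
    {"text": "c", "size": 6.0},
]

by_size = group_chars_by_size(sample_chars)
tolerances = {size: tolerance_for_size(size) for size in by_size}
```

Each group could then be extracted separately with its own tolerance, at the cost of having to merge the results back into reading order afterwards.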

It's also possible that, if you're just looking for a universal text-extractor, another tool may solve this problem more directly.

jnhyperion commented 11 months ago

I see, thanks for your explanation. At least my PR https://github.com/jsvine/pdfplumber/pull/965 resolves this issue and does not introduce other issues (though it has only been tested on a few of our PDF files).

jsvine commented 10 months ago

Thanks, I'll close this issue, and continue the discussion in the PR.