Closed jnhyperion closed 10 months ago
Hi @jnhyperion, and thanks for your interest in this library. Have you tried adjusting the x_tolerance
parameter? (See this section of the README.md for more detail.)
@jsvine I've already tried that (with values ranging from 0.001 to 3000), and it doesn't work for my case.
page.extract_text(x_tolerance=0.001)
page.extract_text(x_tolerance=3000)
Thanks for clarifying. It appears that the issue stems from the PDF including extraneous whitespace characters, in particular a long string of them that overlap with the "Homeowner" text:
import pdfplumber
pdf = pdfplumber.open("./example.pdf")
page = pdf.pages[0]
im = page.to_image()
whitespace_chars = [ c for c in page.chars if c["text"] == " " ]
im.reset().draw_rects(whitespace_chars)
To resolve this, you'll want to filter out those whitespace characters (something that pdfplumber
does not do automatically, because many PDFs need them for correct extraction):
filtered = page.filter(lambda obj: obj.get("text") != " ")
print(filtered.extract_text(x_tolerance=1))
Returns what I think you want (although perhaps you also want layout=True?):
Policy Number: VLHDU8SHRR
primary driver
Page 3 of 3
Premium discounts
Policy
VLHDU8SHRR Homeowner Discount, PaperLess Discount, E-Signature Discount, Online Quote
Discount, Continuous Insurance Discount, Automatic Card Payments Discount,
Advance Quote Factor
Failure to pay renewal premium
If you do not pay the minimum amount due on or before the due date, your coverage will end on 02/09/2024.
However, if your payment is received or postmarked by 02/10/2024, we will renew your policy with a lapse in
coverage. Your coverage will be renewed the day after your payment is received or postmarked.
Form WI005 (02/22)
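Because many PDFs need their whitespace characters for correct extraction, a narrower filter may be safer than dropping every space: remove only the spaces whose bounding box overlaps a non-space character. What follows is a minimal sketch of that idea, not part of pdfplumber. The helper names (boxes_overlap, overlapping_spaces, make_space_filter) and the min_overlap threshold are hypothetical; the only assumption about the data is that each char is a dict with "text", "x0", "x1", "top" and "bottom" keys, as in page.chars.

```python
def boxes_overlap(a, b, min_overlap=0.5):
    """True if boxes a and b overlap by more than min_overlap on both axes."""
    x_overlap = min(a["x1"], b["x1"]) - max(a["x0"], b["x0"])
    y_overlap = min(a["bottom"], b["bottom"]) - max(a["top"], b["top"])
    return x_overlap > min_overlap and y_overlap > min_overlap


def overlapping_spaces(chars, min_overlap=0.5):
    """Return the space chars that sit on top of at least one visible glyph."""
    spaces = [c for c in chars if c["text"] == " "]
    glyphs = [c for c in chars if c["text"] != " "]
    # Quadratic in the number of chars; acceptable for a single page.
    return [s for s in spaces
            if any(boxes_overlap(s, g, min_overlap) for g in glyphs)]


def make_space_filter(chars, min_overlap=0.5):
    """Build a predicate that rejects only the overlapping spaces."""
    drop = {(s["x0"], s["top"]) for s in overlapping_spaces(chars, min_overlap)}

    def keep(obj):
        is_dropped_space = (
            obj.get("text") == " "
            and (obj.get("x0"), obj.get("top")) in drop
        )
        return not is_dropped_space

    return keep
```

With pdfplumber, the predicate could then be passed to the same filter call used above, e.g. page.filter(make_space_filter(page.chars)). The 0.5 threshold is a guess and would likely need tuning per document.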
After using this solution on other PDF files, another issue occurs:
test code:
import difflib
import os
import pdfplumber

page_texts1 = []
page_texts2 = []
example_pdf = "example.pdf"

# Pass 1: extract text with the whitespace characters filtered out
with pdfplumber.open(example_pdf) as f:
    for page in f.pages:
        filtered = page.filter(lambda obj: obj.get("text") != " ")
        page_texts1.append(filtered.extract_text(x_tolerance=1))

# Pass 2: extract text with the default settings
with pdfplumber.open(example_pdf) as f:
    for page in f.pages:
        page_texts2.append(page.extract_text())

c1 = "\n".join(page_texts1)
c2 = "\n".join(page_texts2)

# Write a side-by-side HTML diff of the two extractions
c = difflib.HtmlDiff().make_file(c1.splitlines(), c2.splitlines())
with open("report.html", "w") as f:
    f.write(c)
os.system("open report.html")  # "open" is the macOS command
the output:
So adjusting x_tolerance will probably not solve this issue completely: the appropriate x_tolerance can differ between PDF files, and even between lines of the same file.
Yes, this is certainly the case, as PDFs themselves are quite varied and designed in an enormous range of styles/layouts/etc.
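To illustrate why a single fixed tolerance cannot suit every line, here is a minimal sketch that groups the characters of one line into words using a gap threshold proportional to font size. It is not pdfplumber's own word-grouping logic: the split_words helper and its ratio parameter are hypothetical, and it assumes each char is a dict with "text", "x0", "x1" and "size" keys, as in page.chars.

```python
def split_words(chars, ratio=0.15):
    """Group the chars of a single line into words.

    A new word starts when the horizontal gap to the previous char
    exceeds ratio * (font size of the previous char).
    """
    words = []
    current = []
    for ch in sorted(chars, key=lambda c: c["x0"]):
        if current:
            prev = current[-1]
            gap = ch["x0"] - prev["x1"]
            if gap > ratio * prev["size"]:
                words.append("".join(c["text"] for c in current))
                current = []
        current.append(ch)
    if current:
        words.append("".join(c["text"] for c in current))
    return words
```

A gap of 2 points separates words in 6-point text but is ordinary letter spacing in 30-point text, so a threshold that scales with size handles both lines where one fixed value would not.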
The core functions of pdfplumber are not meant to provide a universal solution for all PDFs, but rather to give the user control over extraction of individual PDFs, and to keep the logic as simple as possible. That said, pdfplumber should hopefully enable the construction of more complex extraction logic. For instance, depending on your corpus of PDFs, you may want first to group/analyze text by font size, if particularly-small text is a concern.
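The font-size grouping suggested here could be sketched roughly as follows. The helper names (group_chars_by_size, make_min_size_filter) and the rounding precision are hypothetical; the sketch assumes char dicts carry a "size" key and that every object carries an "object_type" key, as pdfplumber's objects do.

```python
from collections import defaultdict


def group_chars_by_size(chars, ndigits=1):
    """Bucket chars by font size, rounded so near-identical sizes merge."""
    groups = defaultdict(list)
    for ch in chars:
        groups[round(ch["size"], ndigits)].append(ch)
    return dict(groups)


def make_min_size_filter(min_size):
    """Build a predicate that drops chars smaller than min_size.

    Non-char objects (lines, rects, ...) are kept untouched.
    """
    def keep(obj):
        return obj.get("object_type") != "char" or obj["size"] >= min_size
    return keep
```

Inspecting sorted(group_chars_by_size(page.chars)) shows which sizes occur on a page; page.filter(make_min_size_filter(7)) would then set aside anything below a chosen cutoff before extraction. The cutoff of 7 is only an example value.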
It's also possible that, if you're just looking for a universal text-extractor, another tool may solve this problem more directly.
I see, thanks for your explanation. At least my PR https://github.com/jsvine/pdfplumber/pull/965 resolves this issue and does not introduce other issues (only tested on a few of our PDF files).
Thanks, I'll close this issue, and continue the discussion in the PR.
Code to reproduce the problem
PDF file
example.pdf
Expected behavior
extracted line:
VLHDU8SHRR Homeowner Discount .....
Actual behavior
VLHDU8SHRR H o m e o w ner Discount .....
Environment
Additional context
Using the parameter use_text_flow=True avoids this bug, but it causes other extraction-format bugs, for example:
expected:
foo: bar
actual:
bar foo: