Function "extract_words" extracts words that don't exist in a PDF.

HKAFITGlitter commented 2 months ago

Describe the bug

I just want to extract the words in a PDF, but functions like extract_words(), extract_text(), and extract_text_lines() all return some words that are not displayed in the PDF. For example, it is supposed to return "观点聚焦Investment" but instead returns "[观Ta点bl聚e_焦yem Inevie1s]". Why does it return "Ta", "bl", "e", and "yem Inevie1s]"? How can I remove these characters? I only want to extract the words that are explicitly displayed in the PDF.

Code to reproduce the problem

import pdfplumber

pdf_path = r"test.pdf"
with pdfplumber.open(pdf_path) as pdf:
    page_0 = pdf.pages[0]
    print(page_0.extract_text())

PDF file

test.pdf

Expected behavior

It is supposed to return "观点聚焦 Investment Focus".

Actual behavior

However, it returns "[观Ta点bl聚e_焦yem Inevie1s] tment Focus".

Environment

pdfplumber version: 0.10.3
Python version: 3.10
OS: Windows

jsvine commented 1 month ago

The issue here appears to come from overlapping character text, partly due to invisible characters on the page. The simplest solution is to lower the y_tolerance parameter:
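
A minimal sketch of lowering `y_tolerance`, under stated assumptions: the helper name `extract_with_tolerance` is hypothetical, and the value 1 is only an illustrative choice (pdfplumber's default is 3). The right value for a given PDF may need some trial and error.

```python
def extract_with_tolerance(page, y_tolerance=1):
    """Extract text from a pdfplumber page using a tighter vertical tolerance.

    `page` is expected to be a pdfplumber page, e.g. pdf.pages[0].
    y_tolerance=1 is an illustrative assumption; the library default is 3.
    A smaller value makes pdfplumber less willing to merge characters
    at slightly different heights into the same line.
    """
    return page.extract_text(y_tolerance=y_tolerance)


# Example usage (requires pdfplumber and a local test.pdf):
#
#     import pdfplumber
#     with pdfplumber.open("test.pdf") as pdf:
#         print(extract_with_tolerance(pdf.pages[0]))
```

This only changes how characters are grouped into lines. The invisible characters are still extracted, just no longer interleaved with the visible ones.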

Another approach, if you wanted to remove those invisible characters that spell out [Table_Info] etc., would be to filter the page:

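# Keep every object except characters with fill color (1,) in the
# "ABCDEE+Calibri" font, which is how the invisible text appears in this PDF.
def not_hidden_text(obj):
    return not (
        obj.get("non_stroking_color") == (1,)
        and obj.get("fontname") == "ABCDEE+Calibri"
    )

# `page` is a pdfplumber page, e.g. pdf.pages[0].
print(page.filter(not_hidden_text).extract_text())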
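
The font name and color in the filter above are specific to this PDF. For a different file, one way to find the equivalent values is to group the page's characters by font and fill color and see which group spells out the unwanted text. The sketch below assumes character dicts shaped like the entries of `page.chars`; the helper name `group_chars_by_style` and the font name "SimHei" in the comment are hypothetical.

```python
from collections import defaultdict


def group_chars_by_style(chars):
    """Group character text by (fontname, non_stroking_color).

    `chars` is a list of dicts like pdfplumber's page.chars.
    Returns a dict mapping each (fontname, color) pair to the
    concatenated text of the characters that share it.
    """
    groups = defaultdict(list)
    for ch in chars:
        color = ch.get("non_stroking_color")
        # Lists are not hashable, so normalize them to tuples.
        if isinstance(color, list):
            color = tuple(color)
        groups[(ch.get("fontname"), color)].append(ch.get("text", ""))
    return {key: "".join(texts) for key, texts in groups.items()}


# Example usage (requires pdfplumber and a local test.pdf):
#
#     import pdfplumber
#     with pdfplumber.open("test.pdf") as pdf:
#         for style, text in group_chars_by_style(pdf.pages[0].chars).items():
#             print(style, repr(text[:40]))
#
# A group such as ("SimHei", (0,)) would hold visible text, while a group
# whose text reads like "[Table_..." is a candidate for filtering out.
```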

HKAFITGlitter commented 1 month ago

thanks!

jsvine / pdfplumber