Closed HKAFITGlitter closed 1 month ago
The issue here appears to come from overlapping text, caused in part by invisible characters on the page. The simplest solution is to lower the y_tolerance parameter passed to extract_text() (its default is 3).
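As a rough sketch of that suggestion: the helper name `extract_text_with_tolerance` and the value `1` below are illustrative assumptions, not settings taken from this thread, and the right value may need some trial and error for a given PDF.

```python
def extract_text_with_tolerance(pdf_path, page_index=0, y_tolerance=1):
    """Extract text from one page using a tighter vertical tolerance.

    The function name and the default of 1 are hypothetical choices
    made for illustration only.
    """
    # Imported here so the helper can be defined even where pdfplumber
    # is not installed; calling it does require the library.
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_index]
        # pdfplumber's default y_tolerance is 3. A smaller value makes it
        # less likely that characters sitting at slightly different
        # vertical positions are merged into the same line of text.
        return page.extract_text(y_tolerance=y_tolerance)
```

Calling `print(extract_text_with_tolerance("test.pdf"))` would then print the first page's text with the tighter setting.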
Another approach, if you want to remove the invisible characters that spell out [Table_Info] etc., would be to filter the page:
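def not_hidden_text(obj):
    # Keep every object except text drawn with non-stroking color (1,)
    # (white in a grayscale color space) in the "ABCDEE+Calibri" font.
    return not (
        obj.get("non_stroking_color") == (1,)
        and obj.get("fontname") == "ABCDEE+Calibri"
    )

print(page.filter(not_hidden_text).extract_text())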
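The color and font name in that filter are specific to this PDF. To find the equivalent values for another file, one option is to tally the characters on a page by font and fill color and look for an unexpected combination. The sketch below assumes the input is a list of character dictionaries shaped like pdfplumber's `page.chars`; the helper name `summarize_char_styles` and the sample font "ABCDEE+SimHei" are hypothetical.

```python
from collections import Counter


def summarize_char_styles(chars):
    """Count characters per (fontname, non_stroking_color) pair."""
    counts = Counter()
    for ch in chars:
        color = ch.get("non_stroking_color")
        # Defensive: convert lists to tuples so the pair is hashable.
        if isinstance(color, list):
            color = tuple(color)
        counts[(ch.get("fontname"), color)] += 1
    return counts


# Hypothetical sample data standing in for page.chars.
sample_chars = [
    {"text": "观", "fontname": "ABCDEE+SimHei", "non_stroking_color": (0,)},
    {"text": "点", "fontname": "ABCDEE+SimHei", "non_stroking_color": (0,)},
    {"text": "聚", "fontname": "ABCDEE+SimHei", "non_stroking_color": (0,)},
    {"text": "T", "fontname": "ABCDEE+Calibri", "non_stroking_color": (1,)},
    {"text": "a", "fontname": "ABCDEE+Calibri", "non_stroking_color": (1,)},
]

for style, count in summarize_char_styles(sample_chars).most_common():
    print(style, count)
```

With a real document, `summarize_char_styles(page.chars)` would give the same kind of tally, and a style that is white or otherwise invisible is a candidate for the filter function.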
thanks!
Describe the bug
I just want to extract the words in a PDF, but extract_words(), extract_text(), and extract_text_lines() all return some characters that are not displayed in the PDF. For example, the result should be "观点聚焦Investment", but it is "[观Ta点bl聚e_焦yem Inevie1s]". Why does it return "Ta", "bl", "e", and "yem Inevis 1s]"? How can I remove these characters? I only want to extract the words that are visibly displayed in the PDF.
Code to reproduce the problem
Python code:

import pdfplumber

pdf_path = r"test.pdf"
with pdfplumber.open(pdf_path) as pdf:
    page_0 = pdf.pages[0]
    print(page_0.extract_text())
PDF file
test.pdf
Expected behavior
It should return "观点聚焦 Investment Focus".
Actual behavior
However, it returns "[观Ta点bl聚e_焦yem Inevie1s] tment Focus".