jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License

When I use extract_text and extract_words, the output is empty #887

Closed Mankvis closed 1 year ago

Mankvis commented 1 year ago

Describe the bug

When I use extract_text and extract_words, the output is empty.

Code to reproduce the problem

import pdfplumber

# pdf_path is the path to the attached demo.pdf
with pdfplumber.open(pdf_path) as pdf:
    first_page = pdf.pages[0]
    print(first_page.extract_text())

PDF file

demo.pdf

Expected behavior

The Chinese text content of the PDF should be output.

Environment

jsvine commented 1 year ago

Hi @Mankvis, and thanks for your interest in this library. The page you've shared is, unfortunately, an image-based PDF page: it contains only a picture of the text, with no embedded text data for pdfplumber to extract. You can confirm this by trying to select the text on the page and paste it into a text document.

See this comment for an example of what you could do next: https://github.com/jsvine/pdfplumber/discussions/717#discussioncomment-3476384
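
The manual check described above (trying to select the text) can also be approximated in code. The following is a minimal sketch, not part of the original thread: it assumes that a page with no entries in `page.chars` but at least one entry in `page.images` is image-based. The helper names `classify_page` and `diagnose_pdf` are hypothetical, and the heuristic is rough; a scan could also be stored in a form that does not appear in `page.images`.

```python
from typing import List


def classify_page(num_chars: int, num_images: int) -> str:
    """Classify a page from its counts of text characters and embedded images.

    Hypothetical helper: the labels are illustrative, not pdfplumber terms.
    """
    if num_chars > 0:
        return "has-text"      # extract_text() has something to work with
    if num_images > 0:
        return "image-only"    # likely a scan; text extraction returns nothing
    return "empty"             # neither text characters nor images found


def diagnose_pdf(pdf_path: str) -> List[str]:
    """Return one label per page of the PDF at pdf_path."""
    import pdfplumber  # third-party; imported here so classify_page stays standalone

    with pdfplumber.open(pdf_path) as pdf:
        return [
            classify_page(len(page.chars), len(page.images))
            for page in pdf.pages
        ]
```

Calling `diagnose_pdf("demo.pdf")` on a file like the one attached here would be expected to report the first page as image-only, in which case the text has to be recovered by OCR rather than by `extract_text` or `extract_words`.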