page.extract_words() and page.extract_text() output is empty

jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.

MIT License


page.extract_words() and page.extract_text() output is empty #269

Closed NeoWang9999 closed 4 years ago

NeoWang9999 commented 4 years ago

Describe the bug

Both page.extract_words() and page.extract_text() return empty output.

Code to reproduce the problem

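import pdfplumber

pdf_path = "path/to/file.pdf"  # placeholder: path to the attached PDF

# laparams are passed through to pdfminer's layout analysis
with pdfplumber.open(pdf_path, laparams={"word_margin": 1.0}) as pdf:
    for p in pdf.pages:
        words = p.extract_words(keep_blank_chars=True)
        texts = p.extract_text()
        print(words, texts)  # both come back empty for this file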

PDF file

Sponsorship Agreement Template (CN) 1-未改动.pdf

Expected behavior

The output should not be empty.

Environment

pdfplumber version: [0.5.23]
Python version: [3.7.7]
OS: [Mac OS]

Other notes

When I tried pdfbox like this:

import pdfbox

p = pdfbox.PDFBox()
p.extract_text(input_path=f_path, console=False)  # f_path: path to the same PDF

it worked fine and wrote the text to a .txt file. However, I need the coordinate information for each char.

Please help me solve this problem, thank you.
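A quick way to narrow this kind of problem down is to look at page.chars directly: if it is empty, no characters were decoded at all, so no extract_words() or extract_text() setting can help. The following is only a minimal sketch. The helper names diagnose_page and diagnose_pdf and the wording of the messages are made up for illustration; page.chars and page.images are the pdfplumber attributes it relies on.

```python
def diagnose_page(n_chars, n_images):
    """Classify a page from its object counts (hypothetical helper)."""
    if n_chars > 0:
        # Characters were decoded, so empty output points at grouping settings.
        return "chars present: check extract_words/extract_text settings"
    if n_images > 0:
        return "no chars, images present: possibly a scan, or text was not decoded"
    return "no chars, no images: text was not decoded, or the page is blank"


def diagnose_pdf(pdf_path):
    """Return one diagnosis string per page of the PDF at pdf_path."""
    import pdfplumber  # imported here so diagnose_page works without pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        return [diagnose_page(len(p.chars), len(p.images)) for p in pdf.pages]
```

Calling diagnose_pdf(pdf_path) on the file in question would show, page by page, whether any char objects exist before word grouping is applied.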

samkit-jain commented 4 years ago

Hi @NeoWang9999, thanks for your interest in the library. pdfplumber is not able to extract the text because this PDF does not contain any embedded fonts. The reason pdfbox is able to extract it is most likely that it substitutes the missing font. When a PDF viewer substitutes a font, the result usually isn't what you intended.

NeoWang9999 commented 4 years ago

Hi @samkit-jain, thanks for your reply! Regarding:

"The reason pdfbox is able to extract it is most likely that it substitutes the missing font."

Is there any effort to make pdfplumber achieve the same result? It looks like the necessary information does exist in the file; we just need some extra processing to read it correctly. Thank you.

samkit-jain commented 4 years ago

@NeoWang9999 This feature request would be better suited to pdfminer, since that is what pdfplumber relies on. I am closing this issue for that reason. Feel free to reopen it if you have a different proposal.

mkl-public commented 4 years ago

As an aside:

"The reason pdfbox is able to extract it is most likely that it substitutes the missing font."

No, there is no need for that. The PDF objects describing the font provide all the information needed: an Encoding value of GB-EUC-H or GB-EUC-V, and a CIDSystemInfo ROS (Registry-Ordering-Supplement) of Adobe-GB1-0.
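To see these values for a given file, the font dictionaries can be read with pdfminer.six, which pdfplumber builds on. The following is a rough sketch rather than a definitive tool. The function names summarize_font and list_fonts are invented for illustration, and it assumes the usual layout of composite fonts (a Type0 font whose DescendantFonts entry carries CIDSystemInfo and FontDescriptor). Whether a font program is embedded is judged only by the presence of a FontFile, FontFile2 or FontFile3 key.

```python
def _text(value):
    """Render a PDF name, byte string or number as plain text."""
    if value is None:
        return None
    name = getattr(value, "name", None)  # pdfminer name objects expose .name
    if name is not None:
        value = name
    if isinstance(value, bytes):
        return value.decode("latin-1")
    return str(value)


def summarize_font(font, resolve=lambda obj: obj):
    """Collect the font fields that matter for decoding CJK text.

    `resolve` dereferences indirect objects; pass pdfminer's resolve1
    when working with a real PDF.
    """
    font = resolve(font)
    summary = {
        "subtype": _text(resolve(font.get("Subtype"))),
        "base_font": _text(resolve(font.get("BaseFont"))),
        "encoding": _text(resolve(font.get("Encoding"))),
        "cid_system_info": None,
        "embedded": False,
    }
    descendants = resolve(font.get("DescendantFonts")) or []
    for target in [font] + [resolve(d) for d in descendants]:
        info = resolve(target.get("CIDSystemInfo"))
        if info:
            keys = ("Registry", "Ordering", "Supplement")
            summary["cid_system_info"] = "-".join(
                _text(resolve(info.get(k))) for k in keys
            )
        descriptor = resolve(target.get("FontDescriptor")) or {}
        if any(k in descriptor for k in ("FontFile", "FontFile2", "FontFile3")):
            summary["embedded"] = True
    return summary


def list_fonts(pdf_path):
    """Return (page_number, resource_name, summary) for each font in use."""
    # Imported here so summarize_font can be used without pdfminer installed.
    from pdfminer.pdfdocument import PDFDocument
    from pdfminer.pdfpage import PDFPage
    from pdfminer.pdfparser import PDFParser
    from pdfminer.pdftypes import resolve1

    results = []
    with open(pdf_path, "rb") as fh:
        document = PDFDocument(PDFParser(fh))
        pages = PDFPage.create_pages(document)
        for page_number, page in enumerate(pages, start=1):
            fonts = resolve1(page.resources.get("Font")) or {}
            for name, ref in fonts.items():
                results.append((page_number, name, summarize_font(ref, resolve1)))
    return results
```

For a file like the one described here, one would expect entries whose encoding is GB-EUC-H or GB-EUC-V, whose cid_system_info is Adobe-GB1-0, and whose embedded flag is False.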