Closed NeoWang9999 closed 4 years ago
Hi @NeoWang9999, thanks for your interest in the library. pdfplumber is not able to extract the text because this PDF does not contain any embedded fonts. pdfbox is most likely able to extract it because it substitutes the missing font. When a PDF viewer substitutes a font, the result usually isn't what you intended.
Hi @samkit-jain, thanks for your reply! Regarding your comment:
"The reason pdfbox is able to extract could most likely because of it substituting the missing font."
Is there any effort to make pdfplumber achieve the same result? It looks like the necessary information does exist in the file; we just need to do some extra processing to read it correctly. Thank you.
@NeoWang9999 This feature request would be better suited to pdfminer, since that is what pdfplumber relies upon. I am closing this issue for that reason. Feel free to reopen it if you have a different proposal.
As an aside:
"The reason pdfbox is able to extract could most likely because of it substituting the missing font."
No, there is no need for that. The PDF objects describing the font provide all the information needed: an Encoding value of GB-EUC-H or GB-EUC-V and a CIDSystemInfo ROS (Registry-Ordering-Supplement) of Adobe-GB1-0.
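To illustrate why the Encoding value alone can be enough: GB-EUC-H and GB-EUC-V are predefined CMaps for the GB 2312 character set in EUC-CN encoding, so the raw string bytes can, to a first approximation, be decoded without the font program at all. The following is only a minimal sketch under that assumption. It uses Python's built-in `gb2312` codec as a stand-in for the real CMap-to-CID-to-Unicode lookup, and both the sample bytes and the helper name `decode_gb_euc` are hypothetical (the bytes are not taken from the PDF attached to this issue).

```python
def decode_gb_euc(data: bytes) -> str:
    """Approximate GB-EUC-H / GB-EUC-V decoding.

    Assumption: the bytes shown by the text operators are EUC-CN (GB 2312)
    codes, so Python's "gb2312" codec can map them straight to Unicode.
    A real extractor would go through the CMap and the Adobe-GB1 CID tables.
    """
    # Undecodable byte sequences become U+FFFD instead of raising an error.
    return data.decode("gb2312", errors="replace")


# Hypothetical sample: two double-byte EUC-CN codes, as they might appear
# in a content stream string for a font with Encoding GB-EUC-H.
sample = bytes.fromhex("d6d0cec4")
print(decode_gb_euc(sample))
```

This does not recover glyph positions by itself; it only shows that the character codes are interpretable from the Encoding entry, which is the point made above.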
Describe the bug
The output of page.extract_words() and page.extract_text() is empty.
Code to reproduce the problem
PDF file
Sponsorship Agreement Template (CN) 1-未改动.pdf
Expected behavior
The output is not empty.
Environment
Other notes
When I tried with pdfbox like:
it worked fine and output the txt file, but I need the character coordinate information.
Please help me solve this problem. Thank you.