Closed SteveSmirnoff closed 4 years ago
To extract text when Identity-H encoding is used, the PDF should have a "ToUnicode CMap". This PDF does not appear to have one: when you copy and paste text from the PDF, it comes out as gibberish.
Reference: https://tex.stackexchange.com/a/526168
I am closing this issue since it is more related to how the PDF is created, and there's not much that pdfplumber can do about it.
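For anyone who needs to triage a batch of files, here is a minimal sketch of a heuristic check, not part of pdfplumber. It assumes the extractor emits `(cid:N)` placeholders for glyphs it cannot map to Unicode, as pdfminer.six (which pdfplumber builds on) typically does. Some affected PDFs produce plain gibberish characters instead, and this check will not catch those. The names `unmapped_ratio` and `looks_unmapped` and the 0.3 threshold are hypothetical choices made for illustration.

```python
import re

# Placeholder an extractor may emit for a glyph with no Unicode mapping, e.g. "(cid:31)".
CID_TOKEN = re.compile(r"\(cid:\d+\)")


def unmapped_ratio(text: str) -> float:
    """Return the share of extracted glyphs that are (cid:N) placeholders."""
    placeholders = CID_TOKEN.findall(text)
    # Whatever is left after removing the placeholders counts as readable output.
    remaining = CID_TOKEN.sub("", text)
    readable = sum(1 for ch in remaining if not ch.isspace())
    total = len(placeholders) + readable
    return len(placeholders) / total if total else 0.0


def looks_unmapped(text: str, threshold: float = 0.3) -> bool:
    """Flag text that is probably missing its Unicode mapping (threshold is arbitrary)."""
    return unmapped_ratio(text) >= threshold
```

The text passed in would be whatever was extracted from a page, for example the result of pdfplumber's `page.extract_text()`, with `or ""` appended in case it returns nothing. Files flagged this way would likely need OCR or a re-export from the supplier rather than a different extraction setting.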
What are you trying to do?
I'm extracting text from Safety Data Sheets of different suppliers.
What code are you using to do it?
PDF file
https://chestertondocs.chesterton.com/Lubricants/218(E)%20HDP_B-NO.pdf
Expected behavior
Extract text from each page of the pdf
Actual behavior
Screenshots
Won't help.
Environment
Python version: 3.7
OS: Windows 10 (without admin rights)
requirements.txt:
Additional context
Text from about 75% of the other PDFs from the same source is extracted as expected; the remaining 25% have this problem. It might be the encoding of the files.