jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License

Text from pdf is not readable. #249

Closed SteveSmirnoff closed 4 years ago

SteveSmirnoff commented 4 years ago

What are you trying to do?

I'm extracting text from the Safety Data Sheets of different suppliers.

What code are you using to do it?

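import pdfplumber


def read_pdf(path):
    """Return the text of every page in the PDF at `path`, concatenated."""
    with pdfplumber.open(path) as pdf_file:
        # Depending on the pdfplumber version, extract_text() may return None
        # for a page with no extractable text; fall back to "" so that the
        # concatenation does not raise TypeError.
        return "".join(page.extract_text() or "" for page in pdf_file.pages)


pdf_text = read_pdf("C:/path/to/pdf")
print(pdf_text)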

PDF file

https://chestertondocs.chesterton.com/Lubricants/218(E)%20HDP_B-NO.pdf

Expected behavior

Extract text from each page of the pdf

Actual behavior

(cid:4)(cid:5)(cid:6)(cid:6)(cid:7)(cid:8)(cid:9)(cid:7)(cid:10)(cid:11)(cid:12)(cid:1)(cid:10)(cid:1)(cid:13)(cid:14)(cid:1)(cid:12)
(cid:6)(cid:7)(cid:8)(cid:3)(cid:9)(cid:8)(cid:10)(cid:11)(cid:12)(cid:7)(cid:13)(cid:6)(cid:11)(cid:7)(cid:14)(cid:10)(cid:15)(cid:10)(cid:15)(cid:12)(cid:9)(cid:6)(cid:9)(cid:16)(cid:7)(cid:17)(cid:18)(cid:19)(cid:20)(cid:7)(cid:9)(cid:15)(cid:21)(cid:7)(cid:22)(cid:23)(cid:24)(cid:25)(cid:26)(cid:27)(cid:24)(cid:24)(cid:28)
(cid:15)(cid:16)(cid:17)(cid:18)(cid:19)(cid:20)(cid:21)(cid:22)(cid:23)(cid:24)
(cid:29)(cid:3)(cid:30)(cid:6)(cid:31)(cid:32)(cid:10)(cid:9)(cid:31)(cid:12)(cid:1)(cid:13)(cid:10)(cid:33)(cid:7)(cid:27)(cid:25)(cid:21)(cid:24)(cid:27)(cid:21)(cid:27)(cid:24)(cid:22)(cid:23)(cid:7) (cid:34)(cid:6)(cid:12)(cid:3)(cid:7)(cid:22)(cid:7)(cid:1)(cid:30)(cid:7)(cid:22)(cid:35)
(cid:3)(cid:25)(cid:4)(cid:26)(cid:27)(cid:28)(cid:28)(cid:21)(cid:16)(cid:29)(cid:21)(cid:27)(cid:12)(cid:7)(cid:30)(cid:10)(cid:5)(cid:31)(cid:5)(cid:6)(cid:1)(cid:11)(cid:32)(cid:33)(cid:30)(cid:21)(cid:1)(cid:34)(cid:21)(cid:11)(cid:10)(cid:33)(cid:31)(cid:31)(cid:7)(cid:10)(cid:35)(cid:11)(cid:10)(cid:33)(cid:31)(cid:31)(cid:13)(cid:14)(cid:1)(cid:30)(cid:12)(cid:5)(cid:30)(cid:36)(cid:7)(cid:30)(cid:21)(cid:33)(cid:36)(cid:21)(cid:1)(cid:34)(cid:21)(cid:11)(cid:7)(cid:14)(cid:11)(cid:6)(cid:1)(cid:37)(cid:7)(cid:10)(cid:35)(cid:31)(cid:33)(cid:8)(cid:7)(cid:10)(cid:1)(cid:6)(cid:7)(cid:10)
(cid:16)(cid:38)(cid:16)(cid:38)(cid:21)(cid:24)(cid:8)(cid:33)(cid:12)(cid:39)(cid:6)(cid:10)(cid:5)(cid:12)(cid:7)(cid:30)(cid:10)(cid:5)(cid:31)(cid:5)(cid:6)(cid:1)(cid:10)(cid:33)(cid:8)
(cid:1)(cid:2)(cid:3)(cid:4)(cid:5)(cid:6)(cid:7)(cid:8)(cid:9)(cid:10)

Screenshots

Won't help.

Environment

Python version: 3.7
OS: Windows 10 (without admin rights)

requirements.txt:

atomicwrites==1.4.0
attrs==19.3.0
Automat==0.8.0
bcrypt==3.1.7
brotlipy==0.7.0
certifi==2020.6.20
cffi==1.14.0
colorama==0.4.3
constantly==15.1.0
cryptography==2.9.2
cssselect==1.1.0
hyperlink==19.0.0
idna @ file:///tmp/build/80754af9/idna_1593446292537/work
importlib-metadata @ file:///C:/ci/importlib-metadata_1593446525189/work
incremental==17.5.0
lxml @ file:///C:/ci/lxml_1594826938446/work
more-itertools==8.4.0
packaging==20.4
parsel==1.5.2
pluggy==0.13.1
py @ file:///tmp/build/80754af9/py_1593446248552/work
pyasn1==0.4.8
pyasn1-modules==0.2.7
pycparser @ file:///tmp/build/80754af9/pycparser_1594388511720/work
PyDispatcher==2.0.5
PyHamcrest @ file:///tmp/build/80754af9/pyhamcrest_1594390921726/work
pyOpenSSL @ file:///tmp/build/80754af9/pyopenssl_1594392929924/work
pyparsing==2.4.7
PySocks @ file:///C:/ci/pysocks_1594394709107/work
pytest==5.4.3
pytest-runner==5.2
pywin32==227
queuelib==1.5.0
Scrapy==1.6.0
selenium @ file:///C:/ci/selenium_1594408106746/work
service-identity==18.1.0
six==1.15.0
Twisted==20.3.0
urllib3==1.25.9
w3lib==1.21.0
wcwidth @ file:///tmp/build/80754af9/wcwidth_1593447189090/work
win-inet-pton==1.1.0
wincertstore==0.2
zipp==3.1.0
zope.interface==4.7.1

Additional context

Text from 75% of the other PDFs from the same source is extracted as expected; the remaining 25% have this problem. It might be caused by the encoding of the files.
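
One way to separate the affected files from the rest is to measure how much of the extracted text consists of `(cid:N)` placeholders like the ones in the output above. The following is only a minimal sketch: the helper names `cid_ratio` and `looks_unmapped` and the 0.5 threshold are illustrative assumptions, not part of pdfplumber.

```python
import re

# Matches the "(cid:N)" placeholders seen in the output above.
CID_PATTERN = re.compile(r"\(cid:\d+\)")


def cid_ratio(text):
    """Return the fraction of characters in `text` that belong to (cid:N) placeholders."""
    if not text:
        return 0.0
    placeholder_chars = sum(len(m.group(0)) for m in CID_PATTERN.finditer(text))
    return placeholder_chars / len(text)


def looks_unmapped(text, threshold=0.5):
    """Heuristic (assumed threshold): flag text in which placeholders dominate."""
    return cid_ratio(text) >= threshold
```

Calling `looks_unmapped(read_pdf(path))` for each file would then give a rough list of the PDFs that need a different approach. The threshold is a guess and may need tuning for documents that mix readable and unmapped fonts.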

samkit-jain commented 4 years ago

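To extract text from a font that uses Identity-H encoding, the PDF needs to include a "ToUnicode CMap". This PDF does not appear to have one: when you copy and paste text from it, the result is gibberish. Without that map, the character codes cannot be translated to Unicode, which is why the output consists of (cid:N) placeholders.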

Reference: https://tex.stackexchange.com/a/526168

I am closing this issue since it is more related to how the PDF is created and there's not much that pdfplumber can do about it.
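
To see which fonts in a given file are affected, the per-character data can be grouped by font name. This is a rough sketch rather than a pdfplumber feature: the function name `fonts_with_unmapped_glyphs` is hypothetical, and it only assumes that each character is a dict with `"text"` and `"fontname"` keys, which is the shape of the entries in pdfplumber's `page.chars`.

```python
from collections import Counter


def fonts_with_unmapped_glyphs(chars):
    """Count, per font name, the characters that came back as (cid:N) placeholders.

    `chars` is an iterable of dicts with "text" and "fontname" keys,
    the shape pdfplumber uses for entries in `page.chars`.
    """
    counts = Counter()
    for char in chars:
        # Characters without a usable Unicode mapping show up as "(cid:N)".
        if char.get("text", "").startswith("(cid:"):
            counts[char.get("fontname", "unknown")] += 1
    return dict(counts)
```

With a PDF opened via `pdfplumber.open(path)`, passing `pdf.pages[0].chars` to this function should show whether the problem is limited to particular embedded fonts on that page.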