Can't extract all text from one file

SteveSmirnoff commented 4 years ago

What are you trying to do?

I'm extracting text from Safety Data Sheets of different suppliers.

What code are you using to do it?

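import pdfplumber


def read_pdf(path):
    """Return the text of every page in the PDF at `path`, one page per block."""
    with pdfplumber.open(path) as pdf_file:
        # extract_text() may return None for a page without extractable text
        # (older pdfplumber releases do), so fall back to an empty string
        # instead of catching the TypeError that `text += None` would raise.
        # Pages are joined with a newline so they do not run into each other.
        return "\n".join(page.extract_text() or "" for page in pdf_file.pages)


pdf_text = read_pdf("C:/path/to/pdf")
print(pdf_text)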

PDF file

https://www.glava.no/produkter/gulv-etasjeskiller/glava-tetningsmasse/_/attachment/download/6e84b64d-f265-4ec2-828b-259680db8239:28ff832ab1d2d7245dbc84c5babc0d0598fd6c37/sikkerhetsdatablad-glava-tetningsmasse-komponent-b.pdf

Expected behavior

Extract text from each page of the pdf (including the first one)

Actual behavior

The first page (pdf_file.pages[0].extract_text()) was not extracted, except for the following lines:

GLAVA® Tetningsmasse, komponent B Side 1 av 10 Dette Sikkerhetsdatablad er utarbeidet i Eco Publisher (EcoOnline)
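One way to narrow a problem like this down is to check how much text each page yields. The sketch below is an illustration and not part of the original report: `pages_with_little_text`, `report`, and the `min_chars` threshold are hypothetical names, and the helper only assumes that each page object has an `extract_text()` method, as pdfplumber pages do.

```python
def pages_with_little_text(pages, min_chars=200):
    """Return (index, char_count) for every page whose extracted text is short.

    `pages` is any sequence of objects with an extract_text() method,
    for example `pdf.pages` from pdfplumber. `min_chars` is an arbitrary,
    illustrative threshold, not a pdfplumber setting.
    """
    suspicious = []
    for index, page in enumerate(pages):
        text = page.extract_text() or ""  # treat None as "no text"
        if len(text) < min_chars:
            suspicious.append((index, len(text)))
    return suspicious


def report(path):
    """Print the pages of the PDF at `path` that yield suspiciously little text."""
    import pdfplumber  # third-party; imported lazily so the helper above stays standalone

    with pdfplumber.open(path) as pdf:
        for index, count in pages_with_little_text(pdf.pages):
            print(f"page {index}: only {count} characters extracted")
```

Calling `report("C:/path/to/pdf")` on the file from this issue would be expected to flag page 0, since only two short lines come back for it.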

Environment

Python version: 3.7
OS: Windows 10 (without admin rights)

requirements.txt:

atomicwrites==1.4.0
attrs==19.3.0
Automat==0.8.0
bcrypt==3.1.7
brotlipy==0.7.0
certifi==2020.6.20
cffi==1.14.0
colorama==0.4.3
constantly==15.1.0
cryptography==2.9.2
cssselect==1.1.0
hyperlink==19.0.0
idna @ file:///tmp/build/80754af9/idna_1593446292537/work
importlib-metadata @ file:///C:/ci/importlib-metadata_1593446525189/work
incremental==17.5.0
lxml @ file:///C:/ci/lxml_1594826938446/work
more-itertools==8.4.0
packaging==20.4
parsel==1.5.2
pluggy==0.13.1
py @ file:///tmp/build/80754af9/py_1593446248552/work
pyasn1==0.4.8
pyasn1-modules==0.2.7
pycparser @ file:///tmp/build/80754af9/pycparser_1594388511720/work
PyDispatcher==2.0.5
PyHamcrest @ file:///tmp/build/80754af9/pyhamcrest_1594390921726/work
pyOpenSSL @ file:///tmp/build/80754af9/pyopenssl_1594392929924/work
pyparsing==2.4.7
PySocks @ file:///C:/ci/pysocks_1594394709107/work
pytest==5.4.3
pytest-runner==5.2
pywin32==227
queuelib==1.5.0
Scrapy==1.6.0
selenium @ file:///C:/ci/selenium_1594408106746/work
service-identity==18.1.0
six==1.15.0
Twisted==20.3.0
urllib3==1.25.9
w3lib==1.21.0
wcwidth @ file:///tmp/build/80754af9/wcwidth_1593447189090/work
win-inet-pton==1.1.0
wincertstore==0.2
zipp==3.1.0
zope.interface==4.7.1

Additional context

Text from other PDFs from the same source seems to be extracted as expected.

samkit-jain commented 4 years ago

Hi @SmirnovStepan, thank you for your interest in the library and for sharing the PDF along with reproducible code. You need to repair the PDF first. Here's how to do it using Ghostscript:

gs -o "output.pdf" -sDEVICE=pdfwrite input.pdf
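If the repair has to run as part of a Python script, the same command can be issued through `subprocess`. This is only a sketch under assumptions: `build_gs_command` and `repair_pdf` are hypothetical helper names, and it assumes a Ghostscript executable is on the PATH (named `gs` on Linux and macOS; on Windows the console executable is usually called `gswin64c`).

```python
import shutil
import subprocess


def build_gs_command(src, dst, gs_binary="gs"):
    """Build the Ghostscript command shown above as an argument list."""
    return [gs_binary, "-o", dst, "-sDEVICE=pdfwrite", src]


def repair_pdf(src, dst, gs_binary="gs"):
    """Rewrite `src` into `dst` with Ghostscript and return `dst`.

    Raises FileNotFoundError if the executable cannot be found and
    subprocess.CalledProcessError if Ghostscript exits with an error.
    """
    if shutil.which(gs_binary) is None:
        raise FileNotFoundError(f"Ghostscript executable not found: {gs_binary}")
    subprocess.run(build_gs_command(src, dst, gs_binary), check=True)
    return dst
```

The repaired file can then be passed to `pdfplumber.open()` as usual, for example `pdfplumber.open(repair_pdf("input.pdf", "output.pdf"))`.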

Here is the repaired file.

This is the result of running .extract_text() on the first page:

GLAVA® Tetningsmasse, komponent B Side 1 av 10

 SIKKERHETSDATABLAD
GLAVA® Tetningsmasse, 
komponent B

Sikkerhetsdatabladet er i samsvar med Kommisjonsforordning (EU) 2015/830 av 28 mai 2015 om endring av 
europaparlaments- og rådsforordning (EF) nr. 1907/2006 om registrering, vurdering, godkjenning og begrensning 
av kjemikalier (REACH)

AVSNITT 1: IDENTIFIKASJON AV STOFFET/STOFFBLANDINGEN OG AV 
SELSKAPET/FORETAKET
Utgitt dato 01.09.2016
1.1. Produktidentifikator
Kjemikaliets navn GLAVA® Tetningsmasse, komponent B
Synonymer Tettingsmasse, komp B
1.2. Identifiserte relevante bruksområder for stoffet eller stoffblandingen og bruk 
som det advares mot
Funksjon Produkt for radonsikring
1.3. Opplysninger om leverandøren av sikkerhetsdatabladet
Firmanavn Glava AS
Postadresse Nybråtveien 2 
Postnr. 1801
Poststed ASKIM
Land NORGE
Telefon 69818400
Telefaks 69818478
E-post lise.gunn.skretteberg@glava.no
Hjemmeside http://www.glava.no
1.4. Nødtelefonnummer
Nødtelefon 112 / Giftinformasjonen:(+47) 22 59 13 00

AVSNITT 2: FAREIDENTIFIKASJON
2.1. Klassifisering av stoffet eller stoffblandingen
Klassifisering merknader Acute Tox 4: Akutt giftighet.
Carc 2: Mulig fare for kreft.
Eye Irrit. 2: Alvorlig øyeirritasjon.
Resp sens 1: Sensibiliserende ved innånding.
Skinn Irrit. 2: Irriterende for huden.
Skin Sens 1: Sensibiliserende ved hudkontakt.
STOT SE 3: Spesifikk målorgantoksisitet (cid:8211) enkelteksponering.
STOT SE 2: Spesifikk målorgantoksisitet (cid:8211) gjentatt eksponering
Klassifisering i henhold til CLP (EC)  H315
No 1272/2008 [CLP/GHS] H317
H319
H332
H334
H335
Dette Sikkerhetsdatablad er utarbeidet i Eco Publisher (EcoOnline)

samkit-jain commented 4 years ago

Some more info: running pdftotext on the PDF printed the following:

Syntax Error (68): Illegal character <73> in hex string
Syntax Error (70): Illegal character <72> in hex string
Syntax Error (71): Illegal character <69> in hex string
Syntax Error (72): Illegal character <70> in hex string
Syntax Error (73): Illegal character <74> in hex string
Syntax Error (75): Illegal character <74> in hex string
Syntax Error (76): Illegal character <79> in hex string
Syntax Error (77): Illegal character <70> in hex string
Syntax Error (79): Illegal character <3d> in hex string
Syntax Error (80): Illegal character <22> in hex string
Syntax Error (81): Illegal character <74> in hex string
Syntax Error (83): Illegal character <78> in hex string
Syntax Error (84): Illegal character <74> in hex string
Syntax Error (85): Illegal character <2f> in hex string
Syntax Error (86): Illegal character <6a> in hex string
Syntax Error (88): Illegal character <76> in hex string
Syntax Error (90): Illegal character <73> in hex string
Syntax Error (92): Illegal character <72> in hex string
Syntax Error (93): Illegal character <69> in hex string
Syntax Error (94): Illegal character <70> in hex string
Syntax Error (95): Illegal character <74> in hex string
Syntax Error (96): Illegal character <22> in hex string
.
.
.
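The values in angle brackets are character codes in hexadecimal, and the numbers in parentheses are positions in the file, so the log can be decoded to see what text the parser tripped over. The sketch below is illustrative only (`reconstruct` and `ERROR_RE` are hypothetical names): it places each reported character at its position and marks unreported positions with `?`. Those gaps are characters the parser did not complain about, such as valid hex digits (`a` to `f`) and whitespace.

```python
import re

# One pass of the pdftotext messages quoted above.
LOG = """\
Syntax Error (68): Illegal character <73> in hex string
Syntax Error (70): Illegal character <72> in hex string
Syntax Error (71): Illegal character <69> in hex string
Syntax Error (72): Illegal character <70> in hex string
Syntax Error (73): Illegal character <74> in hex string
Syntax Error (75): Illegal character <74> in hex string
Syntax Error (76): Illegal character <79> in hex string
Syntax Error (77): Illegal character <70> in hex string
Syntax Error (79): Illegal character <3d> in hex string
Syntax Error (80): Illegal character <22> in hex string
Syntax Error (81): Illegal character <74> in hex string
Syntax Error (83): Illegal character <78> in hex string
Syntax Error (84): Illegal character <74> in hex string
Syntax Error (85): Illegal character <2f> in hex string
Syntax Error (86): Illegal character <6a> in hex string
Syntax Error (88): Illegal character <76> in hex string
Syntax Error (90): Illegal character <73> in hex string
Syntax Error (92): Illegal character <72> in hex string
Syntax Error (93): Illegal character <69> in hex string
Syntax Error (94): Illegal character <70> in hex string
Syntax Error (95): Illegal character <74> in hex string
Syntax Error (96): Illegal character <22> in hex string
"""

ERROR_RE = re.compile(
    r"Syntax Error \((\d+)\): Illegal character <([0-9a-fA-F]{2})> in hex string"
)


def reconstruct(log_text, gap="?"):
    """Place each reported character at its file position; fill gaps with `gap`."""
    chars = {}
    for match in ERROR_RE.finditer(log_text):
        position = int(match.group(1))
        chars[position] = chr(int(match.group(2), 16))
    if not chars:
        return ""
    return "".join(chars.get(p, gap) for p in range(min(chars), max(chars) + 1))


print(reconstruct(LOG))  # prints s?ript?typ?="t?xt/j?v?s?ript"
```

With the gaps filled by `c`, `e`, `a`, and a space, the result reads `script type="text/javascript"`, which suggests the parser hit an HTML script tag and tried to read it as a PDF hex string.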

So I opened the PDF in a text editor (Notepad++, for example) and found the following at the top:

                <script type="text/javascript">
                   $(document).ready( function() {
                                   $('<link href="/ecosuite/css/eco_sk_common.css" type="text/css" rel="stylesheet">').appendTo("head");
                                   //                                   $('<link href="/ecosuite/css/sdseditor.css" type="text/css" rel="stylesheet">').appendTo("head");
                             //      $('<script src="/ecosuite/usrinc/js/../../js/lib/jquery-1.10.2.js" type="text/javascript">').appendTo("head");

                       //in case the css didnt load, load it here 
                    //if (!$("link[href='/ecosuite/css/eco_sk_ext.css']").length)
                    //   $('<link href="/ecosuite/css/eco_sk_ext.css" type="text/css" rel="stylesheet">').appendTo("head");
                    if (!$("link[href='/ecosuite/css/eco_sk_common.css']").length)
                        $('<link href="/ecosuite/css/eco_sk_common.css" type="text/css" rel="stylesheet">').appendTo("head");
                    if (!$("link[href='/ecosuite/css/company_style.css']").length)
                        $('<link href="/ecosuite/css/company_style.css" type="text/css" rel="stylesheet">').appendTo("head");  
                   });
                 </script>

Hence, I repaired the PDF using Ghostscript. The repaired version no longer contains this code, which might have been what broke text extraction.
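Files like this can be spotted before extraction by checking whether the file actually begins with the `%PDF-` header. The following is a minimal sketch, assuming the stray markup sits in front of the header as it appears to here. `pdf_header_offset`, `has_leading_junk`, and `probe_size` are hypothetical names, and the check only detects the problem; it does not repair anything.

```python
def pdf_header_offset(data):
    """Return the byte offset of the first '%PDF-' marker in `data`, or -1."""
    return data.find(b"%PDF-")


def has_leading_junk(path, probe_size=4096):
    """Return True if the file does not start with the '%PDF-' header.

    That covers both junk in front of the header and files that are not
    PDFs at all (no header within the first `probe_size` bytes).
    """
    with open(path, "rb") as handle:
        head = handle.read(probe_size)
    return pdf_header_offset(head) != 0
```

A script could call `has_leading_junk(path)` first and only send the file through the Ghostscript repair when it returns True.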

samkit-jain commented 4 years ago

I am closing this issue for now, since the workaround of repairing the PDF works. The underlying problem might be better reported to pdfminer, which pdfplumber uses behind the scenes.

jsvine commented 4 years ago

Many thanks for looking into this, @samkit-jain 👍

jsvine / pdfplumber