SteveSmirnoff closed 4 years ago
Hi @SmirnovStepan, thank you for your interest in the library and for sharing the PDF along with reproducible code. The PDF needs to be repaired first. Here's how to do that with Ghostscript:
gs -o "output.pdf" -sDEVICE=pdfwrite input.pdf
Here is the repaired file.
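If the repair has to run from a script rather than a shell, the same Ghostscript call can be wrapped in Python. This is a minimal sketch, not part of pdfplumber: the helper names `build_gs_command` and `repair_pdf` are made up for illustration, and it assumes a Ghostscript executable is on the PATH (`gs`, or `gswin64c` on a typical 64-bit Windows install).

```python
import shutil
import subprocess
from typing import List


def build_gs_command(src: str, dst: str, gs: str = "gs") -> List[str]:
    """Build the same command as above: gs -o <dst> -sDEVICE=pdfwrite <src>."""
    return [gs, "-o", dst, "-sDEVICE=pdfwrite", src]


def repair_pdf(src: str, dst: str) -> None:
    """Rewrite `src` through Ghostscript's pdfwrite device into `dst`."""
    # Assumption: the executable is named gs (Linux/macOS) or gswin64c (Windows).
    gs = shutil.which("gs") or shutil.which("gswin64c")
    if gs is None:
        raise FileNotFoundError("Ghostscript executable not found on PATH")
    # check=True raises CalledProcessError if Ghostscript exits with an error.
    subprocess.run(build_gs_command(src, dst, gs), check=True)
```

After `repair_pdf("input.pdf", "output.pdf")`, the repaired file can be opened with `pdfplumber.open("output.pdf")` and read with `.pages[0].extract_text()` as usual.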
This is the result of running .extract_text() on the first page of the repaired file:
GLAVA® Tetningsmasse, komponent B Side 1 av 10
SIKKERHETSDATABLAD
GLAVA® Tetningsmasse,
komponent B
Sikkerhetsdatabladet er i samsvar med Kommisjonsforordning (EU) 2015/830 av 28 mai 2015 om endring av
europaparlaments- og rådsforordning (EF) nr. 1907/2006 om registrering, vurdering, godkjenning og begrensning
av kjemikalier (REACH)
AVSNITT 1: IDENTIFIKASJON AV STOFFET/STOFFBLANDINGEN OG AV
SELSKAPET/FORETAKET
Utgitt dato 01.09.2016
1.1. Produktidentifikator
Kjemikaliets navn GLAVA® Tetningsmasse, komponent B
Synonymer Tettingsmasse, komp B
1.2. Identifiserte relevante bruksområder for stoffet eller stoffblandingen og bruk
som det advares mot
Funksjon Produkt for radonsikring
1.3. Opplysninger om leverandøren av sikkerhetsdatabladet
Firmanavn Glava AS
Postadresse Nybråtveien 2
Postnr. 1801
Poststed ASKIM
Land NORGE
Telefon 69818400
Telefaks 69818478
E-post lise.gunn.skretteberg@glava.no
Hjemmeside http://www.glava.no
1.4. Nødtelefonnummer
Nødtelefon 112 / Giftinformasjonen:(+47) 22 59 13 00
AVSNITT 2: FAREIDENTIFIKASJON
2.1. Klassifisering av stoffet eller stoffblandingen
Klassifisering merknader Acute Tox 4: Akutt giftighet.
Carc 2: Mulig fare for kreft.
Eye Irrit. 2: Alvorlig øyeirritasjon.
Resp sens 1: Sensibiliserende ved innånding.
Skinn Irrit. 2: Irriterende for huden.
Skin Sens 1: Sensibiliserende ved hudkontakt.
STOT SE 3: Spesifikk målorgantoksisitet (cid:8211) enkelteksponering.
STOT SE 2: Spesifikk målorgantoksisitet (cid:8211) gjentatt eksponering
Klassifisering i henhold til CLP (EC) H315
No 1272/2008 [CLP/GHS] H317
H319
H332
H334
H335
Dette Sikkerhetsdatablad er utarbeidet i Eco Publisher (EcoOnline)
Some more info:
Running pdftotext on the original PDF printed the following:
Syntax Error (68): Illegal character <73> in hex string
Syntax Error (70): Illegal character <72> in hex string
Syntax Error (71): Illegal character <69> in hex string
Syntax Error (72): Illegal character <70> in hex string
Syntax Error (73): Illegal character <74> in hex string
Syntax Error (75): Illegal character <74> in hex string
Syntax Error (76): Illegal character <79> in hex string
Syntax Error (77): Illegal character <70> in hex string
Syntax Error (79): Illegal character <3d> in hex string
Syntax Error (80): Illegal character <22> in hex string
Syntax Error (81): Illegal character <74> in hex string
Syntax Error (83): Illegal character <78> in hex string
Syntax Error (84): Illegal character <74> in hex string
Syntax Error (85): Illegal character <2f> in hex string
Syntax Error (86): Illegal character <6a> in hex string
Syntax Error (88): Illegal character <76> in hex string
Syntax Error (90): Illegal character <73> in hex string
Syntax Error (92): Illegal character <72> in hex string
Syntax Error (93): Illegal character <69> in hex string
Syntax Error (94): Illegal character <70> in hex string
Syntax Error (95): Illegal character <74> in hex string
Syntax Error (96): Illegal character <22> in hex string
Syntax Error (68): Illegal character <73> in hex string
Syntax Error (70): Illegal character <72> in hex string
Syntax Error (71): Illegal character <69> in hex string
Syntax Error (72): Illegal character <70> in hex string
Syntax Error (73): Illegal character <74> in hex string
Syntax Error (75): Illegal character <74> in hex string
Syntax Error (76): Illegal character <79> in hex string
Syntax Error (77): Illegal character <70> in hex string
Syntax Error (79): Illegal character <3d> in hex string
Syntax Error (80): Illegal character <22> in hex string
Syntax Error (81): Illegal character <74> in hex string
Syntax Error (83): Illegal character <78> in hex string
Syntax Error (84): Illegal character <74> in hex string
Syntax Error (85): Illegal character <2f> in hex string
Syntax Error (86): Illegal character <6a> in hex string
Syntax Error (88): Illegal character <76> in hex string
Syntax Error (90): Illegal character <73> in hex string
Syntax Error (92): Illegal character <72> in hex string
Syntax Error (93): Illegal character <69> in hex string
Syntax Error (94): Illegal character <70> in hex string
Syntax Error (95): Illegal character <74> in hex string
Syntax Error (96): Illegal character <22> in hex string
Syntax Error (68): Illegal character <73> in hex string
Syntax Error (70): Illegal character <72> in hex string
Syntax Error (71): Illegal character <69> in hex string
Syntax Error (72): Illegal character <70> in hex string
Syntax Error (73): Illegal character <74> in hex string
Syntax Error (75): Illegal character <74> in hex string
Syntax Error (76): Illegal character <79> in hex string
Syntax Error (77): Illegal character <70> in hex string
Syntax Error (79): Illegal character <3d> in hex string
Syntax Error (80): Illegal character <22> in hex string
Syntax Error (81): Illegal character <74> in hex string
Syntax Error (83): Illegal character <78> in hex string
Syntax Error (84): Illegal character <74> in hex string
Syntax Error (85): Illegal character <2f> in hex string
Syntax Error (86): Illegal character <6a> in hex string
Syntax Error (88): Illegal character <76> in hex string
.
.
.
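The hex codes in these messages are printable ASCII, which hints at what pdftotext is tripping over. Below is a small sketch that decodes the first 22 messages and lays the characters out by the number in parentheses. It assumes that number is a position in the file; the helper names are invented for this example.

```python
import re
from typing import Dict

# First 22 messages from the pdftotext output above.
LOG = """\
Syntax Error (68): Illegal character <73> in hex string
Syntax Error (70): Illegal character <72> in hex string
Syntax Error (71): Illegal character <69> in hex string
Syntax Error (72): Illegal character <70> in hex string
Syntax Error (73): Illegal character <74> in hex string
Syntax Error (75): Illegal character <74> in hex string
Syntax Error (76): Illegal character <79> in hex string
Syntax Error (77): Illegal character <70> in hex string
Syntax Error (79): Illegal character <3d> in hex string
Syntax Error (80): Illegal character <22> in hex string
Syntax Error (81): Illegal character <74> in hex string
Syntax Error (83): Illegal character <78> in hex string
Syntax Error (84): Illegal character <74> in hex string
Syntax Error (85): Illegal character <2f> in hex string
Syntax Error (86): Illegal character <6a> in hex string
Syntax Error (88): Illegal character <76> in hex string
Syntax Error (90): Illegal character <73> in hex string
Syntax Error (92): Illegal character <72> in hex string
Syntax Error (93): Illegal character <69> in hex string
Syntax Error (94): Illegal character <70> in hex string
Syntax Error (95): Illegal character <74> in hex string
Syntax Error (96): Illegal character <22> in hex string
"""

LINE_RE = re.compile(
    r"Syntax Error \((\d+)\): Illegal character <([0-9a-f]{2})> in hex string"
)


def decode_illegal_chars(log: str) -> Dict[int, str]:
    """Map each reported position to the character named by its hex code."""
    return {int(pos): chr(int(code, 16)) for pos, code in LINE_RE.findall(log)}


def reconstruct(chars: Dict[int, str], gap: str = "?") -> str:
    """Lay the characters out by position; unreported positions become `gap`."""
    first, last = min(chars), max(chars)
    return "".join(chars.get(pos, gap) for pos in range(first, last + 1))


print(reconstruct(decode_illegal_chars(LOG)))
```

This prints `s?ript?typ?="t?xt/j?v?s?ript"`. The gaps are positions pdftotext did not complain about, which is what you would expect for characters such as `c`, `e`, `a` (valid hex digits) and whitespace. The pattern is consistent with an HTML tag like `<script type="text/javascript">` being parsed as a PDF hex string.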
So I opened the PDF in a text editor (Notepad++ works, for example) and found the following at the top:
<script type="text/javascript">
$(document).ready( function() {
$('<link href="/ecosuite/css/eco_sk_common.css" type="text/css" rel="stylesheet">').appendTo("head");
// $('<link href="/ecosuite/css/sdseditor.css" type="text/css" rel="stylesheet">').appendTo("head");
// $('<script src="/ecosuite/usrinc/js/../../js/lib/jquery-1.10.2.js" type="text/javascript">').appendTo("head");
//in case the css didnt load, load it here
//if (!$("link[href='/ecosuite/css/eco_sk_ext.css']").length)
// $('<link href="/ecosuite/css/eco_sk_ext.css" type="text/css" rel="stylesheet">').appendTo("head");
if (!$("link[href='/ecosuite/css/eco_sk_common.css']").length)
$('<link href="/ecosuite/css/eco_sk_common.css" type="text/css" rel="stylesheet">').appendTo("head");
if (!$("link[href='/ecosuite/css/company_style.css']").length)
$('<link href="/ecosuite/css/company_style.css" type="text/css" rel="stylesheet">').appendTo("head");
});
</script>
Hence, I repaired the PDF using Ghostscript. The repaired version no longer contains this script block, so the block may well be the reason text extraction failed on the original file.
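When processing many files from different suppliers, a quick pre-check can flag PDFs that carry this kind of HTML residue before extraction is attempted. The sketch below is only a heuristic under stated assumptions: the marker list, the 4096-byte window, and the function names are choices made for this example, not anything pdfplumber or pdfminer provides.

```python
from typing import List

# Hypothetical marker list: tags that have no business near the top of a PDF.
HTML_MARKERS = (b"<script", b"<link", b"<html", b"<!doctype")


def find_html_markers(data: bytes, window: int = 4096) -> List[str]:
    """Return the HTML-like markers present in the first `window` bytes."""
    head = data[:window].lower()
    return [marker.decode("ascii") for marker in HTML_MARKERS if marker in head]


def looks_contaminated(path: str, window: int = 4096) -> bool:
    """True if the start of the file at `path` contains any HTML-like marker."""
    with open(path, "rb") as fh:
        return bool(find_html_markers(fh.read(window), window))
```

Files for which `looks_contaminated(path)` returns `True` could be routed through the Ghostscript repair first, and the rest passed to pdfplumber directly. A legitimate PDF could in principle contain such a string, so treat a hit as a reason to inspect or repair the file, not as proof that it is broken.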
I am closing this issue for now, as the repair workaround works. The issue might be better suited for pdfminer, since that is what pdfplumber uses behind the scenes.
Many thanks for looking into this, @samkit-jain 👍
What are you trying to do?
I'm extracting text from Safety Data Sheets of different suppliers.
What code are you using to do it?
PDF file
https://www.glava.no/produkter/gulv-etasjeskiller/glava-tetningsmasse/_/attachment/download/6e84b64d-f265-4ec2-828b-259680db8239:28ff832ab1d2d7245dbc84c5babc0d0598fd6c37/sikkerhetsdatablad-glava-tetningsmasse-komponent-b.pdf
Expected behavior
Extract text from each page of the pdf (including the first one)
Actual behavior
The first page (pdf_file.pages[0].extract_text()) was not extracted, except for the following lines:
Environment
Python version: 3.7
OS: Windows 10 (without admin rights)
requirements.txt:
Additional context
Text from other PDFs from the same source seems to be extracted as expected.