timClicks / slate

The simplest way to extract text from PDFs in Python
http://timmcnamara.co.nz/
GNU General Public License v3.0

Text extraction fails on PDF with text watermark #30

Open dalenavi opened 8 years ago

dalenavi commented 8 years ago

Using slate installed with pip install slate==0.3 pdfminer==20110515, constructing slate.PDF raises the following AttributeError:

In [4]: pdf = slate.PDF(virtualFile)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-4-f5142e9f8ced> in <module>()
----> 1 pdf = slate.PDF(virtualFile)

/usr/local/lib/python2.7/dist-packages/slate/slate.pyc in __init__(self, file, password, just_text)
     47             self.metadata = self.doc.info
     48         if just_text:
---> 49             self._cleanup()
     50 
     51     def _cleanup(self):

/usr/local/lib/python2.7/dist-packages/slate/slate.pyc in _cleanup(self)
     55         PDF.
     56         """
---> 57         del self.device
     58         del self.doc
     59         del self.parser

AttributeError: device
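
For anyone reading the traceback: `del self.device` raises AttributeError when the instance has no `device` attribute at that point. Why it is missing for this PDF is not established here. The sketch below reproduces only the Python mechanism with a hypothetical `Extractor` class (not slate's actual code) and shows a guarded cleanup as one possible workaround. The `fail_early` flag and both method names are assumptions made up for illustration.

```python
class Extractor(object):
    """Hypothetical stand-in for a class whose setup may not complete."""

    def __init__(self, fail_early=False):
        # Assumption: on some inputs the attributes are never assigned.
        if not fail_early:
            self.device = object()
            self.doc = object()
            self.parser = object()

    def cleanup_unguarded(self):
        # Mirrors the pattern in the traceback: fails if an attribute is missing.
        del self.device
        del self.doc
        del self.parser

    def cleanup_guarded(self):
        # Deletes only the attributes that actually exist on the instance.
        for name in ("device", "doc", "parser"):
            if hasattr(self, name):
                delattr(self, name)


broken = Extractor(fail_early=True)
try:
    broken.cleanup_unguarded()
except AttributeError as exc:
    print("unguarded cleanup failed: %s" % exc)

broken.cleanup_guarded()  # no exception, nothing to delete
```

A guarded cleanup would only hide the crash; it would not by itself make the text extraction succeed.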

Using slate installed from the repository with pdfminer==20140328, slate.PDF executes without errors but returns an empty list, [].
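
Because the second failure mode is silent, it may help to check the result explicitly. This is a minimal sketch that assumes the extraction result behaves like a list of page strings; `has_extracted_text` is a hypothetical helper, not part of slate or pdfminer.

```python
def has_extracted_text(pages):
    """Return True if at least one page contains non-whitespace text."""
    return any(page.strip() for page in pages)


print(has_extracted_text([]))                  # False: no pages at all
print(has_extracted_text(["", "  \n"]))        # False: pages exist but are blank
print(has_extracted_text(["", "Some text"]))   # True
```

This treats "no pages" and "only blank pages" the same way, so a caller can flag either as a failed extraction.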

One of the many consistently failing PDFs: FailingDocument.pdf