timClicks / slate

The simplest way to extract text from PDFs in Python
http://timmcnamara.co.nz/
GNU General Public License v3.0

AttributeError when opening documents that contain chinese characters #7

Open Alexander-0x80 opened 10 years ago

Alexander-0x80 commented 10 years ago

When opening documents that contain Chinese/Thai characters, I get the following exception:

  File "create_index.py", line 16, in <module>
    pdf_data = slate.PDF(f)
  File "/home/alexander/dev/pdf_indexer/env/local/lib/python2.7/site-packages/slate/slate.py", line 49, in __init__
    self._cleanup()
  File "/home/alexander/dev/pdf_indexer/env/local/lib/python2.7/site-packages/slate/slate.py", line 57, in _cleanup
    del self.device
AttributeError: device

timClicks commented 9 years ago

Could you provide an example PDF so that I could create a test case?

bsmartt13 commented 9 years ago

I'm seeing this with the PDF linked below. However, the only combination of pdfminer/slate that I've gotten to work is slate==0.3 with pdfminer==20110515, so please keep that in mind.

Traceback (most recent call last):
...
    doc = slate.PDF(read_handle)
  File "/Users/bsmartt/reputation_env/lib/python2.7/site-packages/slate/slate.py", line 49, in __init__
    self._cleanup()
  File "/Users/bsmartt/reputation_env/lib/python2.7/site-packages/slate/slate.py", line 57, in _cleanup
    del self.device
AttributeError: device

Happens with this pdf: http://www.emc.com/collateral/white-papers/h12756-wp-shell-crew.pdf

At a glance I didn't notice any Chinese/Thai characters, but there are definitely some funky characters in there.

edit: omfg. Preview.app is coming to the rescue. "Without the proper password, you do not have permission to copy portions of this document. Enter the password to unlock copying from the document."... fuck my life, and have a great day @timClicks :+1:

bsmartt13 commented 9 years ago

I realize slate supports passing the password into the constructor; however, it would be cool to fail gracefully here (if we didn't know a password was needed, for example).

I'm in a position where it would be awesome if I could catch this as a different exception from the AttributeError shown above, so as to be certain it was due to password protection and not some other failure. Then I could notify the user: 'Hey, the PDF you gave us is password protected, please give us the password to continue.'

Thanks Tim
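For anyone landing here in the meantime, the request above (catching a dedicated exception instead of the bare AttributeError) could be approximated on the caller's side with a small wrapper. This is only a rough sketch, not anything slate provides: `PDFPasswordRequiredError` and `open_pdf` are made-up names, and treating `AttributeError: device` as "password needed" is a heuristic based on the tracebacks in this thread, so other failures could in principle look the same. The opener is passed in as an argument so the snippet does not depend on a particular slate version.

```python
class PDFPasswordRequiredError(Exception):
    """Hypothetical exception: the PDF appears to need a password."""


def open_pdf(handle, opener):
    """Call opener(handle) and translate the AttributeError seen in this
    thread into a dedicated exception.

    `opener` is any callable that builds the document object, for
    example slate.PDF (assumption: it is called with the file handle).
    """
    try:
        return opener(handle)
    except AttributeError as exc:
        # Heuristic only: the tracebacks above end in "AttributeError: device".
        # Any other AttributeError is re-raised unchanged.
        if "device" in str(exc):
            raise PDFPasswordRequiredError(
                "PDF may be password protected; retry with a password"
            )
        raise


# Possible usage (assumes slate is installed):
#
#     import slate
#     with open("some.pdf", "rb") as f:
#         try:
#             doc = open_pdf(f, slate.PDF)
#         except PDFPasswordRequiredError:
#             print("Please give us the password to continue.")
```

Passing the opener in keeps the wrapper testable without a real PDF, and a password can still be supplied by wrapping the constructor call in a lambda.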

timClicks commented 8 years ago

@bsmartt13 hey, sorry I've taken months to get back to you. Raising some kind of useful error when we encounter a PDF that expects a password seems very worthwhile.