pmaupin / pdfrw

pdfrw is a pure Python library that reads and writes PDFs

PdfReader reader fails in decryption #179

Open arshad01 opened 5 years ago

arshad01 commented 5 years ago

Hello

I am using pdfrw to read an encrypted file. The file does not need a password to open, and I can view it in Adobe Reader. When I open it with PdfReader, I get an exception.

$ python
Python 2.7.10 (default, Jan 30 2019, 03:22:04) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-23)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import pdfrw
>>> pdfrw.PdfReader('Encrypted.pdf', decrypt=True, decompress=True)
[WARNING] tokens.py:221 Did not find PDF object (197, 0) (line=2076, col=1, token='startxref')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/prometheus/pdfrw/lib/python2.7/site-packages/pdfrw/pdfreader.py", line 645, in __init__
    self._parse_encrypt_info(source, password, trailer)
  File "/home/prometheus/pdfrw/lib/python2.7/site-packages/pdfrw/pdfreader.py", line 499, in _parse_encrypt_info
    key = crypt.create_key(password, trailer)
  File "/home/prometheus/pdfrw/lib/python2.7/site-packages/pdfrw/crypt.py", line 31, in create_key
    key_size = int(doc.Encrypt.Length or 40) // 8
AttributeError: 'NoneType' object has no attribute 'Length'

It seems that the issue is caused by PdfReader failing to find object (197, 0), even though it is present in the PDF file. Object (197, 0) contains the encryption details.

Any help in solving this issue is greatly appreciated. Thanks

(Edit: Sample pdf can be downloaded from https://www.proofpoint.com/us/resources/white-papers/who-moved-my-data)

arshad01 commented 5 years ago

I have written a fix for this issue. Please check whether it is correct. Thanks.

Note: I could not run the unit tests successfully even without this change.
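
My understanding of the cause, as a toy sketch rather than pdfrw's actual code: a PDF can have several xref sections chained through /Prev, and if the offset table is reset on every pass of the loop, only the entries from the last section visited survive. All names and offsets here are hypothetical, and the "first entry wins" merge is an assumption made for the illustration.

```python
# Toy model only: these names and offsets are hypothetical, not part of pdfrw.
# Each xref section maps (object number, generation) -> byte offset.
# Sections are listed newest first, the order in which a reader
# visits them when following /Prev links.
XREF_SECTIONS = [
    {(197, 0): 1200, (1, 0): 15},   # newest section, holds object (197, 0)
    {(2, 0): 300, (3, 0): 450},     # older section reached through /Prev
]


def collect_offsets(sections, reset_each_pass):
    """Merge the offsets of all sections into one dict.

    With reset_each_pass=True the dict is recreated for every section,
    which mimics initialising it inside the loop.
    """
    offsets = {}
    for section in sections:
        if reset_each_pass:
            offsets = {}  # discards everything collected so far
        for key, offset in section.items():
            # Assumption: an entry from a newer section takes precedence.
            offsets.setdefault(key, offset)
    return offsets


buggy = collect_offsets(XREF_SECTIONS, reset_each_pass=True)
fixed = collect_offsets(XREF_SECTIONS, reset_each_pass=False)

print((197, 0) in buggy)  # False: only the oldest section's entries remain
print((197, 0) in fixed)  # True: entries from both sections are kept
```

With the reset inside the loop, the lookup for (197, 0) finds nothing, which would match the "Did not find PDF object (197, 0)" warning above. The diff below moves the initialisation out of the loop.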

$ git diff
diff --git a/pdfrw/pdfreader.py b/pdfrw/pdfreader.py
index c2ae030..621fff4 100644
--- a/pdfrw/pdfreader.py
+++ b/pdfrw/pdfreader.py
@@ -614,8 +614,8 @@ class PdfReader(PdfDict):
             # Find all the xref tables/streams, and
             # then deal with them backwards.
             xref_list = []
+            source.obj_offsets = {}
             while 1:
-                source.obj_offsets = {}
                 trailer, is_stream = self.parsexref(source)
                 prev = trailer.Prev
                 if prev is None: