py-pdf / pypdf

A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files
https://pypdf.readthedocs.io/en/latest/

Incorrect IV length (it must be 16 bytes long) #1659

Closed christopher5106 closed 1 year ago

christopher5106 commented 1 year ago

I get an error when running the following code on an encrypted PDF document (I had to install PyCryptodome):

        from PyPDF2 import PdfReader  # PyPDF2 is the package shown in the traceback

        reader = PdfReader(filepath)
        extracted_text = ""
        for page in reader.pages:
            extracted_text += page.extract_text()

I can't share my PDF file for security reasons.

This is the complete Traceback I see:

  File "venv/lib/python3.8/site-packages/PyPDF2/_page.py", line 1851, in extract_text
    return self._extract_text(
  File "venv/lib/python3.8/site-packages/PyPDF2/_page.py", line 1356, in _extract_text
    content = ContentStream(content, pdf, "bytes")
  File "venv/lib/python3.8/site-packages/PyPDF2/generic/_data_structures.py", line 867, in __init__
    data += b_(s.get_object().get_data())
  File "venv/lib/python3.8/site-packages/PyPDF2/generic/_base.py", line 259, in get_object
    obj = self.pdf.get_object(self)
  File "venv/lib/python3.8/site-packages/PyPDF2/_reader.py", line 1269, in get_object
    retval = self._encryption.decrypt_object(
  File "venv/lib/python3.8/site-packages/PyPDF2/_encryption.py", line 761, in decrypt_object
    return cf.decrypt_object(obj)
  File "venv/lib/python3.8/site-packages/PyPDF2/_encryption.py", line 185, in decrypt_object
    obj._data = self.stmCrypt.decrypt(obj._data)
  File "venv/lib/python3.8/site-packages/PyPDF2/_encryption.py", line 87, in decrypt
    aes = AES.new(self.key, AES.MODE_CBC, iv)
  File "venv/lib/python3.8/site-packages/Crypto/Cipher/AES.py", line 228, in new
    return _create_cipher(sys.modules[__name__], key, mode, *args, **kwargs)
  File "venv/lib/python3.8/site-packages/Crypto/Cipher/__init__.py", line 79, in _create_cipher
    return modes[mode](factory, **kwargs)
  File "venv/lib/python3.8/site-packages/Crypto/Cipher/_mode_cbc.py", line 287, in _create_cbc_cipher
    raise ValueError("Incorrect IV length (it must be %d bytes long)" %
ValueError: Incorrect IV length (it must be 16 bytes long)

For a reason I don't know, the length of iv at line 87 of venv/lib/python3.8/site-packages/PyPDF2/_encryption.py is zero. Would it make sense to add the line "if len(iv) == 0: return data" to avoid the error?

Thanks in advance

MartinThoma commented 1 year ago

Can you open the file + decrypt it with any other PDF reader (e.g. the one from Chrome)?

christopher5106 commented 1 year ago

Yes, I can

pubpub-zz commented 1 year ago

@christopher5106 Can you check what is inside data at _encryption.py:L87 when you get the error? I am wondering whether the check should instead be on len(data)==0.

pubpub-zz commented 1 year ago

@exiledkingcc Can you give your opinion?

christopher5106 commented 1 year ago

I don't understand the question. There is a problem in the reasoning. If IV=data[:16] and len(IV)==0, then what do you think the length of data is?
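To make the point concrete, here is a minimal sketch of how Python slicing behaves when fewer than 16 bytes are available. The helper name take_iv is hypothetical and only mirrors the iv = data[:16] line from the traceback; it is not part of PyPDF2.

```python
def take_iv(data: bytes) -> bytes:
    """Hypothetical helper mirroring `iv = data[:16]`; not a PyPDF2 function."""
    # Slicing never raises on short input; it returns whatever bytes exist.
    return data[:16]


# Try a full-size payload, a short payload, and an empty payload.
for sample in (bytes(32), b"\x01\x02\x03", b""):
    iv = take_iv(sample)
    print(f"len(data)={len(sample):2d} -> len(iv)={len(iv)}")
```

So an empty IV can only come from empty data, and any data shorter than 16 bytes yields an IV of that same short length.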

Anyway, probably I don't have enough imagination. So let's code it

print(type(data), len(data), data)
<class 'bytes'> 32 b'\x15\xd8\xf4\x9f<\x01<Q\x83g\x8c\x12j[|\xc0\x04\xfamU\xed\xec\x10\x10\x8cY&\xd6\xf2\x96\x9e\xb0'
<class 'bytes'> 0 b''

pubpub-zz commented 1 year ago

Can you test with this change:

        def decrypt(self, data: bytes) -> bytes:
            if len(data) == 0:
                return data  # nothing to decrypt, so skip the IV handling
            iv = data[:16]

christopher5106 commented 1 year ago

It works fine with this change.

pubpub-zz commented 1 year ago

Would you be willing to open a PR to complete your contribution?

christopher5106 commented 1 year ago

OK. Do I need to be added to push it to a branch?

pubpub-zz commented 1 year ago

Create a branch on your fork, make the changes, commit them, and push the branch to GitHub. Then, when you go to the PR web page, you should be offered the option to create a PR. 😉

christopher5106 commented 1 year ago

#1663

MartinThoma commented 1 year ago

The PR looks good, well done 👍 It's merged into main and will be part of the release tomorrow.

This issue will be fixed in pypdf > 3.4.1

exiledkingcc commented 1 year ago

> @exiledkingcc Can you give your opinion?

looks good to me

mrdschrute commented 1 year ago

I've run into this same error. I think it's because the fix returns from the decrypt function before the iv variable is initialized. I fixed it by checking the variable and initializing it if necessary. Then let the function complete. Not an ideal fix, but it works. The bigger question is why iv = data[:16] doesn't result in 16 bytes.


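        from Crypto.Cipher import AES
        from Crypto.Util.Padding import pad

        def decrypt(self, data: bytes) -> bytes:
            iv = data[:16]
            if len(iv) != 16:
                # Workaround: substitute a placeholder IV (16 ASCII "0" characters)
                iv = b"0000000000000000"
            data = data[16:]
            aes = AES.new(self.key, AES.MODE_CBC, iv)
            if len(data) % 16:
                data = pad(data, 16)  # pad up to a whole number of AES blocks
            d = aes.decrypt(data)
            if len(d) == 0:
                return d
            else:
                return d[: -d[-1]]  # strip padding; the last byte gives its length

pubpub-zz commented 1 year ago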

@mrdschrute your proposal does not seem to include the change from #1663. Have you updated pypdf to the latest version? Please make sure that you have moved from PyPDF2 to pypdf. Can you report whether pypdf 3.6.0 fixes your issue?

mrdschrute commented 1 year ago

> @mrdschrute your proposal does not seem to include the change from #1663. Have you updated pypdf to the latest version? Please make sure that you have moved from PyPDF2 to pypdf. Can you report whether pypdf 3.6.0 fixes your issue?

Yes, I am using pypdf 3.6.0. I removed the mod from #1663 and replaced it as shown to get it working. As I mentioned, it's probably not the best solution. The decrypt function seems to be called repeatedly while processing a PDF. The data length during those calls is occasionally less than 16, causing the issue.

pubpub-zz commented 1 year ago

🤔 By any chance, do you have a document you can share? You may send it to @MartinThoma at info@martin-thoma.de if you want to keep it private.

mrdschrute commented 1 year ago

I thought you might ask that. Unfortunately, the document is sensitive (and is not mine), so I can't share it. I also did not create it, so I can't detail that process. I realize that's all bad news. The encrypted version works fine in Adobe, and the fix I described above does remove the encryption.

pubpub-zz commented 1 year ago

@exiledkingcc any idea why it did not work?

mrdschrute commented 1 year ago

I realized that I may not have done the best job describing this. I think the problem occurs when len(data) is greater than 0 but less than 16. In that case, #1663 does not trigger because the length is not zero, but the code later on is expecting a 16-byte IV and gets disappointed.
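As a rough illustration of that gap (a sketch under my own assumptions, not pypdf's code): the names is_caught_by_empty_check and split_iv below are hypothetical. The first mirrors the len(data) == 0 check from #1663, and the second shows one possible stricter check that refuses any payload too short to contain a full IV.

```python
from typing import Optional, Tuple

IV_SIZE = 16  # AES block size in bytes; the CBC IV is expected to be this long


def is_caught_by_empty_check(data: bytes) -> bool:
    """Hypothetical mirror of the `len(data) == 0` check from #1663."""
    return len(data) == 0


def split_iv(data: bytes) -> Optional[Tuple[bytes, bytes]]:
    """Hypothetical stricter check: return (iv, ciphertext), or None if no full IV."""
    if len(data) < IV_SIZE:
        return None
    return data[:IV_SIZE], data[IV_SIZE:]


# Lengths 1 to 15 are not caught by the empty check, yet have no full IV.
for size in (0, 1, 15, 16, 32):
    data = bytes(size)
    print(size, is_caught_by_empty_check(data), split_iv(data) is not None)
```

What the caller should do when split_iv returns None (return the data unchanged, warn, or raise) is a separate design decision that this sketch leaves open.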

pubpub-zz commented 1 year ago

> I realized that I may not have done the best job describing this. I think the problem occurs when len(data) is greater than 0 but less than 16. In that case, #1663 does not trigger because the length is not zero, but the code later on is expecting a 16-byte IV and gets disappointed.

@exiledkingcc How would you process data whose length is between 1 and 15?

pubpub-zz commented 1 year ago

> > I realized that I may not have done the best job describing this. I think the problem occurs when len(data) is greater than 0 but less than 16. In that case, #1663 does not trigger because the length is not zero, but the code later on is expecting a 16-byte IV and gets disappointed.
>
> @exiledkingcc How would you process data whose length is between 1 and 15?

@exiledkingcc, +1?

exiledkingcc commented 1 year ago

AES is a block cipher, which always processes one block (16 bytes) at a time. Any data must be padded to a multiple of 16 bytes so that it can be encrypted in blocks, and the encrypted result is always a multiple of 16 bytes. Data whose length is not a multiple of 16 cannot be decrypted; usually this means the data is corrupted or is not AES-encrypted.
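The block arithmetic can be sketched in pure Python without any cipher. This assumes PKCS#7-style padding (which always adds at least one byte) and a 16-byte IV stored in front of the ciphertext, matching the data[:16] and d[: -d[-1]] lines quoted in this thread. The function names are hypothetical and not taken from pypdf.

```python
BLOCK_SIZE = 16  # AES block size in bytes


def pkcs7_pad(data: bytes, block_size: int = BLOCK_SIZE) -> bytes:
    """Pad to a whole number of blocks; always appends between 1 and block_size bytes."""
    pad_len = block_size - len(data) % block_size
    return data + bytes([pad_len]) * pad_len


def pkcs7_unpad(padded: bytes) -> bytes:
    """Strip padding by reading its length from the last byte (input must be non-empty)."""
    return padded[: -padded[-1]]


def has_plausible_length(payload: bytes) -> bool:
    """Assumed layout: 16-byte IV followed by one or more 16-byte ciphertext blocks."""
    return len(payload) >= 2 * BLOCK_SIZE and len(payload) % BLOCK_SIZE == 0


for plaintext in (b"", b"hello", bytes(16)):
    padded = pkcs7_pad(plaintext)
    assert pkcs7_unpad(padded) == plaintext
    print(len(plaintext), "->", len(padded))
```

Under these assumptions, the 32-byte value printed earlier in the thread has a plausible length, while a payload of 1 to 15 bytes does not.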

pubpub-zz commented 1 year ago

@mrdschrute Can you confirm that you are seeing data lengths between 1 and 15? Can you confirm that the objects are processed properly by well-known viewers?

pubpub-zz commented 1 year ago

@mrdschrute +1?

mrdschrute commented 1 year ago

Sorry for the delayed reply. Yes, I was seeing data lengths between 1 and 15. The objects are processed properly by well-known viewers. I actually ended up using PyMuPDF, which worked fine. I'll try to find some time to recreate the error with a file I can share.

pubpub-zz commented 1 year ago

I am closing this issue as there is no new input. Feel free to ask to reopen it when you have a test file.