pdfminer / pdfminer.six

Community maintained fork of pdfminer - we fathom PDF
https://pdfminersix.readthedocs.io
MIT License
5.85k stars 925 forks source link

Removing Pycryptodome Dependency: Massive bloat (extra 260 MB) #429

Closed corneliusroemer closed 4 years ago

corneliusroemer commented 4 years ago

Feature request

Trying to replace PyPDF2 with pdfminer.six for PDF text extraction in an Alpine Linux Python base container (python:alpine3.7) I hit a significant roadblock because of pdfminer's dependency on pycryptodome.

Pycryptodome installation requires gcc and a whole load of other dependencies that aren't present in the light base container (90 MB). It took me a while to figure out how to circumvent this problem, which is another reason to try to get rid of the dependency on pycryptodome (https://stackoverflow.com/a/35713086/7483211).

Installation of apk add gcc g++ make libffi-dev openssl-dev (requirements for pycryptodome) bloats the image from a mere 90MB to 270 MB.

Then, installing pdfminer.six on top bloats to 351MB. So just for pdfminer.six to be installed almost quadruples storage space.

PyPDF2 didn't really any noticeable extra storage.

I'm not sure what pycryptodome is being used for, if it's not mission-critical, it may be helpful to add it as an optional module, to be installed when actually used, rather than having it ship by default. I guess most people never need it.

I'm also not sure why the pdfminer.six installation is so big itself (seems to be around 80MB in my case). Maybe this can also be optimised further.

pietermarsman commented 4 years ago

Hi @corneliusroemer, thanks for this issue!

I'm focussing on the pycryptodome stuff in this response. If you experience the install size of pdfminer.six as a problem, then you should create a separate issue for that.

On the pypi page of pycryptodome it is explicitly mentioned that it is a self-contained package:

PyCryptodome is a self-contained Python package of low-level cryptographic primitives.

Do you know why you need to install these c libraries then?

Impact As far as I can tell the Crypto package (that is installed if you install pycryptodome) is only used in pdfdocument.py. It imports 3 different algorithms: SHA256, ARC4 and AES. These are used for decryption of the document (e.g. when it is locked with a password).

The SHA256 and AES algorithms are only used when they are successfully imported. And the ARC4 has a backup implementation in arcfour.py. So it looks like pdfminer.six already supports running without pycryptodome.

SHA256 can be replaced by hashlib.sha256. AES can be replaced with cryptography.fernet.Fernet or with rijndael.py (in pdfminer.six but never used).

Things we could do

  1. Not install pycryptodome by default and raise an error that explains the situation when it must be used.
    • Easy since pdfminer.six already asumes that the package is optional
  2. Replace pycryptodome with other (pure python?) implementations of the encryption algorithms.
    • We have test pdf's with aes and arc4 encryption so it's easy to test.
    • I prefer to using the standard most commonly used encryption package for python. Not sure if that is pycryptodome or cryptography.
lithiumFlower commented 4 years ago

Do you know why you need to install these c libraries then?

He has to install the compiler toolchains and dependent C libs because he's using Alpine linux. Most linux distributions use the GNU version of the standard C library (glibc). Alpine linux uses musl. So any pre-compiled C code (pycryptodome C extensions in a wheel package) that links against glibc won't work there.

The result is that pycryptodome's C extensions must be compiled according to the rules of its setup.py. I'm assuming that's what is using the tools he mentioned he had to install.

All that may be the case, but most python libraries with C extensions are going to have this problem. It's just the pain of using Alpine. Some however have precompiled packages specifically for Alpine (like cryptography).

pietermarsman commented 4 years ago

Proposal:

This removes any dependency on pycryptome and adds a dependency on cryptography.

lithiumFlower commented 4 years ago

@pietermarsman That seems like the appropriate approach if you don't want to support AES encrypted pdfs by default.

I would suggest moving the arc4 routines to use cryptography as well, and to remove the arcfour.py module. This would be faster and more consistent. I can do this in my associated PR if you want.

W.r.t arcfour; Normally there would be a security concern of using your self-implemented version of the encryption routine but you're only decrypting. There are still edge cases like an attacker that has access to the original encrypted document and attackable side-channel access to the machine doing decryption. Prima facia it doesn't seem like a big deal, but my advice would be along the lines of "better safe than sorry" with encryption.

corneliusroemer commented 4 years ago

Some clarification surrounding my claim of 260MB bloat when installing pdfminer: https://github.com/pdfminer/pdfminer.six/pull/456#issuecomment-657543648