pymupdf / PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
https://pymupdf.readthedocs.io
GNU Affero General Public License v3.0

Check the hash of the downloaded MuPDF tarball #3463

Open apyrgio opened 1 month ago

Is your feature request related to a problem? Please describe.

When building PyMuPDF from source, the default behavior is to download the MuPDF source tarball from the Internet:

https://github.com/pymupdf/PyMuPDF/blob/e6e1daa0e14cb08e15f1caab2d6bc794276d6909/setup.py#L389

However, this tarball is not verified against a signature or a hash. If the MuPDF tarball is modified, whether maliciously or unintentionally, the result is a non-reproducible PyMuPDF build, or a downright unsafe one.

Describe the solution you'd like

It would be a nice improvement to take advantage of the SHA-1 hashes published on the MuPDF downloads page: the build would compare the downloaded tarball against the expected hash and fail on a mismatch. This would make builds reproducible and add protection against supply chain attacks.

We could improve on this further by using SHA-256 hashes (since SHA-1 is considered unsafe) or PGP signatures.
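
To make the idea concrete, here is a minimal sketch of such a check. It is not taken from PyMuPDF's setup.py: the function names `file_digest` and `verify_tarball` are hypothetical, and it assumes the expected digest would be pinned in the source tree next to the download URL.

```python
import hashlib


def file_digest(path, algorithm="sha256", chunk_size=1 << 20):
    """Return the hex digest of the file at `path`, reading it in chunks."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        # Read fixed-size chunks so a large tarball is never fully in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def verify_tarball(path, expected_hex, algorithm="sha256"):
    """Raise ValueError if the file's digest differs from `expected_hex`.

    `expected_hex` is assumed to be a digest pinned in the repository,
    not one fetched at build time from the same site as the tarball.
    """
    actual = file_digest(path, algorithm)
    if actual != expected_hex.strip().lower():
        raise ValueError(
            f"{algorithm} mismatch for {path}: "
            f"expected {expected_hex}, got {actual}"
        )
```

The build would call `verify_tarball()` right after the download and before unpacking, so a mismatch aborts the build instead of compiling unverified sources. Passing `algorithm="sha1"` would cover the hashes that are published today.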

Describe alternatives you've considered

Users can:

  1. Download the MuPDF source locally.
  2. Check it against the SHA-1 hash on the website.
  3. Build PyMuPDF from source with the PYMUPDF_SETUP_MUPDF_TGZ environment variable pointing at the local tarball.
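
For illustration, the manual steps could be scripted roughly as follows. This is only a sketch: the helper name `build_env_for_verified_tarball` is hypothetical, and the expected SHA-1 is assumed to have been copied by hand from the downloads page.

```python
import hashlib
import os


def build_env_for_verified_tarball(tarball_path, expected_sha1, base_env=None):
    """Check a local MuPDF tarball, then return an environment for the build.

    Raises ValueError if the SHA-1 of the file does not match
    `expected_sha1`. Otherwise returns a copy of the environment with
    PYMUPDF_SETUP_MUPDF_TGZ set to the absolute path of the tarball.
    """
    with open(tarball_path, "rb") as f:
        actual = hashlib.sha1(f.read()).hexdigest()
    if actual != expected_sha1.strip().lower():
        raise ValueError(
            f"SHA-1 mismatch for {tarball_path}: "
            f"expected {expected_sha1}, got {actual}"
        )
    # Copy so that the caller's environment is left untouched.
    env = dict(os.environ if base_env is None else base_env)
    env["PYMUPDF_SETUP_MUPDF_TGZ"] = os.path.abspath(tarball_path)
    return env
```

The returned dict can be passed as `env=` to `subprocess.run()` when starting the source build.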

This approach has several drawbacks though:

  1. Environment flags defeat the purpose of reproducibility. A stale environment variable means that PyMuPDF will build against an older MuPDF source, and users will most likely not notice.
  2. Checking the SHA-1 hash in a browser before building a package is a weak defense against a compromised site. If the contents of the tarball can change, so can the SHA-1 advertised on the same page.
  3. It interrupts the common poetry lock -> poetry install (or equivalent) flow that is part of modern Python development.