Is your feature request related to a problem? Please describe.
When building PyMuPDF from source, the default behavior is to download the MuPDF source tarball from the Internet:
https://github.com/pymupdf/PyMuPDF/blob/e6e1daa0e14cb08e15f1caab2d6bc794276d6909/setup.py#L389
This tarball, however, is not verified against a signature or a hash. If the MuPDF tarball is modified, whether maliciously or unintentionally, the resulting PyMuPDF builds will be non-reproducible, or downright unsafe.
Describe the solution you'd like
It would be a nice improvement to verify the downloaded tarball against the SHA-1 hashes published on the MuPDF downloads page. This would ensure reproducibility and add protection against supply chain attacks.
We could improve on this further by using SHA-256 hashes (since SHA-1 is considered unsafe), or PGP signatures.
Describe alternatives you've considered
Users can:
1. Download the MuPDF source locally.
2. Check it against the SHA-1 hash on the website.
3. Build PyMuPDF from source using the PYMUPDF_SETUP_MUPDF_TGZ environment variable.
This approach has several drawbacks though:
- Environment variables work against reproducibility. A stale variable means that PyMuPDF will build against an older MuPDF source, and users will most likely not notice.
- Checking the SHA-1 hash in a browser before building a package is a weak defense if the site itself is compromised. If the contents of the tarball can change, so can the SHA-1 advertised on the same page.
- It interrupts the common poetry lock -> poetry install (or equivalent) flow that is part of modern Python development.
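To make the request concrete, here is a minimal sketch of what the check could look like, assuming the expected SHA-256 digest is pinned in the repository next to the MuPDF version. The names `sha256_of_file` and `verify_tarball` are hypothetical and are not taken from PyMuPDF's setup.py; only the standard library `hashlib` module is used.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in fixed-size chunks so large tarballs are not loaded into memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_tarball(path, expected_sha256):
    """Raise RuntimeError if the file does not match the pinned digest.

    `expected_sha256` is assumed to be a hex string committed to the
    repository together with the MuPDF version it belongs to.
    """
    actual = sha256_of_file(path)
    if actual != expected_sha256.lower():
        raise RuntimeError(
            f"Hash mismatch for {path}: expected {expected_sha256}, got {actual}"
        )
```

The build script would call `verify_tarball` right after the download and before extracting anything, so a modified tarball aborts the build instead of being compiled. Because the digest lives in the source tree rather than on the download page, a compromise of the site alone would not be enough to pass the check.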