tree-sitter / py-tree-sitter

Python bindings to the Tree-sitter parsing library
https://tree-sitter.github.io/py-tree-sitter/
MIT License

How to audit binary wheels? #249

Closed Akuli closed 3 months ago

Akuli commented 3 months ago

Given the recent events, I don't want to trust releases of open-source projects.

For example, pip install tree-sitter-python downloads and extracts a wheel that contains many Python files and one binary file _binding.abi3.so. How can I be sure that _binding.abi3.so was built from https://github.com/tree-sitter/tree-sitter-python without backdoors or other "great new features" added?

Here's what I did:

  1. Made a new private repo and pushed the contents of tree-sitter-python at tag v0.21.0 there.
  2. Used the git commit log to find the latest commit of tree-sitter/workflows at the time tree-sitter-python v0.21.0 was released.
  3. Copied the build steps from that commit of tree-sitter/workflows to my private fork, and disabled macos-13 because it no longer works.
  4. Ran the build in GitHub Actions on my private fork.
  5. Downloaded and extracted the wheel file built in GitHub Actions.
  6. Compared my binary file with the binary from PyPI: the file contents are exactly the same.
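
For what it's worth, the comparison in step 6 can be scripted so that nothing has to be extracted by hand. This is only a rough sketch: the function names and the list of binary suffixes are my own choices, not anything from tree-sitter or pip. It relies on a wheel being an ordinary zip archive and compares SHA-256 digests of the binary members only, since other wheel metadata may legitimately differ between builds.

```python
import hashlib
import zipfile

# Assumption: these suffixes cover the compiled extension modules in the wheel.
BINARY_SUFFIXES = (".so", ".pyd", ".dylib")


def binary_hashes(wheel_path):
    """Return {member name: SHA-256 hex digest} for each binary file in a wheel."""
    hashes = {}
    with zipfile.ZipFile(wheel_path) as wheel:
        for name in wheel.namelist():
            if name.endswith(BINARY_SUFFIXES):
                hashes[name] = hashlib.sha256(wheel.read(name)).hexdigest()
    return hashes


def same_binaries(wheel_a, wheel_b):
    """True if both wheels contain the same binary files with identical contents."""
    hashes_a = binary_hashes(wheel_a)
    hashes_b = binary_hashes(wheel_b)
    # An empty result means no binaries were found, which is not a successful audit.
    return bool(hashes_a) and hashes_a == hashes_b
```

Calling same_binaries() with the path of the wheel from PyPI and the path of the wheel from my own build would replace the manual diff. It only proves the two builds match; it says nothing about whether the build steps themselves are trustworthy.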

This works, but it's a pain, especially if I use many tree-sitter-foo packages that are all built individually. It also doesn't work if the build used something that no longer exists, e.g. macos-13.

Maybe we could print the hashes of all source files at the start of the build, and print the hash of the binary file at the end of the build? This would make verifying much easier.
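
To make the idea concrete, here is a minimal sketch of what such a build step could run. Everything in it is hypothetical: the helper names, the glob patterns, and the output format (which just imitates sha256sum) are assumptions on my part, not part of tree-sitter/workflows.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hash_lines(root, patterns):
    """Return sorted 'digest  relative/path' lines for files under root matching patterns."""
    root = Path(root)
    files = sorted(
        {p for pattern in patterns for p in root.rglob(pattern) if p.is_file()}
    )
    return [f"{sha256_of(p)}  {p.relative_to(root).as_posix()}" for p in files]
```

The build could print hash_lines(".", ["*.c", "*.h", "*.py"]) before compiling and hash_lines(".", ["*.so"]) afterwards, so that anyone can compare the logged hashes against a checkout of the tag and against the binary in the published wheel. The output is sorted so that two runs over the same files produce identical logs.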