pantsbuild / pants

The Pants Build System
https://www.pantsbuild.org
Apache License 2.0

Pants lockfile generation includes un-used dists and thus un-vetted dists. #12458

Open jsirois opened 3 years ago

jsirois commented 3 years ago

Pants uses pip-compile to generate lockfiles, and pip-compile apparently includes all dists for a given version of a requirement, whether or not they were actually used in the resolve. This is a security and stability problem.

For example, if the original requirement is foo>=1.0.0 and the IC is CPython==3.7.*, the lockfile might contain foo==1.0.0 with hashes for both the sdist and the cp37m wheel. Say the lockfile was generated, and later tested, with a CPython 3.7 interpreter built with the pymalloc extension; the wheel is then what is actually resolved and tested, and the sdist goes unused and untested. A lockfile consumer running a CPython 3.7 without the pymalloc extension will instead resolve and build that sdist, potentially years later. In the normal case the sdist is faithful to the cp37m wheel and produces a cp37 wheel that is roughly equivalent. Even so, there is no guarantee that wheel behaves the same; there may be code conditional on the presence of the pymalloc extension that is buggy in the non-pymalloc branch. Worse, the wheels could be the honeypot and the sdist the trap (see https://docs.google.com/document/d/17Y6_YjLqv6hY5APWLaRw5CMvr_FVDBIWK6WS57mioNg/edit?usp=sharing for related concerns).
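For concreteness, the resulting entry in a requirements.txt-style lockfile would look something like the following (the project name, file names, and hashes are placeholders): both hashes are pinned, but only the wheel is ever exercised by the cp37m consumer.

foo==1.0.0 \
    --hash=sha256:<hash of foo-1.0.0.tar.gz, the never-exercised sdist> \
    --hash=sha256:<hash of foo-1.0.0-cp37-cp37m-manylinux1_x86_64.whl>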

It appears to be the case that poetry.lock and Pipfile.lock have the same issue.

Eric-Arellano commented 3 years ago

(From Slack):

pip-compile/Poetry/Pipenv are making a tradeoff. You generate on a single platform, but may use that lockfile on others.

If I generated a lockfile with the pantsbuild.pants wheel on my M1, it would only include the hash for the macos_arm64 wheel, even though people using the lockfile need the other wheels' hashes.

Trading off some security for flexibility to use the lockfile on multiple platforms.

@benjyw went on to point out:

Lockfiles only help with supply chain attacks in that you're guaranteed to resolve the same thing every time. There is an implicit assumption that you've vetted that one thing to be safe.

--

I suspect the vast majority of users will prefer the flexibility of the resolve working on multiple platforms over the risk of worse security/stability. We know many Pants users use macOS for desktop use and Linux for CI, for example. I'm assuming this "issue"/"feature" has continued to work well for Pipenv/Poetry/pip-compile, as implied by their continued usage of it.

FWIW, organizations that take supply chain attacks really seriously can hand-edit the requirements.txt-style lockfile Pants generates to remove hashes for release files they do not trust. If that ends up being important to an org, we could perhaps look into helping to automate that as a follow-up.
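A hypothetical sketch of what that automation could look like, assuming a simple denylist of hashes the org has not vetted; the file name, denylist contents, and rewrite policy are all illustrative, not an existing Pants feature:

# Hypothetical sketch: rewrite a requirements.txt-style lockfile, dropping any
# --hash entries the org has not vetted.
DENYLISTED_HASHES = {
    "sha256:<hash of a release file the org does not trust>",
}

def strip_denylisted(lock_text: str) -> str:
    out_lines = []
    logical = ""
    for raw in lock_text.splitlines():
        # Fold backslash continuations into one logical line so dropping a hash
        # cannot leave a dangling continuation.
        if raw.rstrip().endswith("\\"):
            logical += raw.rstrip()[:-1] + " "
            continue
        logical += raw
        kept = [
            token
            for token in logical.split()
            if token.replace("--hash=", "", 1) not in DENYLISTED_HASHES
        ]
        out_lines.append(" ".join(kept))
        logical = ""
    return "\n".join(out_lines) + "\n"

with open("lockfile.txt") as fp:
    rewritten = strip_denylisted(fp.read())
with open("lockfile.txt", "w") as fp:
    fp.write(rewritten)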

jsirois commented 3 years ago

I think you're right, but I'm also not convinced the current state of the art is the best way to make that tradeoff. Another way would be to be able to generate partial lockfiles and then merge them. Not much less convenient, but more secure. This all assumes the current state of the art is not simply doing the wrong thing by assuming that, if a resolve computes req1==X, then all dists for X should be included; environment markers can make that assumption invalid.
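A rough sketch of that partial-lockfile merge, assuming one pip-compile --generate-hashes run per target platform; the file names and the naive union-of-hashes policy are illustrative, and it punts on real problems like conflicting pins or differing environment markers across platforms:

import re
from collections import defaultdict

# Matches a pinned requirement (optionally with an environment marker) at the
# start of a lockfile line, e.g. 'attrs==21.2.0 \' or 'foo==1.0; sys_platform == "win32" \'.
PIN = re.compile(r"^(?P<req>[A-Za-z0-9._\[\]-]+==[^ \\;]+(?:;[^\\]*)?)")
HASH = re.compile(r"--hash=(sha256:[0-9a-f]{64})")

def parse(path):
    # Return {pinned requirement: set of hashes} for one per-platform lockfile.
    pins = defaultdict(set)
    current = None
    with open(path) as fp:
        for line in fp:
            line = line.strip()
            match = PIN.match(line)
            if match:
                current = match.group("req").strip()
            if current:
                pins[current].update(HASH.findall(line))
    return pins

def merge(paths):
    # Naive union; a real implementation would have to reconcile pins that
    # differ across platforms rather than keep both.
    merged = defaultdict(set)
    for path in paths:
        for req, hashes in parse(path).items():
            merged[req] |= hashes
    return merged

for req, hashes in sorted(merge(["lock.linux_x86_64.txt", "lock.macos_arm64.txt"]).items()):
    print(req + " \\")
    print("    " + " \\\n    ".join(f"--hash={h}" for h in sorted(hashes)))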

benjyw commented 3 years ago

The lockfile does at least limit the window in which you can be exposed to an attack, namely the time you generate the lockfile. Assuming that happened when PyPI was uncompromised, your resolves will continue to be safe until the next lockfile change. That is not nothing.

Eric-Arellano commented 3 years ago

@chrisjrn proposed an idea that could allow us to do this if we go with Poetry 🙌 In between running poetry lock and poetry export, we can post-process poetry.lock and mutate it how we want, like deleting requirements outright (e.g. Windows-specific deps) and removing specific hashes.

--

For hashes, poetry.lock stores them like this:

[metadata.files]
attrs = [
    {file = "attrs-21.2.0-py2.py3-none-any.whl", hash = "sha256:149e90d6d8ac20db7a955ad60cf0e6881a3f20d37096140088356da6c716b0b1"},
    {file = "attrs-21.2.0.tar.gz", hash = "sha256:ef6aaac3ca6cd92904cdd0d83f629a15f18053ec84e6432106f7a4d04ae4f5fb"},
]
colorama = [
    {file = "colorama-0.4.4-py2.py3-none-any.whl", hash = "sha256:9f47eda37229f68eee03b24b9748937c7dc3868f906e8ba69fbcbdd3bc5dc3e2"},
    {file = "colorama-0.4.4.tar.gz", hash = "sha256:5941b2b48a20143d2267e95b1c2a7603ce057ee39fd88e7329b0c292aa16869b"},
]

I've confirmed that if you delete an entry from the list for a particular dep, then run poetry export, it will not include that hash.

Presumably, we could inspect the file names to determine if any are unused?
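Something like the following could do that pruning between poetry lock and poetry export, assuming we key off wheel platform tags; the allow-list and the keep-everything-else policy are assumptions for illustration, not an existing Pants feature:

import tomlkit
from packaging.utils import parse_wheel_filename

# Hypothetical policy: keep universal and Linux/macOS wheels, drop everything else.
ALLOWED_PLATFORM_PREFIXES = ("any", "manylinux", "macosx")

def keep(file_name: str) -> bool:
    if not file_name.endswith(".whl"):
        return True  # this sketch leaves sdists alone
    _, _, _, tags = parse_wheel_filename(file_name)
    return any(tag.platform.startswith(ALLOWED_PLATFORM_PREFIXES) for tag in tags)

def prune(lock_path: str = "poetry.lock") -> None:
    with open(lock_path) as fp:
        doc = tomlkit.parse(fp.read())
    files = doc["metadata"]["files"]
    for project in files:
        files[project] = [entry for entry in files[project] if keep(entry["file"])]
    with open(lock_path, "w") as fp:
        fp.write(tomlkit.dumps(doc))

prune()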

--

You can simply delete a dep from poetry.lock for it to not show up in requirements.txt. But it's more involved to determine if it's platform-specific.

For example, Pytest depends on atomicwrites, but only on Windows. This is the atomicwrites entry, without any reference to Windows:

[[package]]
name = "atomicwrites"
version = "1.4.0"
description = "Atomic file writes."
category = "dev"
optional = false
python-versions = ">=2.7, !=3.0.*, !=3.1.*, !=3.2.*, !=3.3.*"

Instead, the platform information comes from Pytest's entry:

[[package]]
name = "pytest"
version = "5.4.3"
description = "pytest: simple powerful testing with Python"
category = "dev"
optional = false
python-versions = ">=3.5"

[package.dependencies]
atomicwrites = {version = ">=1.0", markers = "sys_platform == \"win32\""}
attrs = ">=17.4.0"
colorama = {version = "*", markers = "sys_platform == \"win32\""}
more-itertools = ">=4.0.0"
packaging = "*"
pluggy = ">=0.12,<1.0"
py = ">=1.5.0"
wcwidth = "*"

We'd need to know how to evaluate the marker and then follow the dep chain to recursively remove all unused dists, IIUC.

Another possibility could be modifying lockfile.txt itself via inspection of markers. I think this would allow us to avoid having to recursively follow deps; we'd only remove any deps whose env markers never apply for the environments we care about (see the sketch after the example below).

atomicwrites==1.4.0; python_version >= "3.5" and python_full_version < "3.0.0" and sys_platform == "win32" or sys_platform == "win32" and python_version >= "3.5" and python_full_version >= "3.4.0" \
    --hash=sha256:6d1784dea7c0c8d4a5172b6c620f40b6e4cbfdf96d783691f2e1302a7b88e197 \
    --hash=sha256:ae70396ad1a434f9c7046fd2dd196fc04b12f9e91ffb859164193be8b6168a7a
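Here is a sketch of that marker evaluation using packaging's Requirement/Marker support; the target environments are illustrative assumptions about which platforms a resolve claims to support:

from packaging.requirements import Requirement

# Minimal marker environments for the platforms we (hypothetically) support.
TARGET_ENVIRONMENTS = [
    {"sys_platform": "linux", "python_version": "3.9", "python_full_version": "3.9.16"},
    {"sys_platform": "darwin", "python_version": "3.9", "python_full_version": "3.9.16"},
]

def is_needed(requirement_line: str) -> bool:
    # Keep the pin if it has no marker, or its marker is true for any target environment.
    req = Requirement(requirement_line)
    if req.marker is None:
        return True
    return any(req.marker.evaluate(environment=env) for env in TARGET_ENVIRONMENTS)

# A simplified version of the atomicwrites pin above only ever applies on Windows,
# so it would be dropped:
print(is_needed('atomicwrites==1.4.0; python_version >= "3.5" and sys_platform == "win32"'))  # False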

--

Of course, we would also need to have figured out which platforms/environments are going to be used. This is easy when platforms are specified, but not as clear for a normal resolve. For non-platform resolves, we could ban all Windows deps, given that Pants doesn't work on Windows, for example? Or perhaps allow users to specify which OSes they plan to use?

jsirois commented 9 months ago

Long overdue follow-up, but there was sloppy thinking here:

The lockfile does at least limit the window in which you can be exposed to an attack, namely the time you generate the lockfile.

Definitely not true. If an sdist is locked and the sdist's build system has any floating direct or transitive deps (which is extremely common), you are not safe from a future injection when the unlocked build system or its dependencies release new versions.
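For example, a typical sdist only declares version ranges for its build backend, along the lines of the (illustrative) pyproject.toml below; nothing the consuming lockfile pins or hashes covers what pip installs into that build environment at build time.

[build-system]
# None of this is pinned or hashed by the consuming lockfile, so building the
# sdist later can pull newer, unvetted releases of these projects and their deps.
requires = ["setuptools>=61", "wheel"]
build-backend = "setuptools.build_meta"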

Toward that end, though, and also re:

@chrisjrn proposed an idea that could allow us to do this if we go with Poetry 🙌 In between running poetry lock and poetry export, we can post-process poetry.lock and mutate it how we want, like deleting requirements out-right (e.g. Windows-specific deps) and removing specific hashes.

You can use --style universal --no-build to get a universal lock with no sdists, and, belatedly, you can use --target-system to block out Windows dists (Pants does this already).
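For reference, those flags as used via Pex's pex3 lock create subcommand, which Pants drives under the hood; treat the exact invocation below (output option, example requirement) as an assumed sketch rather than a canonical recipe:

pex3 lock create \
  --style universal \
  --no-build \
  --target-system linux \
  --target-system mac \
  -o lock.json \
  requests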