pypa / pip

The Python package installer
https://pip.pypa.io/
MIT License

Implement PEP-458: Secure PyPI downloads with signed repository metadata #8585

Open jku opened 4 years ago

jku commented 4 years ago

What's the problem this feature will solve?

PEP-458:

protect users of PyPI from compromises of the integrity, consistency, and freshness properties of PyPI packages, and enhance compromise resilience by mitigating key risk and providing mechanisms to recover from a compromise of PyPI or its signing keys

This will allow pip to be more secure against attacks on PyPI mirrors and PyPI's content distribution network. The implementation ("the minimum security model") supports verification of PyPI distributions that are signed with keys stored on PyPI, but the pip client implementation should just continue working if/when Warehouse moves to the "maximum security model" (PEP-480) where both PyPI and the developers sign distributions.

original discussion on the PEP: https://discuss.python.org/t/pep-458-secure-pypi-downloads-with-package-signing/2648

Describe the solution you'd like

pip should use the TUF reference client library to secure downloads from pypi.org (and third-party Warehouse instances that support TUF). This should happen without affecting the user experience in any major way (except, of course, when TUF prevents a download for security reasons). The implementation should allow using both TUF-enabled and non-TUF-enabled repositories at the same time: no existing functionality should break.
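Purely as an illustration of that last requirement (every name below is invented for this sketch, not pip's actual internals), the choice could boil down to selecting a verifying fetcher per index:

```python
from typing import Callable, Set

Fetcher = Callable[[str], bytes]

def select_fetcher(index_url: str,
                   tuf_enabled_indexes: Set[str],
                   tuf_fetch: Fetcher,
                   plain_fetch: Fetcher) -> Fetcher:
    """Use the TUF-verifying fetcher only for indexes known to publish
    signed repository metadata; every other index keeps today's behaviour,
    so nothing that works now stops working."""
    return tuf_fetch if index_url in tuf_enabled_indexes else plain_fetch
```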

More information:

Additional context

I'm currently planning how to do this and am prepared to work on the actual implementation. The current state of things is:

chrahunt commented 4 years ago

The reference implementation for a tuf client transitively depends on a handful of libraries that aren't pure Python (cryptography, for example). Our current vendoring policy states:

Vendored libraries MUST function without any build steps such as 2to3 or compilation of C code, practically this limits to single source 2.x/3.x and pure Python.

No comment on how to address this, I just wanted to call it out since it seems like the biggest potential roadblock to me.

jku commented 4 years ago

I've reviewed the dependencies and I believe the issue is not as bad as it first seems: the crypto choices made in Warehouse mean that the client should not need the cryptography or pynacl libraries. My understanding is that pip would currently need to vendor only tuf, securesystemslib and maybe iso8601 as new dependencies (iso8601 is tiny, but we may be able to drop it as well: https://github.com/theupdateframework/tuf/issues/1065).

EDIT: I'll double-check this just to be sure later this week ;)

jku commented 4 years ago

Could not leave this nagging in the back of my head, so I've now double-checked the transitive dependencies. Warehouse is planning to use ed25519 signatures, which can be verified using the pure-Python implementation included in securesystemslib (a vendored copy of https://github.com/pyca/ed25519/). No C dependencies are required.
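As a rough sketch of that pure-Python path (illustrative only: the exact securesystemslib key and signature formats vary between versions, so don't read this as the vendored API):

```python
from securesystemslib import keys

# Sign and verify some bytes with ed25519. securesystemslib falls back to
# its bundled pure-Python ed25519 implementation when pynacl is not
# installed, so verification needs no C extension.
key = keys.generate_ed25519_key()
payload = b"example repository metadata"
signature = keys.create_signature(key, payload)
assert keys.verify_signature(key, signature, payload)
```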

jku commented 4 years ago

A bit more detail on how I plan to implement this.

Summary

Status

I have a WIP design doc; feel free to have a look, but please note that it is mostly a reference for myself. I also have a pip fork with some filed issues to keep track of things and some preliminary code: I intend to work there until I have something that can actually be tested. At that point I'll come back for some early review -- that should be a good time to find any absolute blockers. I believe I have identified all of the missing pieces and issues with my plans (and the Warehouse plans); not all of these have been fully solved yet, see below. The Warehouse implementation is not yet ready (it does not support index file verification), so unfortunately very little can be properly tested right now. If someone wants to test this or work on it, let me know and I'll show my test setup.

Example package install flow

Everything in italics is unchanged from current pip

Let me know if you'd like more details or examples.

Other details

Known Open Issues

There are a few more issues in https://github.com/jku/pip/issues/ but these are the main ones I believe:

pfmoore commented 4 years ago

TUF downloads files (and does not just verify them): this is a visible issue as TUF currently provides no progress indication, but it might be an issue otherwise too if pip really wants low level control of the http request or the cache

We do have download progress bars, so I think we need to consider that question. While I think "show stopper" is a bit strong, pip just sitting there for potentially quite a long time (some packages are huge) with no indication of progress is far from ideal, and we wouldn't want to permanently block any chance of having some sort of "downloading, xx% complete" progress indicator.

Also, how does this integrate with the existing work on parallel downloads and partial downloads of wheels to extract metadata? That sounds like pip needing low-level control of the HTTP request.

This seems like it's turning file downloads into a black box that pip has no control over, which doesn't seem ideal given that's where we are currently spending a lot of development effort... I suggest you look through the tracker and make sure you are involved in discussions on any items that might be affected by this work. I'm a bit concerned that you didn't find these pieces of work already...

Resolving the distribution target name from URL requires knowledge of the hash implementation or the server configuration

I have no idea what that means, so I can't tell if it's important for pip. But we do rely on being able to parse the URL for a package we read from the index (to get the project name and version, and in the case of wheels the compatibility tags). If you're saying we can no longer do that, then this is a major issue.

McSinyx commented 4 years ago

I'm cross-linking pypa/warehouse#8254 while trying to understand this thread :stuck_out_tongue:

jku commented 4 years ago

We do have download progress bars, so I think we need to consider that question. While I think "show stopper" is a bit strong, pip just sitting there for potentially quite a long time (some packages are huge) with no indication of progress is far from ideal

Agreed, and adding this to a future TUF release is almost certainly not an issue.

Also, how does this integrate with the existing work on parallel downloads and partial downloads of wheels to extract metadata? That sounds like pip needing low-level control of the HTTP request.

On the issue of extracting dependency metadata from wheels: I've seen Warehouse folks discussing this (and providing the metadata separately from the wheels) with regards to TUF but cannot remember right now where that happened -- I will try to find it.

Parallel downloads are not supported by TUF at the moment but I do not see why they could not be in the future.

This seems like it's turning file downloads into a black box that pip has no control over, which doesn't seem ideal given that's where we are currently spending a lot of development effort.

This is indeed the most important question -- if low-level control of HTTP requests is something pip does want, then all the previous issues are moot: we should instead talk to the TUF folks about providing an alternative API for this use case.

This does not have to be decided right now (I'll do an implementation with current API in any case) but the discussion can be started.

I suggest you look through the tracker and make sure you are involved in discussions on any items that might be affected by this work. I'm a bit concerned that you didn't find these pieces of work already...

I'm a simple guy and can only keep track of so many moving parts in my head. I've been concentrating on the TUF and Warehouse TUF implementation details so far, trying to make sure an implementation is even possible.

Resolving the distribution target name from URL requires knowledge of the hash implementation or the server configuration

I have no idea what that means, so I can't tell if it's important for pip. But we do rely on being able to parse the URL for a package we read from the index (to get the project name and version, and in the case of wheels the compatibility tags). If you're saying we can no longer do that, then this is a major issue.

No, I don't think this is an issue for pip. The index file contents do not change; the issue is that we need to split the distribution URL into a distribution mirror base URL and a target name recognised by TUF. In the example these would be https://files.pythonhosted.org/packages/ and b8/f7/dd9223b39f683690c30f759c876df0944815e47b588cb517e4b9e652bcf7/sampleproject-2.0.0-py3-none-any.whl respectively. To do that split we need a little bit of information (e.g. that the target name has enough directories to form a blake2 hash); the Warehouse folks have not seen this as an issue so far.

The reason I mention it is that if Warehouse happened to change those details in the future, that could then become very annoying for a client that made the wrong assumptions about the details...
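A minimal sketch of the split described above, using the example URL from this comment and assuming (as noted) that everything after .../packages/ is the TUF target name:

```python
MIRROR_BASE = "https://files.pythonhosted.org/packages/"

def split_distribution_url(dist_url: str):
    """Split a distribution URL into (mirror base URL, TUF target name).

    Assumes the Warehouse layout discussed above: the target name is the
    blake2-hash directories plus the filename. If Warehouse changed that
    layout, this split would silently go wrong.
    """
    if not dist_url.startswith(MIRROR_BASE):
        raise ValueError("URL is not on the expected distribution mirror")
    return MIRROR_BASE, dist_url[len(MIRROR_BASE):]

base, target = split_distribution_url(
    "https://files.pythonhosted.org/packages/"
    "b8/f7/dd9223b39f683690c30f759c876df0944815e47b588cb517e4b9e652bcf7/"
    "sampleproject-2.0.0-py3-none-any.whl"
)
# base   == "https://files.pythonhosted.org/packages/"
# target == "b8/f7/dd9223.../sampleproject-2.0.0-py3-none-any.whl"
```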

pradyunsg commented 4 years ago

I have no idea what that means, so I can't tell if it's important for pip. But we do rely on being able to parse the URL for a package we read from the index (to get the project name and version, and in the case of wheels the compatibility tags). If you're saying we can no longer do that, then this is a major issue.

(I think) This basically means that we'd have the hash included in the simple index page we look for (which will use information provided from the TUF metadata). Basically something like https://pypi.org/simple/PROJECT.HASH. See https://github.com/pypa/warehouse/issues/8487#issuecomment-683391138 for context.
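For illustration only (the real naming scheme is whatever PEP 458 and the Warehouse implementation specify; the layout below is invented), the idea is that the index page becomes content-addressed:

```python
import hashlib

def hashed_index_name(project: str, index_html: bytes) -> str:
    # Hypothetical layout matching the PROJECT.HASH shape above: suffix the
    # normalized project name with a blake2b digest of the page contents,
    # so the client can check that the page it fetched is exactly the one
    # the signed TUF metadata refers to.
    digest = hashlib.blake2b(index_html, digest_size=32).hexdigest()
    return f"{project}.{digest}"
```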

pradyunsg commented 4 years ago

Parallel downloads are not supported by TUF at the moment but I do not see why they could not be in the future.

TBH, that's not an issue: I think a much better area for improvement, once we have a TUF-protected PyPI, is to give ourselves a JSON API for dependency resolution that is also TUF-protected.

pradyunsg commented 4 years ago

I suggest you look through the tracker and make sure you are involved in discussions on any items that might be affected by this work. I'm a bit concerned that you didn't find these pieces of work already...

See https://github.com/pypa/pip/issues/7819, which has most of the relevant discussions. I think the first post there covers basically everything that we're thinking of doing?

jku commented 4 years ago

https://github.com/pypa/warehouse/issues/8254 is the Warehouse issue about serving wheel metadata separately. This implementation should just work with TUF verification (unlike partial downloads -- I don't see how those could be reasonably verified even to the standard that wheel files are verified right now)

dstufft commented 4 years ago

Partial downloads are basically unsupportable with signed data. There are some schemes to make it possible, but those schemes are all super fragile and annoying, and also really subtle in how you use them, which I wouldn't feel comfortable relying on.

Parallel downloads should be fine to implement, there's no security related concerns with that.

One big possible reason for wanting to keep control of the actual downloading in pip is to support better control of concurrency and concurrency primitives. If the TUF library is a black box that does I/O, it becomes difficult to, say, move pip to an async I/O paradigm (not that we currently have any plans to do that), or to ensure that all the pieces handling concurrency cooperate so we maintain certain levels of concurrency and don't exceed them. Other benefits are things like handling authentication, timeouts, etc. There is a lot of stuff there, and it feels like it might be better if TUF exposed APIs such that it didn't expect to "own" the downloading, but instead could communicate out "hey, fetch these URLs", and then pip could fetch them and pass the contents back in for verification (and of course, there might be multiple rounds of this to fully verify things).

That's a bit harder to implement, so maybe it's out of scope, but a setup like that allows a lot more control and reusability, I think. It still doesn't allow things like partial downloads (because a key requirement here is that no fetched content is trusted until it's been verified, and we can't verify until we have the entire file).
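To spell out the "entire file" point, a tiny sketch of whole-file verification (plain sha256 used purely for illustration; TUF metadata records hashes and lengths per target):

```python
import hashlib

def check(data: bytes, expected_sha256: str) -> None:
    """Full-file verification: only possible once every byte is present."""
    if hashlib.sha256(data).hexdigest() != expected_sha256:
        raise ValueError("hash mismatch, do not trust this content")

# A partial (range) download yields only a slice of the file, so the digest
# recorded in the signed metadata can never be recomputed and compared.
```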

dstufft commented 4 years ago

This is more on the TUF side of things, and honestly it's a non-trivial amount of work, so it might not be worth it, but what I described above is basically writing the library in the sans-io style. If there's any appetite for doing that, I have some experience with it, so I can advise if needed.
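To make the sans-io shape concrete, here is a rough sketch of what such an API could look like; every name in it is invented for illustration and this is not an existing TUF interface:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Iterator
import urllib.request

@dataclass
class FetchRequest:
    url: str          # what the verifier wants fetched next
    max_length: int   # upper bound on how many bytes to accept

class SansIOVerifier(ABC):
    """Hypothetical sans-io verifier: it never performs I/O itself, it only
    says what to fetch and checks the bytes handed back to it."""

    @abstractmethod
    def pending(self) -> Iterator[FetchRequest]:
        """Yield fetch requests (root, timestamp, snapshot, targets, ...)."""

    @abstractmethod
    def supply(self, url: str, data: bytes) -> None:
        """Verify signatures/hashes/lengths; raise if anything is wrong."""

def drive(verifier: SansIOVerifier) -> None:
    # pip keeps control of the HTTP layer here: progress reporting,
    # timeouts, authentication and concurrency all stay on pip's side.
    for request in verifier.pending():
        with urllib.request.urlopen(request.url) as response:
            data = response.read(request.max_length)
        verifier.supply(request.url, data)
```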

pfmoore commented 4 years ago

(I think) This basically means that we'd have the hash included in the simple index page we look for (which will use information provided from the TUF metadata). Basically something like https://pypi.org/simple/PROJECT.HASH. See pypa/warehouse#8487 (comment) for context.

Isn't that in violation of PEP 503, which says "Below the root URL is another URL for each individual project contained within a repository. The format of this URL is /<project>/ where the <project> is replaced by the normalized name for that project, so a project named "HolyGrail" would have a URL like /holygrail/"?
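For reference, the normalization that quoted passage relies on is defined by PEP 503 itself as a simple transformation:

```python
import re

def normalize(name: str) -> str:
    """PEP 503 project name normalization."""
    return re.sub(r"[-_.]+", "-", name).lower()

assert normalize("HolyGrail") == "holygrail"   # -> /simple/holygrail/
```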

dstufft commented 4 years ago

Those URLs still work; the implication here is that the TUF PEP updates the simple repository format to have additional URLs, if that repository opts into TUF.

pfmoore commented 4 years ago

So there's an update needed to PEP 503, then? Or is this covered in the TUF PEP(s) (which I haven't followed, so I've no idea what state they are in)?

At a minimum I'd expect there to be a statement somewhere that the PEP 503 URLs and the "other" ones MUST contain the same content... (Sorry, I'm putting my standards hat on now...)

dstufft commented 4 years ago

The other ones are snapshots of the 503 URLs, so they'll contain the same content... at some specific point in time, while the 503 URLs are basically the latest versions.

pfmoore commented 4 years ago

OK. So all covered in the TUF PEPs then I guess? (I have no idea about snapshot stuff).

My only real concern is the extent to which this is not transparent to the client, so pip maintainers need to know this stuff. Pip's source code already makes my head hurt, and it's only being able to check the standards that keeps me sane sometimes 🙁

(I'm not joking here - I think we have a real problem with pip's codebase becoming so complex that even the core maintainers struggle to follow it. We need to reduce complexity, not increase it).

dstufft commented 4 years ago

Yea it's covered in the TUF PEP, and ideally abstracted away in the TUF client.

jku commented 3 years ago

There is now a draft PR available: I'm hoping for high-level review and maybe opinions on the discussion items listed in the PR.

The PR is from the branch with vendored sources, so it's quite massive; there's another branch that only includes the code changes without the vendoring: https://github.com/pypa/pip/compare/master...jku:tuf-mvp

If you think I should make a PR with the code changes only, let me know: I did it this way to show that lint and unit tests pass (I'm not sure what the HTTPTooManyRequests errors in some of the integration tests are).

pradyunsg commented 3 years ago

(I'm not sure what the HTTPTooManyRequests errors in some of the integration tests are).

#9030

di commented 2 years ago

FYI, we now have a roadmap for PEP 458 for PyPI: https://github.com/pypa/warehouse/issues/10672

pradyunsg commented 1 year ago

I've labelled this as blocked, since this needs the entire checklist on https://github.com/pypi/warehouse/issues/10672 to be completed first.