pypa / pip

The Python package installer
https://pip.pypa.io/
MIT License

Implement PEP-458: Secure PyPI downloads with signed repository metadata #8585

Open jku opened 4 years ago

jku commented 4 years ago

What's the problem this feature will solve?

PEP-458:

protect users of PyPI from compromises of the integrity, consistency, and freshness properties of PyPI packages, and enhance compromise resilience by mitigating key risk and providing mechanisms to recover from a compromise of PyPI or its signing keys

This will allow pip to be more secure against attacks on PyPI mirrors and PyPI's content distribution network. The implementation ("the minimum security model") supports verification of PyPI distributions that are signed with keys stored on PyPI, but the pip client implementation should just continue working if/when Warehouse moves to the "maximum security model" (PEP-480) where both PyPI and the developers sign distributions.

original discussion on the PEP: https://discuss.python.org/t/pep-458-secure-pypi-downloads-with-package-signing/2648

Describe the solution you'd like

pip should use the TUF reference client library to secure downloads from pypi.org (and third-party Warehouse instances that support TUF). This should happen without affecting the user experience in any major way (except, of course, when TUF prevents a download for security reasons). The implementation should allow using both TUF-enabled and non-TUF-enabled repositories at the same time: no existing functionality should break.
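Purely as an illustration of that last requirement (every name below is invented for this sketch, not pip's actual internals), the choice could boil down to selecting a verifying fetcher per index:

```python
from typing import Callable, Set

Fetcher = Callable[[str], bytes]

def select_fetcher(index_url: str,
                   tuf_enabled_indexes: Set[str],
                   tuf_fetch: Fetcher,
                   plain_fetch: Fetcher) -> Fetcher:
    """Use the TUF-verifying fetcher only for indexes known to publish
    signed repository metadata; every other index keeps today's behaviour,
    so nothing that works now stops working."""
    return tuf_fetch if index_url in tuf_enabled_indexes else plain_fetch
```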

More information:

Additional context

I'm currently planning how to do this and am prepared to work on the actual implementation. The current state of things is:

chrahunt commented 4 years ago

The reference implementation for a tuf client transitively depends on a handful of libraries that aren't pure Python (cryptography, for example). Our current vendoring policy states:

Vendored libraries MUST function without any build steps such as 2to3 or compilation of C code, practically this limits to single source 2.x/3.x and pure Python.

No comment on how to address this, I just wanted to call it out since it seems like the biggest potential roadblock to me.

jku commented 4 years ago

I've reviewed the dependencies and I believe the issue is not as bad as it first seems: the crypto choices made in Warehouse mean that the client should not need the cryptography or pynacl libraries. My understanding is that pip would currently need to vendor only tuf, securesystemslib and maybe iso8601 as new dependencies (iso8601 is tiny, but we may be able to drop it as well: https://github.com/theupdateframework/tuf/issues/1065).

EDIT: I'll double-check this just to be sure later this week ;)

jku commented 4 years ago

Could not leave this nagging in the back of my head, so I've now double-checked the transitive dependencies. Warehouse is planning to use ed25519 signatures, which can be verified using the pure-Python implementation included in securesystemslib (a vendored copy of https://github.com/pyca/ed25519/). No C dependencies are required.
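As a rough sketch of that pure-Python path (illustrative only: the exact securesystemslib key and signature formats vary between versions, so don't read this as the vendored API):

```python
from securesystemslib import keys

# Sign and verify some bytes with ed25519. securesystemslib falls back to
# its bundled pure-Python ed25519 implementation when pynacl is not
# installed, so verification needs no C extension.
key = keys.generate_ed25519_key()
payload = b"example repository metadata"
signature = keys.create_signature(key, payload)
assert keys.verify_signature(key, signature, payload)
```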

jku commented 4 years ago

A bit more detail on how I plan to implement this.

Summary

Status

I have a WIP design doc; feel free to have a look, but please note that it is mostly a reference for myself. I also have a pip fork with some filed issues to keep track of things and some preliminary code: I intend to work there until I have something that can actually be tested. At that point I'll come back for some early review -- that should be a good time to find any absolute blockers. I believe I have identified all of the missing pieces and issues with my plans (and the Warehouse plans); not all of these have been fully solved yet, see below. The Warehouse implementation is not yet ready (it does not support index file verification), so unfortunately very little can be properly tested right now. If someone wants to test this or work on it, let me know and I'll show my test setup.

Example package install flow

Everything in italics is unchanged from current pip

Let me know if you'd like more details or examples.

Other details

Known Open Issues

There are a few more issues in https://github.com/jku/pip/issues/ but these are the main ones I believe:

pfmoore commented 4 years ago

TUF downloads files (and does not just verify them): this is a visible issue as TUF currently provides no progress indication, but it might be an issue otherwise too if pip really wants low level control of the http request or the cache

We do have download progress bars, so I think we need to consider that question. While I think "show stopper" is a bit strong, pip just sitting there for potentially quite a long time (some packages are huge) with no indication of progress is far from ideal, and we wouldn't want to permanently block any chance of having some sort of "downloading, xx% complete" progress indicator.

Also, how does this integrate with the existing work on parallel downloads and partial downloads of wheels to extract metadata? That sounds like pip needing low-level control of the HTTP request.

This seems like it's turning file downloads into a black box that pip has no control over, which doesn't seem ideal given that's where we are currently spending a lot of development effort... I suggest you look through the tracker and make sure you are involved in discussions on any items that might be affected by this work. I'm a bit concerned that you didn't find these pieces of work already...

Resolving the distribution target name from URL requires knowledge of the hash implementation or the server configuration

I have no idea what that means, so I can't tell if it's important for pip. But we do rely on being able to parse the URL for a package we read from the index (to get the project name and version, and in the case of wheels the compatibility tags). If you're saying we can no longer do that, then this is a major issue.

McSinyx commented 4 years ago

I'm cross-linking pypa/warehouse#8254 while trying to understand this thread :stuck_out_tongue:

jku commented 4 years ago

We do have download progress bars, so I think we need to consider that question. While I think "show stopper" is a bit strong, pip just sitting there for potentially quite a long time (some packages are huge) with no indication of progress is far from ideal

Agreed, and adding this to a future TUF release is almost certainly not an issue.

Also, how does this integrate with the existing work on parallel downloads and partial downloads of wheels to extract metadata? That sounds like pip needing low-level control of the HTTP request.

On the issue of extracting dependency metadata from wheels: I've seen Warehouse folks discussing this (and providing the metadata separately from the wheels) with regards to TUF but cannot remember right now where that happened -- I will try to find it.

Parallel downloads are not supported by TUF at the moment but I do not see why they could not be in the future.

This seems like it's turning file downloads into a black box that pip has no control over, which doesn't seem ideal given that's where we are currently spending a lot of development effort.

This is indeed the most important question -- if low-level control of HTTP requests is something pip does want, then all the previous issues are moot: we should instead talk to the TUF folks about providing an alternative API for this use case.

This does not have to be decided right now (I'll do an implementation with current API in any case) but the discussion can be started.

I suggest you look through the tracker and make sure you are involved in discussions on any items that might be affected by this work. I'm a bit concerned that you didn't find these pieces of work already...

I'm a simple guy and can only keep track of so many moving parts in my head. I've been concentrating on the TUF and Warehouse TUF implementation details so far, trying to make sure an implementation is even possible.

Resolving the distribution target name from URL requires knowledge of the hash implementation or the server configuration

I have no idea what that means, so I can't tell if it's important for pip. But we do rely on being able to parse the URL for a package we read from the index (to get the project name and version, and in the case of wheels the compatibility tags). If you're saying we can no longer do that, then this is a major issue.

No, I don't think this is an issue for pip. The index file contents do not change; the issue is that we need to split the distribution URL into a distribution mirror base URL and a target name recognised by TUF. In the example these would be https://files.pythonhosted.org/packages/ and b8/f7/dd9223b39f683690c30f759c876df0944815e47b588cb517e4b9e652bcf7/sampleproject-2.0.0-py3-none-any.whl respectively. To do that split we need a little bit of information (e.g. that the target name has enough directories to form a blake2 hash); the Warehouse folks have not seen this as an issue so far.

The reason I mention it is that if Warehouse happened to change those details in the future, that could then become very annoying for a client that made the wrong assumptions about the details...
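A minimal sketch of the split described above, using the example URL from this comment and assuming (as noted) that everything after .../packages/ is the TUF target name:

```python
MIRROR_BASE = "https://files.pythonhosted.org/packages/"

def split_distribution_url(dist_url: str):
    """Split a distribution URL into (mirror base URL, TUF target name).

    Assumes the Warehouse layout discussed above: the target name is the
    blake2-hash directories plus the filename. If Warehouse changed that
    layout, this split would silently go wrong.
    """
    if not dist_url.startswith(MIRROR_BASE):
        raise ValueError("URL is not on the expected distribution mirror")
    return MIRROR_BASE, dist_url[len(MIRROR_BASE):]

base, target = split_distribution_url(
    "https://files.pythonhosted.org/packages/"
    "b8/f7/dd9223b39f683690c30f759c876df0944815e47b588cb517e4b9e652bcf7/"
    "sampleproject-2.0.0-py3-none-any.whl"
)
# base   == "https://files.pythonhosted.org/packages/"
# target == "b8/f7/dd9223.../sampleproject-2.0.0-py3-none-any.whl"
```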

pradyunsg commented 4 years ago

I have no idea what that means, so I can't tell if it's important for pip. But we do rely on being able to parse the URL for a package we read from the index (to get the project name and version, and in the case of wheels the compatibility tags). If you're saying we can no longer do that, then this is a major issue.

(I think) This basically means that we'd have the hash included in the simple index page we look for (which will use information provided from the TUF metadata). Basically something like https://pypi.org/simple/PROJECT.HASH. See https://github.com/pypa/warehouse/issues/8487#issuecomment-683391138 for context.
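For illustration only (the real naming scheme is whatever PEP 458 and the Warehouse implementation specify; the layout below is invented), the idea is that the index page becomes content-addressed:

```python
import hashlib

def hashed_index_name(project: str, index_html: bytes) -> str:
    # Hypothetical layout matching the PROJECT.HASH shape above: suffix the
    # normalized project name with a blake2b digest of the page contents,
    # so the client can check that the page it fetched is exactly the one
    # the signed TUF metadata refers to.
    digest = hashlib.blake2b(index_html, digest_size=32).hexdigest()
    return f"{project}.{digest}"
```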

pradyunsg commented 4 years ago

Parallel downloads are not supported by TUF at the moment but I do not see why they could not be in the future.

TBH, that's not an issue: I think a much better area for improvement, once we have a TUF-protected PyPI, is to give ourselves a JSON API for dependency resolution that is also TUF-protected.

pradyunsg commented 4 years ago

I suggest you look through the tracker and make sure you are involved in discussions on any items that might be affected by this work. I'm a bit concerned that you didn't find these pieces of work already...

See https://github.com/pypa/pip/issues/7819, which has most of the relevant discussions. I think the first post there covers basically everything that we're thinking of doing?

jku commented 4 years ago

https://github.com/pypa/warehouse/issues/8254 is the Warehouse issue about serving wheel metadata separately. This implementation should just work with TUF verification (unlike partial downloads -- I don't see how those could be reasonably verified even to the standard that wheel files are verified right now)

dstufft commented 4 years ago

Partial downloads are basically unsupportable with signed data. There are some schemes to make it possible, but those schemes are all super fragile and annoying, and also really subtle in how you use them, which I wouldn't feel comfortable relying on.

Parallel downloads should be fine to implement, there's no security related concerns with that.

One big possible reason for wanting to keep control of the actual downloading in pip is to support better control of concurrency and concurrency primitives. If the TUF library is a black box that does I/O, it becomes difficult to, say, move pip to an async I/O paradigm (not that we currently have any plans to do that), or to ensure that all the pieces handling concurrency cooperate so we maintain certain levels of concurrency and don't exceed them. Other benefits are things like handling authentication, timeouts, etc. There is a lot of stuff there, and it feels like it might be better if TUF exposed APIs such that it didn't expect to "own" the downloading, but instead could communicate out "hey, fetch these URLs", and then pip could fetch them and pass the contents back in for verification (and of course, there might be multiple rounds of this to fully verify things).

That's a bit harder to implement, so maybe it's out of scope, but a setup like that allows a lot more control and reusability, I think. It still doesn't allow things like partial downloads (because a key requirement here is that no fetched content is trusted until it's been verified, and we can't verify until we have the entire file).
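To spell out the "entire file" point, a tiny sketch of whole-file verification (plain sha256 used purely for illustration; TUF metadata records hashes and lengths per target):

```python
import hashlib

def check(data: bytes, expected_sha256: str) -> None:
    """Full-file verification: only possible once every byte is present."""
    if hashlib.sha256(data).hexdigest() != expected_sha256:
        raise ValueError("hash mismatch, do not trust this content")

# A partial (range) download yields only a slice of the file, so the digest
# recorded in the signed metadata can never be recomputed and compared.
```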

dstufft commented 4 years ago

This is more on the TUF side of things, and honestly it's a non-trivial amount of work, so it might not be worth it, but what I described above is basically writing the library in the sans-io style. If there's any appetite for doing that, I have some experience with it, so I can advise if needed.
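To make the sans-io shape concrete, here is a rough sketch of what such an API could look like; every name in it is invented for illustration and this is not an existing TUF interface:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Iterator
import urllib.request

@dataclass
class FetchRequest:
    url: str          # what the verifier wants fetched next
    max_length: int   # upper bound on how many bytes to accept

class SansIOVerifier(ABC):
    """Hypothetical sans-io verifier: it never performs I/O itself, it only
    says what to fetch and checks the bytes handed back to it."""

    @abstractmethod
    def pending(self) -> Iterator[FetchRequest]:
        """Yield fetch requests (root, timestamp, snapshot, targets, ...)."""

    @abstractmethod
    def supply(self, url: str, data: bytes) -> None:
        """Verify signatures/hashes/lengths; raise if anything is wrong."""

def drive(verifier: SansIOVerifier) -> None:
    # pip keeps control of the HTTP layer here: progress reporting,
    # timeouts, authentication and concurrency all stay on pip's side.
    for request in verifier.pending():
        with urllib.request.urlopen(request.url) as response:
            data = response.read(request.max_length)
        verifier.supply(request.url, data)
```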

pfmoore commented 4 years ago

(I think) This basically means that we'd have the hash included in the simple index page we look for (which will use information provided from the TUF metadata). Basically something like https://pypi.org/simple/PROJECT.HASH. See pypa/warehouse#8487 (comment) for context.

Isn't that in violation of PEP 503, which says "Below the root URL is another URL for each individual project contained within a repository. The format of this URL is /<project>/ where the <project> is replaced by the normalized name for that project, so a project named "HolyGrail" would have a URL like /holygrail/"?
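For reference, the normalization that quoted passage relies on is defined by PEP 503 itself as a simple transformation:

```python
import re

def normalize(name: str) -> str:
    """PEP 503 project name normalization."""
    return re.sub(r"[-_.]+", "-", name).lower()

assert normalize("HolyGrail") == "holygrail"   # -> /simple/holygrail/
```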

dstufft commented 4 years ago

Those URLs still work; the implication here is that the TUF PEP updates the simple repository format to have additional URLs, if that repository opts into TUF.

pfmoore commented 4 years ago

So there's an update needed to PEP 503, then? Or is this covered in the TUF PEP(s) (which I haven't followed, so I've no idea what state they are in)?

At a minimum I'd expect there to be a statement somewhere that the PEP 503 URLs and the "other" ones MUST contain the same content... (Sorry, I'm putting my standards hat on now...)

dstufft commented 4 years ago

The other ones are snapshots of the 503 URLs, so they'll contain the same content... at some specific point in time, while the 503 URLs are basically the latest versions.

pfmoore commented 4 years ago

OK. So all covered in the TUF PEPs then I guess? (I have no idea about snapshot stuff).

My only real concern is the extent to which this is not transparent to the client, so pip maintainers need to know this stuff. Pip's source code already makes my head hurt, and it's only being able to check the standards that keeps me sane sometimes 🙁

(I'm not joking here - I think we have a real problem with pip's codebase becoming so complex that even the core maintainers struggle to follow it. We need to reduce complexity, not increase it).

dstufft commented 4 years ago

Yea it's covered in the TUF PEP, and ideally abstracted away in the TUF client.

jku commented 3 years ago

There is now a draft PR available: I'm hoping for high-level review and maybe opinions on the discussion items listed in the PR.

The PR is from the branch with vendored sources, so it's quite massive; there's another branch that only includes the code changes without the vendoring: https://github.com/pypa/pip/compare/master...jku:tuf-mvp

If you think I should make a PR with the code changes only, let me know: I did it this way to show that lint and unit tests pass (I'm not sure what the HTTPTooManyRequests errors in some of the integration tests are).

pradyunsg commented 3 years ago

(I'm not sure what the HTTPTooManyRequests errors in some of the integration tests are).

#9030

di commented 2 years ago

FYI, we now have a roadmap for PEP 458 for PyPI: https://github.com/pypa/warehouse/issues/10672

pradyunsg commented 1 year ago

I've labelled this as blocked, since this needs the entire checklist on https://github.com/pypi/warehouse/issues/10672 to be completed first.