pypa / pip

The Python package installer
https://pip.pypa.io/
MIT License

Error when multiple repositories provide the same package #11784

Open dstufft opened 1 year ago

dstufft commented 1 year ago

What's the problem this feature will solve?

There's a long-standing class of attacks typically called "dependency confusion" attacks, which roughly boil down to an individual expecting to get package A but getting package B instead. In Python, this almost always happens because the end user has configured multiple repositories: they expect package A to come from repository X, but someone is able to publish something named package A on repository Y as well.

Traditionally this takes the form of someone having a private repository for internal packages only, while also using PyPI as a fallback for anything that comes from the wider ecosystem. Then someone comes along, registers one or more of those internal package names on PyPI, and publishes their own code under them. Because pip effectively "merges" these two repositories, it views them both as equally authoritative on package A.

Describe the solution you'd like

A key thing to notice here is that dependency confusion depends on project A being expected to come from repository X but actually coming from repository Y, which almost always means that pip sees A coming from both X and Y.

Thus, I suggest we "solve" dependency confusion attacks by having pip, prior to doing any other filtering (for wheel compatibility, etc.), determine whether the collected links for a particular project come from a single repository or from multiple repositories. If it discovers links from multiple repositories, it would generate an error and refuse to proceed.

Note: It may make sense to de-duplicate the URLs in cases where the URLs have a hash and the hashes match between multiple repositories, so that cases where the exact same files exist on all repositories are still OK; the error would apply only when the repositories have different files.

We may also want some way to indicate that a particular package should opt out of this, or to target a specific repository for a specific package, but maybe we don't? I can think of a few options:
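To make the proposed check concrete, here is a minimal sketch of one possible reading of it, including the hash-based de-duplication from the note above. Everything in it is hypothetical: `CandidateLink`, `MultipleRepositoriesError`, and `check_single_repository` are illustrative names, not pip internals, and the rule "ignore files that exist with the same hash on every repository" is only one interpretation of the note.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class CandidateLink:
    """Hypothetical stand-in for a collected link (not pip's own Link class)."""
    repository: str               # index URL the link was collected from
    filename: str
    sha256: Optional[str] = None  # hash advertised by the repository, if any


class MultipleRepositoriesError(Exception):
    """Raised when one project is served by more than one repository."""


def check_single_repository(project: str, links: Iterable[CandidateLink]) -> None:
    links = list(links)
    all_repos = {link.repository for link in links}

    # Record which repositories serve each (filename, hash) pair.
    # Links without a hash can never be proven identical, so they are skipped here.
    repos_by_file = defaultdict(set)
    for link in links:
        if link.sha256 is not None:
            repos_by_file[(link.filename, link.sha256)].add(link.repository)

    # Assumption: a file is harmless if every repository serves the same bytes.
    shared = {key for key, repos in repos_by_file.items() if repos == all_repos}

    remaining_repos = {
        link.repository
        for link in links
        if (link.filename, link.sha256) not in shared
    }
    if len(remaining_repos) > 1:
        raise MultipleRepositoriesError(
            f"{project!r} is provided by multiple repositories: "
            + ", ".join(sorted(remaining_repos))
        )
```

Note that the check runs on the full set of collected links, before any compatibility filtering, which matches the ordering suggested above.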

  1. Don't do anything, tell people that for safety they'll need an index server that provides options to combine multiple repositories in a safe way.
  2. Put the check for files coming from both repositories after filtering out files that don't match our given hashes in hash checking mode, so that people who want to handle this without another index can switch to hash checking instead.
  3. Provide an option to map a specific package to a specific repository.
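As a rough illustration of option 3, the sketch below keeps only the links from the repository a project has been mapped to. The mapping format, the `select_links` helper, and the `(repository, filename)` tuples are all assumptions made for this example; no such option exists in pip as described here. The only borrowed detail is the project-name normalization rule from PEP 503.

```python
import re
from typing import Dict, List, Tuple

# (repository URL, filename); a hypothetical, simplified link representation.
Link = Tuple[str, str]


def normalize(name: str) -> str:
    """Normalize a project name as described in PEP 503."""
    return re.sub(r"[-_.]+", "-", name).lower()


def select_links(project: str, links: List[Link], pinned: Dict[str, str]) -> List[Link]:
    """Drop links that do not come from the repository pinned for `project`.

    Projects without an entry in `pinned` are returned unchanged, so they
    would still be subject to whatever multi-repository check applies.
    """
    pinned_normalized = {normalize(name): repo for name, repo in pinned.items()}
    repo = pinned_normalized.get(normalize(project))
    if repo is None:
        return list(links)
    return [link for link in links if link[0] == repo]
```

With a mapping such as `{"internal-pkg": "https://internal.example/simple"}`, a same-named upload on another repository would simply never be considered for `internal-pkg`.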

Obviously we would need to phase this in over time, presumably by having it generate warnings at first that you can upgrade to errors, then errors that you can downgrade to warnings, then finally only errors (sans any choices we pick to allow people to select the repository they want to use for a specific package).
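The phase-in described above could be modeled roughly as follows. This is only a sketch: the `Phase` enum, the `user_override` values, and `report_conflict` are hypothetical names, and how a user would actually select an override (flag, config, environment variable) is deliberately left open.

```python
import warnings
from enum import Enum
from typing import Optional


class Phase(Enum):
    WARN_BY_DEFAULT = 1   # warnings, which users can upgrade to errors
    ERROR_BY_DEFAULT = 2  # errors, which users can downgrade to warnings
    ERROR_ONLY = 3        # errors, with no downgrade available


class MultipleRepositoriesError(Exception):
    """Raised when one project is served by more than one repository."""


def report_conflict(project: str, phase: Phase, user_override: Optional[str] = None) -> None:
    """Warn or raise for a multi-repository conflict.

    `user_override` is None, "warn", or "error" (hypothetical values).
    """
    if phase is Phase.ERROR_ONLY:
        strict = True  # the override is ignored in the final phase
    elif user_override is not None:
        strict = user_override == "error"
    else:
        strict = phase is Phase.ERROR_BY_DEFAULT

    message = f"{project!r} is provided by multiple repositories"
    if strict:
        raise MultipleRepositoriesError(message)
    warnings.warn(message, UserWarning)
```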

Alternative Solutions

  1. Push people to use custom repositories - This solution generally works, but the dependency confusion attack can occur anytime the end user has multiple repositories, so you would need a custom repository for each unique set of repositories (which basically boils down to each user needing to run their own) which is a lot of wasted effort. The more damning thing for this idea is that it's opt in, which means we can't protect users by default, so most people would remain unprotected.
  2. Push people to use hash checking mode - This basically has the same answer as (1), it's opt in so it leaves most people unprotected.
  3. Just implement the mechanism to map specific packages to specific repositories - Again, this is an opt in thing, so it leaves most people unprotected.

Additional context

This idea came out of the "Proposal: Preventing dependency confusion attacks with the map file" thread on discuss.p.o, where a lot of the discussion covered different strategies that someone could use to protect themselves from dependency confusion attacks. In that discussion, it occurred to me that all of these strategies require the end user to opt in to the protection, but ideally we want something that can happen by default. Thus it dawned on me that the core problem comes from pip effectively merging two repositories... so pip could just not do that.

dstufft commented 1 year ago

I'm posting this here and to discuss.p.o just so nobody misses it.

It's been about 10 days since I posted my proposal, and other than a few questions I haven't seen anyone raise an objection to the overall idea. Previously, folks had seemed on board with the idea (the longer proposal was designed to make sure everyone was on the same page and to make it easier for people to jump in without having to read both threads in their entirety).

Given nobody has objected, I'm going to take that as a sign that it's worthwhile to take this to a PEP, so I'll go ahead and start working on that. I plan to focus that PEP on the changes to the repository protocol and what their implications are for installers. I will likely include non-normative recommendations that give installers some high-level guidance for matching the rough behavior in the proposal, but I won't spell out specific UX for installers.

matteius commented 1 year ago

I am a big fan of:

Provide an option to map a specific package to a specific repository.

which is essentially what the index_lookup patch that I provided in that other PR does (I have since extended it slightly within pipenv). Essentially, I want the ability, in both the resolve phase and the install phase, to pull packages from specific indexes. I believe extending it to requirements.txt, where each package could supply its own index line that is then used to build up the index_lookup passed into search_scopes, could solve this problem and make the code from the referenced PR actually usable through pip's public interface, which, if I recall, was the primary objection when I first opened that change.

Sorry, I'll have to do more reading to catch up on everyone's position here, but that is my two cents I wanted to share since pipenv is currently using that patch, and we want to get to a point where we aren't patching pip.