Error when multiple repositories provide the same package #11784

dstufft opened 1 year ago

dstufft commented 1 year ago

What's the problem this feature will solve?

There's a long-standing class of attacks, typically called "dependency confusion" attacks, which roughly boil down to: an individual expects to get package A, but instead gets package B. In Python this almost always happens because the end user has configured multiple repositories: they expect package A to come from repository X, but someone is able to publish something named package A at repository Y as well.

Traditionally this takes the form of someone running a private repository for internal-only packages while also using PyPI as a fallback for anything that comes from the wider ecosystem. Someone then comes along, registers one or more of those internal package names on PyPI, and publishes their own code under them. pip effectively "merges" the two repositories and views both as equally authoritative on package A.
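For a concrete illustration (the index URL and package name here are hypothetical), consider a team that installs an internal package with:

pip install --index-url https://pypi.internal.example/simple/ --extra-index-url https://pypi.org/simple/ internal-utils

If an attacker registers internal-utils on PyPI and uploads it with a higher version number, pip's merged view of the two repositories will typically prefer the attacker's files.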

Describe the solution you'd like

A key thing to notice here is that dependency confusion depends on project A being expected to come from repository X but really coming from repository Y, which almost always means that pip sees A coming from both X and Y.

Thus, I suggest we "solve" dependency confusion attacks by having pip, prior to doing any other filtering (like for wheel compatibility, etc.), determine whether the collected links for a particular project come from a single repository or from multiple, and if it discovers links from multiple repositories, generate an error and refuse to proceed.

Note: It may make sense to de-duplicate the URLs in cases where the URLs have a hash and the hashes match between multiple repositories, so that projects whose exact same files exist on all repositories are still OK; it's only cases where the repositories have different files that error.
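To make that concrete, here is a minimal sketch of the proposed check. The names, the exception type, and the assumption that each collected link already carries its artifact hash and originating repository are all illustrative, not pip internals:

class MultipleRepositoriesError(Exception):
    pass

def check_single_origin(project, links):
    """links: iterable of (file_hash, repository_url) pairs for `project`."""
    repos_by_hash = {}
    for file_hash, repo in links:
        repos_by_hash.setdefault(file_hash, set()).add(repo)

    all_repos = set().union(*repos_by_hash.values()) if repos_by_hash else set()
    if len(all_repos) <= 1:
        return  # zero or one origin: nothing to check

    # De-duplication rule: if the exact same files (by hash) exist on every
    # repository, the repositories are effectively mirrors and still OK.
    if all(repos == all_repos for repos in repos_by_hash.values()):
        return

    raise MultipleRepositoriesError(
        f"{project} has files from multiple repositories: {sorted(all_repos)}"
    )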

We may also want some way to indicate that a particular package should opt out of this, or to target a specific repository for a specific package, but maybe we don't? I can think of a few options:

  1. Don't do anything, tell people that for safety they'll need an index server that provides options to combine multiple repositories in a safe way.
  2. Put the check for files coming from both repositories after filtering out files that don't match our given hashes in hash checking mode; thus people who want to handle this without another index can switch to hash checking instead.
  3. Provide an option to map a specific package to a specific repository.

Obviously we would need to phase this in over time, presumably by having it generate warnings at first that you can upgrade to errors, then errors that you can downgrade to warnings, then finally only errors (sans any choices we pick to allow people to select the repository they want to use for a specific package).

Alternative Solutions

  1. Push people to use custom repositories - This solution generally works, but the dependency confusion attack can occur anytime the end user has multiple repositories, so you would need a custom repository for each unique set of repositories (which basically boils down to each user needing to run their own) which is a lot of wasted effort. The more damning thing for this idea is that it's opt in, which means we can't protect users by default, so most people would remain unprotected.
  2. Push people to use hash checking mode - This basically has the same answer as (1), it's opt in so it leaves most people unprotected.
  3. Just implement the mechanism to map specific packages to specific repository - Again, this is an opt in thing, so it leaves most people unprotected.

Additional context

This idea came out of the "Proposal: Preventing dependency confusion attacks with the map file" thread on discuss.python.org, where a lot of discussion happened about different strategies someone could use to protect themselves from dependency confusion attacks. In that discussion it occurred to me that all of these strategies require the end user to opt in to the protection, but ideally we want something that happens by default. It dawned on me that the core problem comes from pip effectively merging two repositories... so pip could just not do that.

zooba commented 1 year ago

An error if any names match would be an adequate (if annoying) defence, though you can't take versions into account for that.

It might result in the attack being labelled a DoS, since an attacker could cause errors in their target's build/deployment process by publishing a conflicting name, but I'd take that over something actually exploitable.

Implementing this probably gets most of the way to some kind of index prioritisation (i.e. rather than dumping all entries, dump all except for a certain index) or constraint (for a given package name, dump all links except for what the constraints file says). But this improvement is definitely a step forward regardless.

pradyunsg commented 1 year ago

FWIW, this would be a significant breaking change.

For example, this wouldn't work for well-known workflows like pytorch's --extra-index-url https://download.pytorch.org/whl/cu116.

pfmoore commented 1 year ago

I imagine that --find-links is a common case where you get multiple sources. Someone does pip wheel -w wheelhouse foo, to build foo, then pip install --find-links wheelhouse bar (where bar depends on foo). The finder works in terms of "link collectors", and it's not immediately obvious how to distinguish between --find-links sources and actual indexes.

dstufft commented 1 year ago

FWIW, this would be a significant breaking change.

Sure, but I don't think there is a way to protect against these types of attacks by default without it being a breaking change. For instance, the same thing was true when we first started validating and then requiring HTTPS for repositories. In my opinion, it's just something to work through to figure out the best path forward.

For example, this wouldn't work for well-known workflows like pytorch's --extra-index-url https://download.pytorch.org/whl/cu116.

There's nothing inherent to their workflow that makes this unworkable. It just means that whatever the solution is for safely opting into multiple repositories, their users would also need to use it. That could be something like proxying all of PyPI to "flatten" the repository so their users don't see multiple repositories any more, or it could be some additional configuration.

Not requiring any changes for pytorch is basically saying that we'll never fix dependency confusion attacks, at least by default, because there's no functional difference between what pytorch is doing for their own wheels, and what pytorch did for torchtriton. You can't fix one and not break the other.

I imagine that --find-links is a common case where you get multiple sources. Someone does pip wheel -w wheelhouse foo, to build foo, then pip install --find-links wheelhouse bar (where bar depends on foo). The finder works in terms of "link collectors", and it's not immediately obvious how to distinguish between --find-links sources and actual indexes.

I could imagine not paying attention to local file paths when looking for multiple sources; local file paths feel sufficiently "special" that it doesn't open up a ton of interesting attacks.

dstufft commented 1 year ago

Some related issues: https://github.com/pypa/pip/issues/9715, https://github.com/pypa/pip/issues/8606, https://github.com/pypa/pip/issues/9610, https://github.com/pypa/pip/pull/10964

#9715 is interesting in that it's basically the opposite of this: it doesn't allow you to have multiple repositories unless all repositories are a strict subset of the primary repository. In other words, it enforces the idea of a primary repository plus supplementary repositories, whereas this idea enforces the idea that all repositories are on equal footing with each other [^1]. While there seemed to be some positive opinions of #9715, I think it's actually the somewhat worse solution:

  1. It assumes that every project needs to be registered on PyPI for its protections, but not every project wants to be registered on PyPI, for a variety of reasons. These types of names are only quasi-allowed on PyPI, since they would just be "name squatting", which PEP 541 disallows (though we do sometimes allow it, since it's currently one of our only tools to protect against dependency confusion attacks or trademark issues).
  2. It doesn't actually work in practice, because pip only sees files, not registered projects, so a project that exists just to squat a name wouldn't show up as existing to pip.
  3. It will silently work for people while they're actually still insecure. Nothing is stopping me from using an extra index URL that just accidentally works because some project was registered years ago and released some files. In fact #9715 would call that "safe", while the person who owned that project had no idea that their registration of that name was keeping some downstream user safe. They may potentially come back around and start releasing new versions, hand the project off to someone else, or even delete it completely, making people vulnerable again.
    • This speaks to the fundamental underlying problem with the security that #9715 supposedly provides: it makes the same blanket assumption that causes dependency confusion to begin with, that two projects on different repositories must somehow be related if they share the same name. It does improve things by making it less likely that people are using a random unclaimed name (but still does nothing to ensure that the claimed names are equal to each other).

It does have the nice bonus that it doesn't break situations like piwheels and pytorch, but it does mean that every person with a partial, private index is broken. That would be fine, IMO, if it were actually a secure solution, but (3) to me feels like it disqualifies it as secure completely.


As far as I can tell, the types of proposals for handling this basically boil down to:

  1. Provide some configuration that lets people optionally specify in some fashion [^2] which repository packages should come from.
    • This has the problem that every single one of these options is opt-in, rather than opt-out. So it adds complexity but doesn't actually solve the problem except for people who already know it's a problem and are looking to take action to prevent it.
  2. Remove --extra-index-url and --find-links.
    • Since the entire problem ends up coming from the fact that multiple repositories may have the same project, but owned by different people (or not owned at all until an attacker comes along), removing the ability to have multiple repositories side steps the entire issue.
    • This forces everyone who needs or wants multiple repositories to somehow create a single unified repository to use, and sharing these unified repositories can be difficult because different end users may want a different collection of repositories unified for their use.
  3. Change --extra-index-url so that it is only valid for projects that also exist in the --index-url.
    • This effectively redefines --extra-index-url such that it can't be used for anything but providing additional files for projects that exist in the --index-url.
    • This does not actually protect against dependency confusion attacks in the general case, since it just encodes the current (wrong) behavior that assumes that because the name is taken in --index-url, then it must be owned by the same people in --extra-index-url.
    • It does make dependency confusion attacks harder to pull off, since attackers won't be able to find random unused names on PyPI that someone is using internally.
  4. Change --extra-index-url so that it is only valid for projects that don't exist in --index-url, unless otherwise configured.
    • This effectively redefines --extra-index-url so that, by default, it is only valid for projects not in --index-url (roughly sketched after this list).
    • Protects against dependency confusion attacks, by default, in almost all cases [^3].
    • Has a possibility that someone registering a name on PyPI could make your builds randomly start failing.

Personally, I think that we should eliminate anything that doesn't provide security against dependency confusion by default. Users who are aware of dependency confusion attacks and able to mitigate them already have tools available to them, such as repository proxies. The users I think we should be most worried about are those who are not as aware of, or capable of defending against, these types of attacks themselves.

In my opinion, we can take (1) off the table as something that will solve this problem by itself (maybe it can be part of the solution). All of the proposed options I've seen for trying to solve this through configuration alone have either been overly complex or don't actually solve the problem once you get to 3+ repositories instead of 2. These all feel like advanced-user / edge-case solutions, which can already fairly easily be handled by a repository proxy.

I think we can also take (3) off the table, because it doesn't actually solve the problem IMO. It just leans into the underlying assumption that causes this problem in the first place (that a name is the same across multiple repositories), but does so in a way that doesn't ensure those names are the same or will continue to be the same.

That leaves us with (2) and (4), and after spending a fair amount of time thinking about this, I think that these two choices are the only two real choices unless someone comes up with something I'm not thinking of.

Removing --extra-index-url and --find-links definitely fixes the problem from pip's point of view, but I think that the amount of breakage would be far too high. It breaks everything that (4) would break, and then some. It also doesn't really fix the problem from the user's point of view; it just pushes the problem onto some other system, which has all the same options that pip does to solve it [^4].

That leaves me with just erroring when the same project comes from multiple repositories.

Projects like piwheels will likely need to modify themselves so that they serve as a proxy for all of PyPI rather than as a supplemental repository (they wouldn't have to host the files themselves, just the repository API) and ask people to specify them using --index-url instead.

Things like pytorch will likely need to do the same, although maybe there are additional options we could explore to let a project on one repository prove that it's the same name on multiple repositories [^5] which would let that particular case of supplemental repository still work.

In situations where there are two conflicting packages in multiple repositories, you would revert back to (1), but since it "fails closed", users who don't have conflicting packages are protected without needing to do anything, and users who do have conflicting packages only have to worry about the packages that actually conflict, not every package that might ever conflict.

Everyone else should just work and be more secure.

[^1]: This matches how pip actually treats them; the only reason --index-url and --extra-index-url are distinct options in the CLI is that pip needed different behavior for additive vs. replace when specified.

[^2]: Specifics don't matter; it could be repository prioritization, a mapping file, whatever.

[^3]: If you expect Project A to come from Repository X, but it has 0 files for it and Repository Y has files, then you are still vulnerable.

[^4]: Though maybe it has a better interface for defining repositories than pip does; part of the complexity is that we're trying to boil it down to just strings being passed around on the CLI.

[^5]: Not a fully fleshed out idea, but you could imagine some metadata in the repository API that lets a project publish a list of all of the repositories it controls, and then, if all of the repositories agree on the same list, treat those listed repositories as a non-error case. Another idea could be some sort of public key that a project signs something with on every repository; as long as the same public keys are used in all repositories, they're treated as equivalent.

dstufft commented 1 year ago

Projects like piwheels will likely need to modify themselves so that they serve as a proxy for all of PyPI rather than as a supplemental repository (they wouldn't have to host the files themselves, just the repository API) and ask people to specify them using --index-url instead.

Just as a note: depending on how we implement the configuration aspect of this, it's possible that piwheels could remain how it is today, just with some extra configuration to enable them to serve those otherwise conflicting packages.

Things like pytorch will likely need to do the same, although maybe there are additional options we could explore to let a project on one repository prove that it's the same name on multiple repositories, which would let that particular case of supplemental repository still work.

An interesting thing: this idea of letting a project link its multiple disparate repositories together in a way ends up taking some inspiration from #9715, except that instead of assuming that the same name on two distinct repositories is equivalent, we require the repositories to explicitly state that they are.

trishankatdatadog commented 1 year ago

While I agree with your conclusion that erroring out[^1] when a project is provided by multiple repositories is a good default safeguard, and that (1) is not the simplest solution by itself, I am still not convinced that network proxies can correctly replace (1) for the reasons described here, although perhaps I missed something. In any case, this fail-fast heuristic will go a long way.

[^1]: Let's call it the fail-fast heuristic.

pfmoore commented 1 year ago

Option 5. Attack the actual root of the problem here and re-think what it actually means for two projects to be distinct. If name doesn’t do it, what else do we need? Source is wrong, as PyTorch and piwheels demonstrate. Do we need a global naming scheme?

I don’t think option 5 is actually practical, as it involves far too much upheaval, but I think it is worth remembering that dependency confusion is simply one consequence of the idea that project name uniquely identifies a project globally.

zooba commented 1 year ago

Option 6. Add the notion of a "final fallback index" (probably PyPI), which is only used[^1] if the explicitly listed repositories can't provide any links at all for a given requirement.

The Pytorch issue is completely unavoidable, unfortunately. The best way would have been for Pytorch to mirror all of their dependencies onto their own index and tell users not to reference PyPI at all; more generally, manual curation is the only way to deal with (effectively) typo-squatting.

The core "feature" of dependency confusion is that the targeted package does actually exist on the requested index, and has likely worked successfully in the past (before the attacker created it on PyPI). Our default $work policy is to use a single index proxy (where admins control the fallback policy), but the more nuanced version is to only reference indexes where the namespaces are "trustworthy" - we "know" that nobody with access to the other indexes is going to publish a name conflict.

So explicitly marking one index as "only if no other index provides the package" would suffice for the common case (only one public-access index).
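A minimal sketch of that option (6) rule, assuming a hypothetical fetch(index, project) helper that returns the links an index serves for a project:

def collect_links(project, explicit_indexes, fallback_index, fetch):
    # fetch(index, project) is a hypothetical helper, e.g. one that reads the
    # index's Simple API page for the project and returns its links.
    links = [link for index in explicit_indexes for link in fetch(index, project)]
    if links:
        return links
    # Only now is the index-of-last-resort queried at all, which also keeps
    # private package names out of its HTTP logs (see the footnote below).
    return fetch(fallback_index, project)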

[^1]: Ideally, only queried, to avoid putting names of private packages into PyPI's HTTP logs. But those are sufficiently controlled that I don't think anyone is really that concerned. I have seen people raise concerns about other index providers though.

dstufft commented 1 year ago

Option 5. Attack the actual root of the problem here and re-think what it actually means for two projects to be distinct. If name doesn’t do it, what else do we need? Source is wrong, as PyTorch and piwheels demonstrate. Do we need a global naming scheme?

Yeah, I was restricting myself to things that don't require completely changing how we refer to packages, but if we managed to do something like Go, where a full URL represents a package, that would also fix it. But it would also likely render the piwheels case impossible.

Option 6. Add the notion of a "final fallback index" (probably PyPI), which is only used if the explicitly listed repositories can't provide any links at all for a given requirement.

This is just a specific instance of (1), which is essentially providing some configuration that lets people control what repository a specific package comes from, in this case implicitly through a "fallback index". Unfortunately this only works if you have 2 repositories. If you have 3+ configured, a single fallback isn't enough; you instead need to provide the full ordering of preference for all the repositories, and then pip needs to change its collecting of links so that it doesn't "fall through" to lower-priority repositories.

A consequence of this design is that it assumes the priority of repositories is the same for every package and, more fundamentally, that the only repository that can introduce a dependency confusion is the final one. Certainly PyPI's open registration of package names makes it easier, and most repositories have more control over who can register what name, but it feels kind of... short-sighted to assume that PyPI is the only repository that would ever allow open registration like that, which the fallback option implicitly assumes.

zooba commented 1 year ago

Unfortunately this only works if you have 2 repositories. If you have 3+ configured, a single fallback isn't enough; you instead need to provide the full ordering of preference for all the repositories, and then pip needs to change its collecting of links so that it doesn't "fall through" to lower-priority repositories.

True, but I think we could also argue that if you need full ordering among more than two repositories, you need a more customisable proxy index. Whereas "all but one index will be pooled, and if that can't provide any links then the index-of-last-resort will be used" is going to be sufficient for many many cases.

Referring to multiple open-registration repositories at the same time is always a risky idea, even with prioritisation. You really do need curation when multiple indexes are involved.

dstufft commented 1 year ago

Right, but the idea in this thread is safe even in the face of multiple open-registration repositories. There's no need to sit there and think hard about whether a particular scenario is protected or not; as long as the real package still exists, it's impossible to execute a dependency confusion attack.

pfmoore commented 1 year ago

Whereas "all but one index will be pooled, and if that can't provide any links then the index-of-last-resort will be used" is going to be sufficient for many many cases.

It feels like there's a disconnect between "we need to provide better tools for people to address dependency confusion attacks" and "we need to make dependency confusion attacks go away for users who aren't aware of them, and so won't address them for themselves". I'm not sure we can do both (except in the simplistic sense that "not having to worry about simple cases" is sort of a better tool for the experts...)

Personally, I'm not as inclined to worry about the complex cases, as they tend to be encountered by people who have complex infrastructures, and if such people are hoping that open source volunteers will fix their problems so they don't have to, I have little sympathy. I'd rather help the people who don't have an infrastructure of their own, and don't have a budget or a security expert of their own.

Right, but the idea in this thread is safe even in the face of multiple open registration repositories

Well, maybe, but as @pradyunsg pointed out, it's a major breaking change, and would hit people using pytorch (and again, probably hit smaller, independent users harder than big organisations). Without the breakage, I doubt we'd still be discussing this, we'd have just implemented it.

(I spent 5 minutes trying to express a vague feeling that "I don't like the fact that when security's involved, not breaking things gets downgraded in importance". I gave up and that single comment is where I'll leave it).

zooba commented 1 year ago

Right, but the idea in this thread is safe even in the face of multiple open registration repositories.

That's a very "security engineer" definition of safe 😄 Right up there with disconnecting the power.

It's safe, but in so many circumstances it's going to be randomly unusable (because someone added their package to the other open repository), and users will end up with guidance to avoid it all, which I predict will be the same guidance as we'd give even without the idea you proposed above.

Given we're so far into dependency confusion being a known thing that we're discussing this in public without fear of attackers taking advantage, not-breaking-things gets to remain a priority. It would be different if we were under embargo and trying to shut down active attacks, but we're well past that now.

dstufft commented 1 year ago

I don't like the fact that when security's involved, not breaking things gets downgraded in importance

Well, the thing is that securing something needs to break things, right? At a minimum you need to break the workflow that makes the attack possible, and hopefully you can limit that so it only breaks the attacking workflow, but often you cannot, because Hyrum's law is real. So every solution to this breaks some subset of users; the difference is that this idea breaks some subset of users while also fully solving the problem, rather than breaking users for a half measure [^2].

If you declare that security fixes cannot break things, then you're declaring that you cannot do security fixes.

Well, maybe, but as @pradyunsg pointed out, it's a major breaking change, and would hit people using pytorch (and again, probably hit smaller, independent users harder than big organisations). Without the breakage, I doubt we'd still be discussing this, we'd have just implemented it.

Unfortunately, the pytorch case is exactly the same as the torchtriton case at a fundamental level; there is currently no way to differentiate the "valid" pytorch case from the "invalid" torchtriton case. So as it stands today, you can't protect against the latter without breaking the former; we would need to figure out some way to differentiate between those two cases.

Up thread I posted a rough idea as to how we could enable pytorch to differentiate itself from the torchtriton case. It would require some changes to the repository API, but they're pretty simple changes. It's not worthwhile to add those changes unless there's agreement on the overall idea since in isolation they're pretty pointless.

The key thing that makes pytorch OK, and torchtriton not, is that torchtriton on PyPI wasn't published by the same people as torchtriton on their private repository, while pytorch is.

That idea is basically to encode a list of "alternate repositories" in the repository API response, and allow files to come from multiple repositories in the case that all repositories agree on those alternate repositories [^1]. This would basically look like:

import requests

# NOTE: "alternate_repos" is a hypothetical key; no repository API serves it
# today. This sketch assumes repositories would add it to their PEP 691 JSON
# responses.
alternate_repos = None
urls = ["https://pypi.org/simple/torch/", "https://download.pytorch.org/whl/torch/"]
for url in urls:
    # Request the JSON form of the Simple API rather than the default HTML.
    resp = requests.get(url, headers={"Accept": "application/vnd.pypi.simple.v1+json"})
    resp.raise_for_status()

    # Keep only the alternate repositories that every repository agrees on.
    repos = resp.json().get("alternate_repos", [])
    if alternate_repos is None:
        alternate_repos = set(repos)
    else:
        alternate_repos.intersection_update(repos)

# At this point, if "torch" comes from multiple repositories we would normally error
# but because there are alternate repositories, we look and see if all of our locations
# that torch came from are agreed upon as alternate repo locations, and if they are
# we allow torch to come from any of those, but if torch is found in any repository
# that isn't listed as an alternate repository, then we revert back to generating an
# error.

That gives us all the protection of this idea, while giving situations like torch the ability to not break.

This works because ultimately the problem comes down to when a project comes from multiple repositories, we possibly get into a situation where there are disparate "owners" for a particular name on different repositories. We currently have no way to link those together, but the alt repos idea would let us link those together.

The case it would still break is the piwheels case. Unfortunately, I don't think it's possible to fix the piwheels case, because I cannot think of a way to distinguish a piwheels wheel from an attacker-provided wheel. In both cases you have the dependency coming from multiple repositories, owned by different people. I think the appropriate way for piwheels to function isn't as a supplemental repository but as a repository proxy, proxying most files to PyPI (again, it can even link directly to the files from PyPI) while mixing in their own provided wheels alongside the files provided by PyPI, and having their users configure that as --index-url rather than --extra-index-url.

[^1]: This DOES NOT change where pip will find those files; it just allows a project like pytorch to mark that it also controls this other repository, so that if pip is configured to fetch from there, it knows those two repository locations are equal to each other.

[^2]: It's a little frustrating that folks were fine in #9715 (even being willing to give it an expedited release!) with a change that was just as breaking, but for a different set of users, while also being an incomplete solution to the problem, but are pushing back against this idea as being too breaking.

zooba commented 1 year ago
It's a little frustrating that folks were fine in #9715 ("Only pull in files from --extra-index-url if the package exists on --index-url") (even being willing to give it an expedited release!) with a change that was just as breaking, but for a different set of users, while also being an incomplete solution to the problem, but are pushing back against this idea as being too breaking.

In my defense, I'd have been pushing back just as hard against that proposal if I'd been aware of it :) The advantage/risk of posting links on Discourse where I'm likely to notice it!

pfmoore commented 1 year ago

In my view:

  • Security fixes can break people trying to exploit security loopholes.
  • Security fixes have some justification for breaking people doing unsupported/undocumented things (it's not a blanket approval, just a recognition that there are priorities involved, and TBH I'm fairly OK with breaking undocumented usages in general).
  • Security fixes don't have carte blanche to break people doing documented, supported things, any more than any other bugfix or change does.

The severity of the breach, the frequency of exploits happening, and the amount of breakage can all alter these equations. There's a reason I described what I have as a "vague feeling" 🙂

Regarding torch vs torchtriton, has anyone asked the pytorch people what their view is on the situation (and the impact of this proposal on them)? It seems like if we're using that issue as the impetus for this change, it's only polite to ask them if they like what we're proposing to do in their name...

Also, I don't think you can decouple the breakage from the benefits. Yes, the original solution here might solve the problem completely, but the cost is high. Solving the problem for the majority, at a lower cost, but on the basis that the minority will have to do some of the work of fixing the issue for their specific case, seems to me to be a perfectly valid alternative approach that should be discussed.

And just to be clear here, I don't have a particular "preferred solution". I fully expect that whatever solution gets implemented, the only direct impact on me will be how many "pip broke my workflow" issues I have to read and deal with. And while I'm trying to keep the bigger picture clear in my mind, I'd be lying if I said that wasn't an important influence on my views...

Maybe another part of this discussion should be how we publicise whatever we decide, and prepare users for any breakage? Having a better message than the often-derided "we're doing this because it's good for you" one that people often get, might help with acceptance of the inevitable breakage (at whatever level we decide is justifiable).

pradyunsg commented 1 year ago

It seems like if we're using that issue as the impetus for this change, it's only polite to ask them if they like what we're proposing to do in their name...

I think it's more of a handy + recent + high-profile example of it, rather than the rationale/impetus behind the change.

Besides that, +1 to everything Paul said. :)

dstufft commented 1 year ago

Security fixes can break people trying to exploit security loopholes.

This, I guess, is where we differ. I don't think there generally is such a thing as a "security loophole" that exists distinct from all other workflows. There are just workflows, some of which have properties that attackers can abuse, but those workflows almost always have non-attackers using them too, and it's almost always impossible to change the properties that attackers can abuse without also affecting the non-attackers using them.

That doesn't mean it's not worthwhile to change those properties; sometimes it is, sometimes it's not, depending on how bad the particular scenario is. Security always gets a bad rap, though, because in the "happy" case security does ~nothing visible to end users. It doesn't add some exciting new feature that unlocks something that previously wasn't possible; it always exists to narrow down what's possible to just the cases that are safe.

Yes, the original solution here might solve the problem completely, but the cost is high. Solving the problem for the majority, at a lower cost, but on the basis that the minority will have to do some of the work of fixing the issue for their specific case, seems to me to be a perfectly valid alternative approach that should be discussed.

I don't think any of the other proposals, other than the ones that are wholly opt-in, have a meaningfully lower cost TBH. I think they just (largely) break different groups of people, OR they break enough people that they're still impactful while solving the problem poorly, in ways that are hard to explain to end users. Three repositories isn't particularly a huge edge case; you can get there with PyPI, piwheels, and torch. And AFAIK piwheels automatically ingests projects from PyPI, so I think anything that involves piwheels+PyPI needs to treat them both equally as repositories where anyone can claim any name they want.

I'm trying to mentally enumerate cases where people might run into names existing in multiple repositories today. Here's what I've come up with:

  1. People who are using some name from repository X, and unknown to them repository Y starts serving something under that name.
    • This doesn't have to be malicious! It could just be someone using the same name for an entirely unrelated project.
  2. Projects who are using the same name on multiple repositories (e.g. torch).
  3. People who are using some name from repository X, and have preemptively uploaded files to repository Y to "hold" the name so someone else can't use it.
  4. Downstream users publishing binary packages from repository X, onto repository Y.
  5. Users who publish their own version of some name from repository X, onto repository Y with some changes.
    • This is functionally the same as (4), but I'm calling it out independently because I think the ergonomics of it are different: these people are generally doing it for their own internal use and a limited number of projects, while (4) tends to be more for public consumption across a wide array of projects.
  6. The user has created a local wheelhouse, and is using that to supplement a public repository.
    • Again, this is basically the same as (4), but calling it out independently because the ergonomics of a local wheelhouse are different from a public wheelhouse.
  7. People who have a mirror of repository X at repository Y, but also have repository X still included as a fallback.

Can people think of other situations? If so I would be interested to add them to my list.

Going through my list:

  1. This will break by default, and is the explicit goal to break it. Given two unrelated projects being served by the same name, pip does not actually know which project the user meant, and should not guess in the face of this ambiguity, not just because of the security concern, but also just because there's a 50% chance pip just gets it wrong and confuses the user.
  2. This will break by default, but with the "alternate repos" idea, the people who own the project that is sharing itself on multiple repositories can unbreak it for everyone.
  3. This will break by default, but this behavior isn't really "supported" in the most common case of PyPI, and with this change they can just delete that file (and in most cases they can delete those files without deleting the project and unbreak themselves as well).
  4. This will break, and it does so because it's functionally no different than (1). There are two repositories that are not related, the owners of the projects on those two repositories are not related. We as humans know that they're related, but there's no way for pip to actually know that (e.g. there's nothing stopping piwheels from just serving malicious packages).
    • This currently doesn't have a solution to unbreak these other than saying that this case the supplemental repository should just become a full repository (even if that involves proxying or linking to the "real" one).
  5. This will break, again because it's functionally the same as (1), two different owners are publishing to the same name.
    • This currently doesn't have a solution to unbreak other than saying that the repository should just become a full repository (even if that involves proxying or linking to the "real" one).
  6. This won't break, because we can special case the local filesystem as not counting for the purposes of determining if it comes from multiple repositories or not (if an attacker has control over the local filesystem, the user is already compromised).
  7. This will break, again because it's the same as (1).

Reading through that, it doesn't feel that bad to me?

(1) is the goal here, so it's the "expected" breakage. (2) will break, but we can provide an option for the project owners to unbreak it for all their users (which is better than individual users needing to unbreak it), which in theory means that (2) is not "really" broken. (3) is similar to (2) in that it breaks, but there is an easy path for project owners to unbreak it for all their users. And (4), (5), and (7) break, with the current answer being to stop being supplemental repositories and become full repositories.

In writing out this list, another idea occurred to me: we could further reduce the breakage by taking the list of projects that have files coming from multiple repositories and filtering out any project that has the same set of files in all repositories (by filename and by hash) [^1]. That would reduce the breakage in (7) to just cases where the mirrors have drifted apart, though it maybe makes the mechanism too hard to explain to users.

I think the biggest thing that we might be missing is something that allows end users to resolve the (4), (5), and (7) cases, and (1) when it's not an attack. We have precedent for this with HTTPS, where we added the --trusted-host flag to say that the user knows the host doesn't have valid (or any) HTTPS, and they accept that risk.

There's a few ways that we could go about this.

One would be to just copy what we did for HTTPS and add some sort of --trusted-index flag (TODO: a better name; all indexes should be trusted) that excludes that entire repository from being considered when checking whether a project comes from multiple repositories. I'm not a huge fan of this one though, because while it would be nice for things like (4), (5), and (7), I think users are likely to reach for it in cases like (1), where it would do the wrong thing because it essentially reverts to the current behavior.

Another option is to just say that pip doesn't handle it: these cases need to use a repository that is set up such that it's either not supplemental, or that it safely combines multiple repositories. I also think this is not great, because it will cause a lot of user complaints, particularly since pip will have gone from "working" (but insecure) to "not working" (but secure), and telling users to set up infrastructure they never had to set up before is likely to cause angry users.

So I think the option we need to end up with is one where we give users the ability to define which repositories are valid for their projects (with the default being that all repositories are valid, but only one can serve a given project unless the alt-repos mechanism is in use). This could look something like the map file that the original thread suggested, though I suspect that's more complicated than I would like, but something along those lines. This is what I would suggest.

So to summarize all of that, my suggestion is roughly:

  1. Provide an "alt repo" mechanism that projects can use to express ownership over the same name across multiple repositories.
  2. Provide a way to map specific projects to specific repositories, defaulting to mapping them to all of them.
    • This would basically act as a "filter" in PackageFinder.
  3. When files for a project are discovered on multiple repositories (post filtering from (2)), error out unless one of the following holds (see the sketch after this list):
    • a. All of the repositories are owned by the same project via the "alt repo" mechanism.
    • b. The project is explicitly mentioned in the "mapping" mechanism (i.e. it's not relying on the default).
    • c. The list of files is exactly identical (filename and hash) between all repositories [^2].
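To make those rules concrete, here is a sketch of the combined check. All the data shapes are illustrative assumptions, not pip's actual internals:

def resolve_origins(project, files_by_repo, alternate_repos, mapped_projects):
    # files_by_repo: {repo_url: {(filename, file_hash), ...}} left after the
    # mapping filter from (2); alternate_repos: {repo_url: set-of-repo-urls}
    # from the hypothetical "alt repo" metadata; mapped_projects: projects
    # the user explicitly mapped via (2)'s mechanism.
    repos = set(files_by_repo)
    if len(repos) <= 1:
        return repos  # a single origin is always fine

    # (a) every repository claims every other one as an alternate of itself
    if all(repos - {repo} <= alternate_repos.get(repo, set()) for repo in repos):
        return repos

    # (b) the user explicitly said this project may span these repositories
    if project in mapped_projects:
        return repos

    # (c) every repository serves exactly the same files (filename and hash)
    file_sets = list(files_by_repo.values())
    if all(file_set == file_sets[0] for file_set in file_sets):
        return repos

    raise RuntimeError(
        f"{project} is provided by multiple repositories: {sorted(repos)}"
    )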

I think that completely protects against dependency confusion attacks by default, while limiting the breakage as much as possible, and where it can't be limited, giving as much power as possible to the owners of the projects to resolve it for all of their users, while still giving end users ultimate control to make these decisions.

Maybe another part of this discussion should be how we publicise whatever we decide, and prepare users for any breakage? Having a better message than the often-derided "we're doing this because it's good for you" one that people often get, might help with acceptance of the inevitable breakage (at whatever level we decide is justifiable).

I've never expected this to be something that just gets YOLO-thrown into a pip release; if I gave off that impression, I apologize. I consider things like "figure out how to manage the transition" to be table stakes for something like this.

I'm trying to get some agreement on what the end state should look like. I think what I've outlined above is pretty good and represents as little breakage as we can manage while still actually solving the problem. Once we have a destination in mind, I was planning to figure out how to chart a course from where we are now to where we want to be in a way that reduces end-user friction as much as possible.

I personally think that figuring out how to communicate what we decide, and more broadly how to manage the transition, can only come once we have a destination, because what that destination is often influences what the transition looks like.

[^1]: We need to be careful that we don't allow extra files that aren't on all repositories if we do this, because otherwise the attacker in torchtriton would have just needed to upload all the real torchtriton files to PyPI plus their malicious ones. IOW, it has to be a perfect overlap of all files, not just some of them.

[^2]: This is optional; I'm not sure whether it's too confusing for people or not.

pradyunsg commented 1 year ago

Another option is to have an explicit ordering of the repositories, such that the "first" repository we find the name in is treated as the canonical place, i.e. use the first index found. If the ordering is [X, Y, Z] and we find project-name in X, pip will use that and ignore all other indexes.

It's less disruptive since it'll break fewer workflows... Notably, it won't even break (1) or (2), and it prevents the "someone else published something to a public repo and we silently started using it" case, as long as we have clear-enough semantics.

In our case, it ends up appending reversed([options.index_url, *options.extra_index_urls]), and most existing workflows that leave pypi.org as the default index URL will work as expected.

Am I missing something about this problem, because this feels suspiciously straightforward?
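A minimal sketch of that first-match-wins rule, assuming a hypothetical fetch(index, project) helper that returns the links an index serves for a project:

def links_with_priority(project, ordered_indexes, fetch):
    # ordered_indexes is e.g. [options.index_url, *options.extra_index_urls]
    # arranged in priority order.
    for index in ordered_indexes:
        links = fetch(index, project)
        if links:
            # The first index that knows the project at all wins; later
            # indexes are ignored entirely rather than merged in.
            return links
    return []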

pfmoore commented 1 year ago
This will break by default, and is the explicit goal to break it. Given two unrelated projects being served by the same name, pip does not actually know which project the user meant, and should not guess in the face of this ambiguity, not just because of the security concern, but also just because there's a 50% chance pip just gets it wrong and confuses the user.

My concern here is: doesn't this just open up another form of attack? An attacker just uploads a new torchtriton package to PyPI, which immediately triggers "duplicate file" errors for pytorch users. And the pytorch project has no remedy, as they have no legitimate claim to the torchtriton name on PyPI, so they can't ask for it to be taken down[^1].

[^1]: If you're not convinced, imagine it was a less well known name clash, like fb-utils, which for all I know might be an internal utility package at Facebook.

trishankatdatadog commented 1 year ago

Am I missing something about this problem, because this feels suspiciously straightforward?

It's not a bad idea, because it should work for most use cases. However, there is at least one problem: suppose that a company like Meta wants to reserve the entire namespace fb-* (as in @pfmoore's fb-utils example). There is a race condition where an attacker who is aware of this fact could at least temporarily typosquat what should have been internal packages before they are officially reserved on the private index. Admittedly, it's a bit of a stretch of an attack, but it can happen. Thus, for some use cases, it may be important to be able to say "for projects matching a certain pattern, do not fall through to the following indices even if no such project exists on this index."

pfmoore commented 1 year ago

Companies can't reserve namespaces on PyPI, as far as I know, so that's not a legitimate scenario. But if Facebook are using the packages internally, and their internal server is prioritised, the code on PyPI would simply be ignored - so I think this is actually a case where @pradyunsg's suggestion just works, rather than being a problem with it.

And please stop introducing the idea of backtracking - the resolver isn't involved in this part of the process, so backtracking is irrelevant (and confusing, in a discussion which is already confusing enough 🙁)

dstufft commented 1 year ago

I don't think you can rely on the ordering of options.extra_index_url, because that can come from multiple sources:

  • CLI flags
  • Environment Variables
  • various pip configuration files
  • requirements files

While we can in fact define "an order" between those, it feels unlikely that it will always match the order people expect, so I suspect we would need to define a way to explicitly specify the order.

It may break fewer workflows, but I think it breaks workflows in a much more confusing way that is less obvious and harder for end users to diagnose. For instance, an install with pytorch + piwheels has a very subtle interaction depending on what order you put the indexes in: you have to make sure that pytorch comes first, then piwheels, then PyPI; if you accidentally put piwheels first, you're open to all the same problems again. That's of course a misconfiguration on the user's part, but I've never been a big fan of security-sensitive configuration where getting things right is subtle.

An important thing to remember here is that, unless I misunderstand, piwheels is basically equivalent to PyPI in its usefulness for executing a dependency confusion attack, since anything that shows up on PyPI ends up showing up on piwheels automatically.

I think if you're sufficiently careful, explicit ordering can mostly do the right thing, but I worry a lot that most users are not going to understand what "sufficiently careful" actually is, especially after we've spent the better part of 15 years telling them that repositories don't have semantic ordering in pip.

Companies can't reserve namespaces on PyPI, as far as I know, so that's not a legitimate scenario. But if Facebook are using the packages internally, and their internal server is prioritised, the code on PyPI would simply be ignored - so I think this is actually a case where @pradyunsg's suggestion just works, rather than being a problem with it.

I think we have done that (or can), but it's a very manual process so we don't really advertise it (it might even literally be special casing it in the code base, I don't recall).

It doesn't matter though, because as you said, ordered indexes, if the order is correct, isn't vulnerable to any name collision from less prioritized repositories.

trishankatdatadog commented 1 year ago

Companies can't reserve namespaces on PyPI, as far as I know, so that's not a legitimate scenario. But if Facebook are using the packages internally, and their internal server is prioritised, the code on PyPI would simply be ignored - so I think this is actually a case where @pradyunsg's suggestion just works, rather than being a problem with it.

It works only if fb-utils exists on the Facebook index (which comes first), but not PyPI. This is almost always the case, but I'm pointing out that nothing stops attackers from reserving what look like FB packages on PyPI. (Imagine a developer makes the typo fb-utilss, especially with one of those lovely MacBook Pro butterfly keyboards, and attackers have reserved such typos on PyPI.)

And please stop introducing the idea of backtracking - the resolver isn't involved in this part of the process, so backtracking is irrelevant (and confusing, in a discussion which is already confusing enough 🙁)

If it helps, think instead of not "terminating" when needed. Just because fb-utilss doesn't exist on the private FB index, doesn't necessarily mean you want to go on and search for it on PyPI.

dstufft commented 1 year ago

It works only if fb-utils exists on the Facebook index (which comes first), but not PyPI. This is almost always the case, but I'm pointing out that nothing stops attackers from reserving what look like FB packages on PyPI. (Imagine a developer makes the typo fb-utilss, especially with one of those lovely MacBook Pro butterfly keyboards, and attackers have reserved such typos on PyPI.)

I think "fb-utils" not existing on the meta index but existing on PyPI is essentially an unsolvable problem without either requiring users to explicitly declare where each and every package comes from OR removing support for multiple repositories completely and just say that every repository has to provide everything. I don't think it's worthwhile to fret over it.

Likewise I think fb-utilss is essentially an unsolvable problem, like all typosquats are, when you have a repository that anyone can register names on at any time. I don't think it's worthwhile to worry about it here.

Just because fb-utilss doesn't exist on the private FB index, doesn't necessarily mean you want to go on and search for it on PyPI.

I think this just boils down to: any heuristic is going to have this problem unless you require every user to explicitly declare which repo each dependency comes from.

trishankatdatadog commented 1 year ago

I think this just boils down to: any heuristic is going to have this problem unless you require every user to explicitly declare which repo each dependency comes from.

I agree: just allowing for prioritisation of indices with pip now will go a long, long way (which is an improvement over the original proposal of erroring out).

dstufft commented 1 year ago

I agree: just allowing for prioritisation of indices with pip now will go a long, long way (which is an improvement over the original proposal of erroring out).

I still think the error-out proposal is better, FWIW.

That's mostly down to a few reasons:

  1. I don't like the idea of giving ordering semantics, particularly security-sensitive semantics, to repositories after we've spent 15+ years telling people that repositories don't have semantic ordering.
    • This can maybe be mitigated by making the ordering opt in by some explicit ordering mechanism... but then the feature only helps people who opt into it.
  2. I think the failure cases are way less obvious, since pip will just silently install things from the "wrong" repository.
    • This is basically a risk anytime you have a heuristic (in this case, the order someone specified --extra-index-url in) that isn't 100% perfect (which it won't be, due to history). Picking something, whether it's the "right" thing or not, has the benefit that maybe you've guessed correctly and that's one less person broken; however, it has the distinct disadvantage that the people it didn't pick "right" for are left with an install that inexplicably selects a different package than it used to.
    • Basically I think it trades "failing obviously and clearly in more situations" for "failing less obviously and in confusing ways in somewhat less situations".
      1. Whether a particular invocation of pip install ... with a set of repositories is secure or not is subtle, and requires knowing a lot about how each individual repository operates to know if you've got the order correct or not.
      2. AIUI, poetry already implements some form of repository prioritization, whose current behavior roughly matches pip's. They've gone back and forth a lot in the various issues I've read, but it seems like they've been trying to build something that's roughly a repository priority system, and the discussion around it seems full of users confused about how index prioritization actually works. If someone else is already struggling to implement this in a way that makes sense to users, are we going to do better?
      3. It is inherently less flexible, which means that more use cases are going to be broken with no path forward besides running their own repository proxy.
        • E.g., If someone had registered torchtriton for their own project on PyPI, without knowing about the pytorch use of it, there's no universal "right" answer for which one you're supposed to use.
  3. It makes attacks "quiet", which means they're less likely to be noticed, which means that they're more likely to be successful at some future date rather than remediated before they become a problem (say in cases where someone forgets to add their --extra-index-url).

My concern here is: doesn't this just open up another form of attack? An attacker just uploads a new torchtriton package to PyPI, which immediately triggers "duplicate file" errors for pytorch users. And the pytorch project has no remedy, as they have no legitimate claim to the torchtriton name on PyPI, so they can't ask for it to be taken down.

It opens up another "attack" in the same way that HTTPS verification means that a MITM attacker can DoS you by intercepting your communication and presenting an invalid HTTPS certificate. It turns a potential attack into an error, which is generally what you want IMO, rather than silently guessing at what the "right" answer is.

In the torchtriton case, the torch project shouldn't have made a public repository, designed to be mixed with PyPI, that used names they didn't control on PyPI [^2]. What they did was inherently dangerous and wrong, and I assume it was an oversight. They've since renamed the package they depend on to pytorch-triton and ensured that they've claimed the name on PyPI as well. That's a good thing, and a hard error here would have meant that rather than finding out that they needed a new name because some of their users got compromised, they found out because pip stopped an active attempt to compromise them. [^1]

If they couldn't rename the torchtriton package AND PyPI couldn't or wouldn't grant them the name on PyPI from whoever had it (say, for instance, it wasn't a malicious project, but just an unrelated one), then they've left their users in an actively ambiguous state. There would be no way to determine, without the user's input, whether pip install torchtriton was meant to get the pytorch version from the pytorch repository or the other (valid) project from PyPI. My proposal refuses to guess which one the user wanted, and gives users a mechanism to explicitly say which one they meant. In this case, ordering is going to do the wrong thing if the user wanted the PyPI version.

In the hypothetical fb-utils scenario where someone registers the internal name on PyPI, if you assume that Meta cannot or will not register the name on PyPI, then it's really the same as the torchtriton case above: my proposal gives the hypothetical Meta the tools to specifically control where they get fb-utils from, without guessing which one it meant.

[^1]: In cases like this, I think you're better off renaming the project. It gets too hard to communicate to users how to determine whether they've been compromised or not. You're best off renaming, which gives you a bright, clear delineation of what is known to be safe, and what is not known to be safe.

[^2]: Another way to think about your DoS attack against torchtriton is that even if you managed to correctly "guess" that the torchtriton from the custom repository is the correct one, by silently letting the conflict go unnoticed you're leaving a landmine that someone else might run into later. For instance, with the repository ordering proposal, it's less likely the fake torchtriton would have been discovered, and the number of users it compromised would have been smaller... but probably not zero. With the hard error, it's very obvious that something hinky is going on, so the fake package gets discovered almost immediately, protecting everyone.

pradyunsg commented 1 year ago

I don't think you can rely on the ordering of options.extra_index_url, because that can come from multiple sources:

I disagree.

  • CLI flags
  • Environment Variables
  • various pip configuration files

All of these just overwrite each other in the order specified in https://pip.pypa.io/en/stable/topics/configuration/#precedence-override-order

Each of these is well-ordered internally and the override order is well established and understood at this point.

  • requirements files

These override the default information loaded by the configuration stack. Besides, we control the semantics of this format, and there's an obvious answer for what the ordering should be -- as provided in the file, line by line.

(we're going to reverse that as a later step)
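
To make that concrete, here's a rough sketch (a hypothetical helper, not pip's actual configuration code) of how a layered override produces one well-defined list:

```python
# Hypothetical helper, not pip's real option handling: the highest-
# priority source that sets the option wins outright, and within a
# source the URLs keep the order the user wrote them in.
def effective_extra_index_urls(cli, env, config_files):
    for source in (cli, env, *config_files):
        urls = source.get("extra-index-url")
        if urls:
            return urls
    return []

# CLI flags beat environment variables, which beat configuration files:
print(effective_extra_index_urls(
    cli={"extra-index-url": ["https://LAN/pypi"]},
    env={"extra-index-url": ["https://mirror.example/simple"]},
    config_files=[],
))  # -> ['https://LAN/pypi']
```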

pradyunsg commented 1 year ago

FWIW, color me convinced that the model that @dstufft is proposing will work out (search for "summarize all of that" in this thread).

The migration path for this will be an expensive one, but that's a cost I don't mind incurring.

pfmoore commented 1 year ago

FWIW, color me convinced that the model that @dstufft is proposing will work out

Agreed. I think we've covered the main questions here, and I don't have anything further to add. I agree it's going to be a potentially long and painful transition, and I'm not offering to do any of the work, which makes it easy for me to have opinions (🙂), but I'm happy that we take that path. Assuming, of course, that no-one else has concerns that still need addressing (I don't think @zooba is convinced yet).

I am concerned that we don't have good routes for publicising and educating users (particularly internal corporate users) about changes like this, but I consider it as part of the "transition cost" for someone to come up with a solution there.

pradyunsg commented 1 year ago

Well, I have a few ideas, based on the resolver plans that we had:

I don't mind coordinating some of this TBH (subject to availability of course). This would firmly be in the camp of invisible work (🙃) but I am happy to pick this up once we have this settled down and implemented.

dstufft commented 1 year ago

I’m on my phone currently, but later today I can write up a more complete proposal for the end state that doesn’t involve someone having to read and digest this entire thread + the thread on discuss.p.o

I think that will be useful both for making sure everyone’s understanding is the same and for communicating this change out to people.

We will need to settle on exactly what the option to override which repo something comes from ends up looking like. I have some thoughts on this, but it'll have to wait till I'm back on the computer to expand on them.

dstufft commented 1 year ago

Also as always, a pleasure designing through this with you all :)

zooba commented 1 year ago

My fundamental concern here is that there doesn't seem to be a solution within pip for the case where a 3rd party takes your name on the public index.

Today, I'm doing (incorrectly, but bear with me):

pip install --extra-index-url https://LAN/pypi corp-package

This works fine - it pulls in my corp-package and also its dependencies directly from PyPI.

Then an attacker/innocent bystander comes along and publishes their own corp-package to PyPI. Now, my command above returns an error. Builds break, deployments break, long forgotten pagers start beeping, CEOs join group calls, etc. etc.

I'm safe... but then what am I meant to do? As far as I can tell, the only option is to avoid PyPI, avoid my existing index server, and avoid --extra-index-url and set up a new index that either resolves things properly or is fully curated. If there's an easier option here that I've missed, do let me know, because I'm assuming this is the best case for the rest of my thoughts.

So, we now have a start point, an end point, and a very feasible situation forcing us to go there. But the end point is a long way away, and it's not at all obvious that it's the best there is - surely pip knows that I want the package from the same place I was getting it last week? Surely it knows that when I explicitly specify a URL it should be preferred over invisible defaults? etc.

If this really is the best end point, we should push people towards it much earlier, probably as soon as they use --extra-index-url (without --no-index). They're likely going to need to set up infrastructure to solve an actual collision, and nobody's going to want to do that under pressure of sudden failures.

There are better mitigations for this issue, but they all fundamentally come down to constraining specific packages to a specific index. So maybe I add $env:PIP_CONSTRAIN_INDEX_CORP_PACKAGE="https://LAN/pypi" to my build and that one package gets special cased, but honestly I'd rather make this some kind of syntax in a constraints file.


I'll add that I'm not personally impacted, since at work we decided (before dependency confusion even became public) that the only consistent guidance we could give that all our devs had a chance of following was to use a single index that proxies up to everything else. So we've built the infrastructure, and are in the process of detecting and preemptively breaking anyone who's not doing it, so when something does go wrong we're well ahead.

But I don't think it's right for upstream to push that level of effort down to everyone. I'd love to see a simpler, self-serve option in the tooling, even if [$corp] we ignore it now that [$corp] we have our own approach.

dstufft commented 1 year ago

I'm safe... but then what am I meant to do? As far as I can tell, the only option is to avoid PyPI, avoid my existing index server, and avoid --extra-index-url and set up a new index that either resolves things properly or is fully curated. If there's an easier option here that I've missed, do let me know, because I'm assuming this is the best case for the rest of my thoughts.

You use the proposed configuration option that lets you say "for project X, these are the valid repositories", this would be the second bullet point (also called out in 3b) from one of my earlier posts:

So to summarize all of that, my suggestion is roughly:

  1. Provide an "alt repo" mechanism that projects can use to express ownership over the same name across multiple repositories.
  2. Provide a way to map specific projects to specific repositories, defaulting to mapping them to all of them.
    • This would basically act as a "filter" in PackageFinder
  3. When files for a project are discovered on multiple repositories (post filtering from (2)), error out unless one of:
    a. All of the repositories are owned by the same project via the "alt repo" mechanism.
    b. The project is explicitly mentioned in the "mapping" mechanism (e.g. it's not relying on the default).
    c. The list of files is exactly identical (filename and hash) between all repositories.
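
As a rough sketch of how that check might hang together (placeholder names throughout, not pip's actual PackageFinder internals):

```python
# A sketch, not pip's real code -- all names are placeholders.
def check_single_repository(project, links, alt_repos, repo_map):
    """links: iterable of (repo_url, filename, file_hash) tuples for
    `project`, already filtered by the mapping mechanism from (2)."""
    repos = {repo for repo, _, _ in links}
    if len(repos) <= 1:
        return  # a single repository supplied files: nothing to check
    if repos <= alt_repos.get(project, set()):
        return  # (a) the project claims all of these repositories
    if project in repo_map:
        return  # (b) the user explicitly mapped this project
    per_repo = [
        frozenset((f, h) for r, f, h in links if r == repo)
        for repo in repos
    ]
    if len(set(per_repo)) == 1:
        return  # (c) every repository advertises identical files
    raise RuntimeError(
        f"{project!r} found on multiple repositories: {sorted(repos)}"
    )
```
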
zooba commented 1 year ago

Okay, so 2 is in there (much like what I was getting at by the end of my scenario). I think I read 1 and had that in my head as "probably infeasible" and 2 didn't quite sink in.

Marketing-wise, I'd suggest focusing on that feature, with "name collisions now cause errors" falling out from ambiguity that can (now) be resolved by the user. The rest of the points are really edge cases or publisher responsibilities, and make the whole thing seem more complicated than it is (to the user).

pfmoore commented 1 year ago

I think this is a fair point. Making the key point of this rollout that "you can now select on a per-package basis precisely where the code you download will come from", rather than "we're making things more secure (by blocking problematic workflows)" is a much better message. Sell it as "we're giving you the tools to ensure you're fully protected from dependency confusion attacks, while switching the default to a model that's inherently secure for people who want things to just work" and we get both the "you're in control" and the "we're more secure" messages across[^1].

And the key thing IMO, is that if we don't want misunderstandings and negative messages spreading before we've even started, we need to start framing things like that now[^2].

[^1]: Someone with better marketing skills than me can probably make that sound better.

[^2]: I think this is the source of my "vague unease feeling" above - it's not that I object to being secure, it's that security features get sold badly - "this may hurt, but it's for your own good". Corporate security theatre perpetrated by IT organisations that are out of touch with what the business needs (the environment I worked in far too often over the last 30+ years 🙁) has made people extremely wary of anything security-related that's "for their own good"...

dstufft commented 1 year ago

Another idea that I just came up with that will make the piwheels case work much more smoothly:

Provide some repository-level metadata (provided by the repository owner, not the owners of individual projects within that repository) that allows them to say that their repository "tracks" another repository.

As I'm discussing this with people, something that has solidified in my thinking is that ultimately the problem has nothing to do with multiple repositories, and everything to do with the fact that each repository is effectively its own distinct namespace, but we're implicitly merging those namespaces into a single flat namespace.

This means that historically pip has assumed that every single repository is sharing the same namespace, rather than implementing different namespaces.

Thus what this proposal effectively becomes is that pip starts to refuse to merge unrelated namespaces together, because that's inherently wrong to do. Then it provides tools for various people in the process to explicitly instruct pip when it is safe to merge those namespaces.

For end users, they get the ability to map a project to one or more repositories (which are really just namespaces), which also serves double duty by allowing them to choose which namespaces should be discarded.

For project authors, they get the ability to define the list of all the repositories (aka namespaces) where their project resides, which lets us know that for that project, it is safe to merge those namespaces. This has to be a bidirectional link because, when merging multiple namespaces, we don't know which one should "win"; if it were unidirectional we wouldn't know whether we should trust it, but a bidirectional link demonstrates ownership of both sides.

Currently we don't have anything for repository operators, because at first I was lumping them in mentally with project owners since both of their data comes "from" the repository API. But in thinking about this more, I don't think that's actually the right way to think about it.

Repository operators are inherently trusted in our system; if you don't trust the repository then you shouldn't use it at all, so there's no reason we can't trust the repository operator to provide unidirectional data here saying that their namespace mirrors or "tracks" another repository's namespace.

That would end up meaning that the piwheels case doesn't even require end user configuration anymore, because the piwheels repository operators can specify that their namespace "tracks" the PyPI Namespace.
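
As a tiny illustration (made-up data structures, not a proposed API), the conflict check would simply collapse tracked namespaces before counting them:

```python
# Collapse repositories that declare they "track" another repository
# before counting how many distinct namespaces served the project.
def distinct_namespaces(serving_repos, tracks):
    return {tracks.get(repo, repo) for repo in serving_repos}

tracks = {"https://www.piwheels.org/simple": "https://pypi.org/simple"}
serving = {"https://pypi.org/simple", "https://www.piwheels.org/simple"}
# Both collapse into PyPI's namespace, so no conflict error is raised:
assert len(distinct_namespaces(serving, tracks)) == 1
```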

fridex commented 1 year ago

That idea is basically to encode a list of "alternate repositories" in the repository API response, and allow files to come from multiple repositories in the case that all repositories agree on those alternate repositories.

Does it mean the torch index would need to expose a JSON API to support alternative_repositories and the subsequent intersection computation?

  1. When files for a project are discovered on multiple repositories (post filtering from (2)), error out unless one of: ... c. The list of files are exactly identical (filename and hash) between all repositories

If I understand this correctly, this could be a pretty expensive check, as it might require downloading artifacts to check hashes - especially for mirrors that do not provide a hash as part of the URL, since PEP 503 does not enforce the hash part:

The URL SHOULD include a hash in the form of a URL fragment with the following syntax: #<hashname>=<hashvalue>, where <hashname> is the lowercase name of the hash function (such as sha256) and <hashvalue> is the hex encoded digest.
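
For illustration, reading such a fragment when the repository does advertise one might look like this (a toy helper, not pip's code):

```python
from urllib.parse import urlsplit

# Extract the optional PEP 503 hash fragment from a file URL;
# returns None when the repository advertised no hash at all.
def parse_hash_fragment(url):
    fragment = urlsplit(url).fragment
    if "=" not in fragment:
        return None
    hashname, _, hashvalue = fragment.partition("=")
    return hashname.lower(), hashvalue

url = "https://example.org/pkg-1.0-py3-none-any.whl#sha256=deadbeef"
print(parse_hash_fragment(url))  # -> ('sha256', 'deadbeef')
```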

dstufft commented 1 year ago

Does it mean the torch index would need to expose a JSON API to support alternative_repositories and the subsequent intersection computation?

Unless we add it to the HTML content type as well, yes it would.

If I understand this correctly, this could be a pretty expensive check, as it might require downloading artifacts to check hashes - especially for mirrors that do not provide a hash as part of the URL.

My intention with this was to only support it if the hashes were provided, so it shouldn't have been particularly expensive and just provided an enhancement in cases where the hashes were present and all of the files matched.
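
Concretely, the heuristic only ever amounted to something like this (a toy sketch relying solely on advertised hashes -- nothing is downloaded to compute a digest):

```python
# Heuristic (c): repositories are interchangeable only when they
# advertise exactly the same (filename, hash) pairs for the project.
def identical_file_sets(repo_files):
    """repo_files: repo URL -> set of (filename, advertised_hash)."""
    file_sets = list(repo_files.values())
    return all(fs == file_sets[0] for fs in file_sets)

assert identical_file_sets({
    "https://pypi.org/simple": {("pkg-1.0.tar.gz", "sha256:aa")},
    "https://mirror.example/simple": {("pkg-1.0.tar.gz", "sha256:aa")},
})
```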

However in thinking about it more, I've decided that this particular heuristic is a bad idea. It really only could help in perfect 1:1 mirrors, but there are more situations where a project might "track" another repository that it would be nice to support in a clean way.

The real nail in the coffin, though, was realizing that even in the narrow case it is useful for, it would produce a really poor user experience. Mirroring a repository effectively creates an eventually consistent distributed system where the mirror is always going to lag behind the primary, which means that during the lag period the two repositories will not have identical lists of files and would trigger the error condition - but that would be a temporary failure that typically resolves itself within a few minutes, once the mirror has caught up. That sounds like an absolutely miserable experience for end users, and in thinking about it I came up with the feature allowing a repository to declare that it is tracking another repository, which solves the exact 1:1 mirror case AND the piwheels case in a way that isn't prone to sporadic failures.

dstufft commented 1 year ago

I've gone ahead and written a document that's one part communication about the intended change, one part PEP-like document describing the change, and one part summarizing all the various ideas that have come up and why I've rejected them.

This document isn't itself authoritative in any way, I wrote it to try and collect all this information in one place that didn't require reading all of the back and forth across multiple threads of discussion going back multiple years. I've also tried to explicitly spell out what my proposal is in that document so that people don't have to try and reconstruct it from the multiple posts it's spread across in this thread as we talked it out and the idea evolved from that ongoing discussion.

I've shared it privately with some people, but I'm posting it here, in a semi-public place that likely doesn't have a super wide audience yet, to see what the folks participating in this thread think about it. If everyone is on the same page and thinks it's reasonable, I'll go ahead and post it to discourse too (or if someone else wants to, that's fine too).

Obviously there are bits in this doc that require changes to the PyPI Repository API, which will ultimately need a PEP. Luckily, this doc is written in a PEP-like fashion, so it should be pretty easy to copy most of it out into a PEP, but I figure we can cross that bridge when we come to it. Those PyPI changes are pretty useless without the pip-specific changes that consume them, so there's no point in adding them until we make sure everyone is happy with where we're at.

Also, please treat any specific names, CLI flags, etc. as placeholders; bikeshedding names at this point isn't super useful, and we'll have plenty of time for that later. I just threw in whatever came to mind first to make the document less hand-wavy.

Anyways, check out Securing pip against Dependency Confusion and let me know what you all think! [^1]

[^1]: I apologize in advance, sooner or later I'll figure out how not to type 6000 words just to present a single idea.

fridex commented 1 year ago

Anyways, check out Securing pip against Dependency Confusion and let me know what you all think!

On a high level, I really like the approach, and I think it is a step forward in tackling this problem.

+1 on the proposed idea to assign packages to indexes. When it comes to the content of the proposed repository file, would it make sense to make it more general, so that the content could eventually become part of pyproject.toml (to avoid introducing another file)?

Does it mean the torch index would need to expose a JSON API to support alternative_repositories and the subsequent intersection computation?

Unless we add it to the HTML content type as well, yes it would.

If there is a need to create a repository by spinning up an HTTP server and serving distributions (which is convenient), this might require some additional configuration in the HTML template on the HTTP server side (if that's even possible). What about providing the ability to have a special file alongside the distributions that would allow declaring alternative repositories?

If I understand this correctly, this could be a pretty expensive check, as it might require downloading artifacts to check hashes - especially for mirrors that do not provide a hash as part of the URL.

My intention with this was to only support it if the hashes were provided, so it shouldn't have been particularly expensive and just provided an enhancement in cases where the hashes were present and all of the files matched.

However in thinking about it more, I've decided that this particular heuristic is a bad idea. It really only could help in perfect 1:1 mirrors, but there are more situations where a project might "track" another repository that it would be nice to support in a clean way.

Security-wise, the alternative repo idea could solve the issue where two indexes serve two wheels with the same filename but different digests. If alternative repos are declared, that creates a contract to preserve trust between indexes, as stated somewhere above, so the check might not be necessary from a security point of view. Nevertheless, would there be defined behaviour for this case? Or is it generally undefined which distribution would be picked by pip? It looks like the document outlines these cases - wheels which have differences outside of the wheel tags - but I'm not sure what behaviour would be expected here (sorry if it was raised somewhere and I missed it).

pfmoore commented 1 year ago

Unless we add it to the HTML content type as well, yes it would.

I feel like it’s necessary to add this to the HTML form. Otherwise it’s perilously close to deprecating the HTML version, something we claimed we weren’t doing. I know the PEPs say the 2 representations don’t have to stay in line, but omitting a feature that was added to address a security issue feels like a step too far.

Although @fridex’s suggestion of a special file (url? I’m not sure how a special top-level file would look) rather than a data item might be an alternative.

dstufft commented 1 year ago

When it comes to the content of the proposed repository file, would it make sense to make it more general, so that the content could eventually become part of pyproject.toml (to avoid introducing another file)?

I kept it separate from pyproject.toml because pyproject.toml is used in more limited situations than pip itself is, so a separate file is more generally useful.

If there is a need to create a repository by spinning up an HTTP server and serving distributions (which is convenient), this might require some additional configuration in the HTML template on the HTTP server side (if that's even possible). What about providing the ability to have a special file alongside the distributions that would allow declaring alternative repositories?

I feel like it’s necessary to add this to the HTML form. Otherwise it’s perilously close to deprecating the HTML version, something we claimed we weren’t doing. I know the PEPs say the 2 representations don’t have to stay in line, but omitting a feature that was added to address a security issue feels like a step too far.

I was on the fence about adding it to the HTML representation anyways, so I'll go ahead and sort out a reasonable-ish way of doing that and adjust it.

Security-wise, the alternative repo idea could solve the issue where two indexes serve two wheels with the same filename but different digests. If alternative repos are declared, that creates a contract to preserve trust between indexes, as stated somewhere above, so the check might not be necessary from a security point of view. Nevertheless, would there be defined behaviour for this case? Or is it generally undefined which distribution would be picked by pip? It looks like the document outlines these cases - wheels which have differences outside of the wheel tags - but I'm not sure what behaviour would be expected here (sorry if it was raised somewhere and I missed it).

There's nothing in any of the repository API specs that requires filenames to be unique within a single repository, which means there certainly isn't a requirement that they be unique across multiple repositories. We don't define how installers select a file given multiple candidates in the repository API; we just present all the files and let the installers make that determination.

Pip isn't going to start searching alternate locations by default because of this proposal; it explicitly doesn't change where pip is configured to look for files, it just controls which of those locations are used, so it doesn't affect what pip will do in a situation where there are multiple identical files.

For pip the outcome is technically undefined; it's possible that at some later date pip evolves to make it defined, but for now it is undefined.

zooba commented 1 year ago

I kept it separate from pyproject.toml because pyproject.toml is used in more limited situations than pip itself is, so a separate file is more generally useful.

Very much agree with this. We're talking about a system/environment configuration setting here, not a project one (though often yes, you'll have projects that require different environments, but that doesn't change the nature of the setting).

I'm only -0 on creating a new file, though. I think pip.ini/pip.conf or a constraints.txt could handle this adequately, especially given the settings won't actually add any requirements or indexes, but will only constrain those that are requested. Maybe this is something to bikeshed later on though?

Other than that, the proposal looks good. I have a few quibbles with names, but those can definitely wait.

pfmoore commented 1 year ago

Maybe this is something to bikeshed later on though?

Definitely bikeshed later, but I will say up front that I'd like constraints.txt to remain strictly limited to being a list of requirements that filter the candidates pip sees, so I'd prefer any new config to go somewhere else.

zooba commented 1 year ago

strictly limited to being a list of requirements that filter the candidates pip sees

Isn't that what this does? Maybe I've misunderstood (or I'm framing it differently in my own mind), but it seems to be filtering candidates down to a specific index, so then the "multiple indexes represented" check no longer trips and there's no error. I'm definitely imagining a different syntax from what's proposed for the toml file, but I'm also convinced it's just filtering out candidates.

pfmoore commented 1 year ago

Let's discuss it later, but what I'm saying is I don't want to change the existing syntax of constraints.txt.

dstufft commented 1 year ago

Agree on it not being important at this stage. That’s just an implementation detail of the general idea.