pypa / pip

The Python package installer
https://pip.pypa.io/
MIT License
9.52k stars 3.03k forks source link

Using multiple PIP indexes on the same hostname with different credentials does not work #10902

Open roman-kouzmenko opened 2 years ago

roman-kouzmenko commented 2 years ago

Description

I have a need to simultaneously access two PIP indexes from the same hostname (pkgs.dev.azure.com) but using different credentials.

When configuring it like this:

PIP_INDEX_URL=https://build:password1@pkgs.dev.azure.com/feed1
PIP_EXTRA_INDEX_URL=https://build:password2@pkgs.dev.azure.com/feed2

pip seems to try credentials for feed2 for both feed1 and feed2 failing my builds.

I've worked around this for now by setting the same credentials for both feeds.

Expected behavior

feed1 credentials are used with feed1 and feed2 credentials are used with feed2

pip version

21.1.3

Python version

3.9

OS

linux

How to Reproduce

  1. Create two Azure feeds in different organizations, for example pkgs.dev.azure.com/org1/_packaging/org-feed/pypi/simple and pkgs.dev.azure.com/org2/_packaging/org-feed/pypi/simple
  2. Upload package1 to feed1, package2 to feed2
  3. Generate different personal access tokens PAT1 and PAT2 for the two feeds
  4. Set environment variables `IP_INDEX_URL=https://build:PAT1@pkgs.dev.azure.com/org1/_packaging/org-feed/pypi/simple and PIP_EXTRA_INDEX_URL=https://build:PAT2@pkgs.dev.azure.com/org2/_packaging/org-feed/pypi/simple
  5. Run pip install package1 package2

Output

pip interactively prompts for username breaking the build instead of installing the two packages.

Code of Conduct

pradyunsg commented 2 years ago

I think this is expected behaviour. I suggest reaching out to Azure support, asking them to explore solutions for this on their end.

Our authentication mechanisms have the requirement/assumption baked in that each domain will only have a single username-password pair (or auth token) associated with it. A PR documenting this would be welcome.

21.1.3

Can you confirm that the behaviour is same with the current release of pip?

roman-kouzmenko commented 2 years ago

Just tested and confirm that the behaviour is the same on the current release. Let me also raise a PR to document this.

I will expore the Azure support suggestion too.

It would be nice if it worked as it is possible to pass different authentication to a same domain and it's natural to expect it to work.

It took me around an hour to understand why my pipeline was not working as this was all hidden behind something like this in Azure pipelines that automatically populates the PIP_INDEX_URL and PIP_EXTRA_INDEX_URL environment variables:

- task: PipAuthenticate@1
  displayName: 'Authenticate to feeds'
  inputs:
      pythonDownloadServiceConnections: feed1, feed2
potiuk commented 2 years ago

I believe it's something that shoud be fixed on the pip side as Azure's scheme is fully compliant with RFC7617 2.2.

Reusing Credentials

Given the absolute URI ([RFC3986], Section 4.3) of an authenticated request, the authentication scope of that request is obtained by removing all characters after the last slash ("/") character of the path component ("hier_part"; see [RFC3986], Section 3). A client SHOULD assume that resources identified by URIs with a prefix-match of the authentication scope are also within the protection space specified by the realm value of that authenticated request.

pradyunsg commented 2 years ago

In which case, a PR fixing this would be welcome. :)

potiuk commented 2 years ago

In which case, a PR fixing this would be welcome. :)

Actually I've planned to take a look at adding it :)

roman-kouzmenko commented 2 years ago

In which case, a PR fixing this would be welcome. :)

Actually I've planned to take a look at adding it :)

I looked a bit at the code, I think the fix should be simple by replacing netloc with the "root repository URL" when we cache credentials over here: https://github.com/pypa/pip/blob/ec8edbf5df977bb88e1c777dd44e26664d81e216/src/pip/_internal/network/auth.py#L202

What I am not clear about however is how does one define a "root repository URL" exactly.

PEP503 seems to define it to stop after simple.

However, in the Azure feed case, my index URL looks like pkgs.dev.azure.com/ORG_NAME/_packaging/FEED_NAME/pypi/simple, but i see pip install trying to resolve URLs (and fetching credentials for them) such as pkgs.dev.azure.com/ORG_NAME/_packaging/FEED_NAME_UUID/pypi/download.

Two issues with this:

  1. The root seems to be at pypi which should be fine to implement.
  2. The Azure feed name gets replaced by its UUID in subsequent calls. Would these be redirects or some HATEOAS features of how the repository API works?
potiuk commented 2 years ago

I looked a bit at the code, I think the fix should be simple by replacing netloc with the "root repository URL" when we cache credentials over here:

https://github.com/pypa/pip/blob/ec8edbf5df977bb88e1c777dd44e26664d81e216/src/pip/_internal/network/auth.py#L202

Yep. looks like the right place.

What I am not clear about however is how does one define a "root repository URL" exactly.

PEP503 seems to define it to stop after simple.

I believe PEP503 does not define the "root" repository to have "simple". The "simple is just an example (this is how PyPI indesx define it) but it can be arbitrary path (also one including '/' - that's how I read it at least). However it does define that there must be a "project" as the last part of the path or how long the "root" URL is.

However, in the Azure feed case, my index URL looks like pkgs.dev.azure.com/ORG_NAME/_packaging/FEED_NAME/pypi/simple, but i see pip install trying to resolve URLs (and fetching credentials for them) such as pkgs.dev.azure.com/ORG_NAME/_packaging/FEED_NAME_UUID/pypi/download.

I think that should not really matter. From https://datatracker.ietf.org/doc/html/rfc7617#section-2.2 - the "security" context can also match the prefix - not the whole "URL" including the path (also if you look further - it does not decide what to do in case of overlapping contexts - it leaves it to the decision of the client).

So I think what should happen here, the authentication information should be kept in an an array with (prefix, authinfo) tuples, not a dictionary. and the check should not find the exact match, but it should find the longest match from all registered prefixes.

pfmoore commented 2 years ago

PEP 503 defines the API in tems of the root URL. I don't know enough about URL semantics to be definitive, but I'm pretty sure the expectation was that a root URL would just be something like https://example.com/. Things like query and fragment parts make no real sense, for example. The username and password parts are a grey area. I'm 99% certain that the expectation there was that username/password would not be part of the root URL, but would count as additional data (in the sense that it should make no difference whether the root was specified as https://me:mypass@example.com/ or as https://example.com/, with the username and password supplied "out of band" (e.g. via a dialog box)). In particular, I'm pretty certain we never considered the possibility of two indexes that only differed in authentication data.

So yeah, if you're looking at "what does the PEP say we should do", I think the answer isn't clear. I feel as though the Azure behaviour is unusual enough that this is likely to be just the tip of the iceberg here, and we may want to hold off until we better understand how (or even if) their behaviour conforms to the relevant standards - possibly even updating the standards as needed. I don't want pip's behaviour to end up being a "de facto" requirement for all tools, so I'm concerned about making behavioural decisions here as opposed to just following the spec.

Ping @zooba for his Azure expertise.

potiuk commented 2 years ago

Any comments on the RFC7617 @pfmoore ? For me this is a "lower-level" standard (pretty established) about generic basic-authentication behaviour, which includes authentication scopes.

At least as I read it it is quite clear - this is a basic authentication, and 'path' is part of the authentication scope, so if a user matches "URL/path". it does not match "URL/other_path". At least this is what I read from it.

I am not sure PEP 503 is in any relation to that?

potiuk commented 2 years ago

(BTW. I am just about to submit PR "fixing" it so we can discuss looking at the code proposal)

potiuk commented 2 years ago

Submitted #10904 - which should be pretty complete implementation of multi-domain matching (@roman-kouzmenko - you might want to install via pip install git+https://github.com/potiuk/pip.git@fix-behaviour-of-auth-information and check if my change works for you).

roman-kouzmenko commented 2 years ago

At least as I read it it is quite clear - this is a basic authentication, and 'path' is part of the authentication scope, so if a user matches "URL/path". it does not match "URL/other_path". At least this is what I read from it.

That's how I interpret this as well but I don't think it will solve the azure issue as mentioned above as it might transform URL/path to URL/better_path_for_azure (still not sure if it's done by pip through redirects or HATEOAS).

Anyway, will try your PR tomorrow morning (CET) and report back.

potiuk commented 2 years ago

I am also not sure if this is a "full" solution (for me this is a first deeper look at pip code and it is about passwords and authentication so I woudl be cautious with my own change :). But I think I reviewed all the places where the "passwords" field of the multi-domain authentication was used though and reviewed/updated all the tests (and added new ones) and tested them locally so I am reasonably confident it might work.

There is a part about redirect and it is HTTP 401 unauthorized - so I am pretty sure there is no HATEOAS.

It works in the way that it stored the authentication URL also for the redirect URL, which I am not sure is the best thing to do so it might need some slight changes too:

        # Store the new username and password to use for future requests
        self._credentials_to_save = None
        if username is not None and password is not None:
            self.passwords[parsed.netloc] = (username, password)

UPDATE: Not sure about the redirect (401 is unauthorized of course, silly me) - but I am almost sure there is no HATEOAS and the prefix matching should work just fine (providing of course that the direction of solving the problem will be accepted).

potiuk commented 2 years ago

Also @roman-kouzmenko - it shoudl not matter in your case because I am matching the longest prefix and take the original urls only (which are - I understand https://pkgs.dev.azure.com/feed1 and https://pkgs.dev.azure.com/feed2) so - no matter what kind of redirection happen IMHO it should work.

pfmoore commented 2 years ago

Any comments on the RFC7617 @pfmoore ?

Nope, I have no knowledge of (or interest in) the intricacies of URL handling.

Note that the documentation on how pip handles multiple indexes is here. It's not very much, as the algorithm is pretty simple currently. This discussion adds a lot of complexity, and whatever conclusion we come to should be incorporated in that section of the docs. And personally, I'm of the view that if we can't describe the behaviour sufficiently concisely (by which I mean, if we double the size of that section it's too much) then the behaviour is too complex.

Also, I should say that I'm not even convinced that we should be supporting this. I appreciate that refusing to support Azure would be a pretty big issue, but conversely, I think we should be discussing whether Azure could implement something that requires less complexity from client tools (and remember, pip is not the only client in this situation - devpi-server, for example, will presumably have the same sort of issues mirroring Azure indexes).

roman-kouzmenko commented 2 years ago

@potiuk, thanks for the PR, I like your solution.

Note there is a bug here I believe, this does not compute the length of the matching prefix but the number of matching characters at the same position anywhere within the two strings: https://github.com/potiuk/pip/blob/c1945603474c58d561c54a44e2d640e3f6bca7d3/src/pip/_internal/network/auth.py#L203

We could use something like this instead

def calculate_common_prefix_len(string1: str, string2: str) -> int:
    for i, (char1, char2) in enumerate(zip(string1, string2)):
        if char1 != char2:
            break
    return i
roman-kouzmenko commented 2 years ago

Any comments on the RFC7617 @pfmoore ?

Nope, I have no knowledge of (or interest in) the intricacies of URL handling.

Note that the documentation on how pip handles multiple indexes is here. It's not very much, as the algorithm is pretty simple currently. This discussion adds a lot of complexity, and whatever conclusion we come to should be incorporated in that section of the docs. And personally, I'm of the view that if we can't describe the behaviour sufficiently concisely (by which I mean, if we double the size of that section it's too much) then the behaviour is too complex.

Also, I should say that I'm not even convinced that we should be supporting this. I appreciate that refusing to support Azure would be a pretty big issue, but conversely, I think we should be discussing whether Azure could implement something that requires less complexity from client tools (and remember, pip is not the only client in this situation - devpi-server, for example, will presumably have the same sort of issues mirroring Azure indexes).

I don't think anything would need to change in the documentation, the proposal simply improves how credentials are cached: right now, it is already possible to use two indexes on the same domain (such as https://pkgs.dev.azure.com/feed1 and https://pkgs.dev.azure.com/feed2) but only if credentials are the same for the two indexes. If they are different, one of them will randomly overwrite the other as the cache key is simply the domain name.

roman-kouzmenko commented 2 years ago

UPDATE: Not sure about the redirect (401 is unauthorized of course, silly me) - but I am almost sure there is no HATEOAS and the prefix matching should work just fine (providing of course that the direction of solving the problem will be accepted).

I suspect most likely requests is following redirects automatically which is its default behaviour.

potiuk commented 2 years ago

Note there is a bug here I believe, this does not compute the length of the matching prefix but the number of matching characters at the same position anywhere within the two strings:

Yeah. I fixed it in the latest fixup - and added a test there. This is now quite a bit simpler but also strictly follows the RFC (i.e. only full match of the URL witht authentication is used. Previously partial matches were also possible and feed2 could reuse authentication with feed1 which was not really correct as those are two different authentication scopes. The fix is simpler and more readable.

I've also proposed a documentation update and NEWS update to explain the change in behaviour

zooba commented 2 years ago

I've only skimmed the discussion, but probably it should be caching anything feed related against the index URL as provided by the caller rather than just the netloc.

I noted this when implementing the keyring interface, and there's logic somewhere that keeps track of the original index URL leading to a request for this purpose, but I didn't have a reason to change the netloc-based caching at the time.

potiuk commented 2 years ago

I've only skimmed the discussion, but probably it should be caching anything feed related against the index URL as provided by the caller rather than just the netloc.

Yep. That's what my proposed PR #10904 does (and it also follows the RFC 3986 with regards to prefix-matching (though that part might not really by needed, I hardly imagine two pip repositories where one would be subpath of the other).

fliiiix commented 1 year ago

I think this issue here should also be closed since it seems like the plan is to keep the way it is right now https://github.com/pypa/pip/pull/10904#issuecomment-1194192733

pjstevns commented 1 year ago

Please to not close this issue.

It affects also users of gitlab where different groups in gitlab have their own package repositories, each using their own credentials.

It used to work, but no longer does (don't know since what version of pip). It now forces us to use separate requirements files for a single project, since each file basically support one set of credentials.

10io commented 1 year ago

It used to work, but no longer does (don't know since what version of pip). It now forces us to use separate requirements files for a single project, since each file basically support one set of credentials.

@pjstevns

The answer is here. Long story short, we (yes, I work at GitLab) deployed a security fix where we now enforce the credentials in the file download endpoints.

Due to the situation described in this issue (multiple indexes with different credentials), $ pip is sending the wrong credentials and the GitLab PyPI registry will not be happy about that (💥).

I described a possible workaround that might work depending on your situation: using a group deploy token where the target group contains all projects targeted by the index urls.

potiuk commented 1 year ago

I just stumbled on those comments while catching up with some pip message. Since I noticed my comment was mentioned, I wanted to maybe help others by offering a summary of what I understand about the issue.

Note - this is not a complaints or critique (so pleasee pip maintainers don't take it so). It's just explanation how I see the state of the issue and explain my experience in trying to solve it, so that others have a full picture.

I initilaly raised this issue as a security issue initially (and formally using responsible disclosure) before I opened the PR. I still think this is an exploitable security vulnerability that is CVE worthy. but it's my personal opinion. In Airflow I am one of the PMC members deciding on the fate of security issues raised, last few months I assigned, solved and published some 6 of CVS reported to us, but I have no powers to decide in pip - so this is just an opinion of random user/contributor, nothing else.

In private discussions that followed the responsible disclosure of mine and also confirmed in the public comment https://github.com/pypa/pip/pull/10904#issuecomment-1194192733 mentioned above, it's been downplayed by pip maintainers and classified as not a security isssue. That's why I openly mention it here - otherwise I would never disclose it in public discussion. But it's the decision of maintainers, and I have no choice but to accept it with all the consequences.

I think maintainers in any project (we've done that in Airflow too) have the full right to declare they do not follow some standards (RFC7617 in this case), and just leave the problem to the companies implementing pip registry serving multiple crendentials (which GitLab apparently did). And if this done deliberately and they take the consequnces of it by knowing potential security implications (I hope I explained it clearly enough by providing some posisble attack scenarios) - they are in full right to decide what to do. Powers comes with responsibility and this is a very clear manifestation of it.

So after long discussions and iterating on the PR, to try to fix it and especially last comment, mentioned - I decided to close it. I generally like to make the world a bit better place one-commit-at-a-time but, in this case if the maintainers declared it as a no-security issue and since it does not impact my or my project's use of pip (Airflow is public, open-source project and generally the authentication affects us very little) and my approach seemed to be too complex and not accepted, I decided to close the PR and spend my energy elsewhere.

I have no good advice for others - other than offering this summary of the state of the issue (at least this is how I see and remember it). That was my own decision for my PR - but - of course - anyone is free to pick my PR of course and maybe they will have better luck. Or maybe approach it differently learning from my story.

I hope this will be helpful summary for anyone interested in this issue.

krosk commented 1 year ago

This is an active problem affecting users in Azure DevOps.

I have faced this exact same issue this week, trying to connect to two feeds from the same Azure domain but different organizations. More specifically:

Workarounds for those coming from Azure DevOps: 1/ As OP mentioned, setting up the same credential for both feeds; in my case I would have to connect to the 1st feed using a service connection with the same user generated PAT as the 2nd feed; whether it is doable depends on if you can generated PATs that are valid cross-organization. 2/ Setting up the 2nd feed as an upstream source of the 1st feed; whether it is doable depends probably on permission levels 3/ Call two separate pip install commands, each one connecting to one feed only; whether it is doable depends on if you can afford to install packages in a certain order, i.e. install the dependencies first from one feed, then install the ones that need dependencies after that.

Cheaterman commented 1 year ago

There's another workaround, one can give a direct link to a wheel using @ if they want to keep separate credentials. Still, that forces one to pin a version for the package, which may not be desirable.

EDIT: Example:

mypackage @ https://user:pw@some_source/pypi/files/c0ffeebabe/mypackage-0.0.1-py3-none-any.whl
Darsstar commented 1 year ago

I probably made this possible with https://github.com/pypa/pip/pull/11698/files#diff-a88f002c8cad3308467fd2d7f55ae33f8e0538dfa275d1052797fbd93e0e3099R120, which is available since 23.1.

I tested it across two ADO organizations, with pip download. I am able to download packages from both feeds/indexes while the package does not exist on the other index.

carlfischerjba commented 1 year ago

I probably made this possible with https://github.com/pypa/pip/pull/11698/files#diff-a88f002c8cad3308467fd2d7f55ae33f8e0538dfa275d1052797fbd93e0e3099R120, which is available since 23.1.

Works for me with pip 23.3.1. Thank you.

Where previously we had to do something like this (dependencies installed separately first):

pip install --extra-index-url "https://user1:token1@gitlab.com/api/v4/projects/12345/packages/pypi/simple" package1
pip install --extra-index-url "https://user2:token2@gitlab.com/api/v4/projects/23456/packages/pypi/simple" package2
pip install --extra-index-url "https://user3:token3@gitlab.com/api/v4/projects/34567/packages/pypi/simple" package3

We can now do this (install just the package we need with dependencies being pulled in by pip):

pip install \
--extra-index-url https://user1:token1@gitlab.com/api/v4/projects/12345/packages/pypi/simple \
--extra-index-url https://user2:token2@gitlab.com/api/v4/projects/23456/packages/pypi/simple \
--extra-index-url https://user3:token3@gitlab.com/api/v4/projects/34567/packages/pypi/simple \
package3

While the two options look very similar, the first one rarely worked properly and usually required specifying versions for the dependencies that matched exactly what package3 needed.