pypa / pip

The Python package installer
https://pip.pypa.io/
MIT License

Unexpected invalid project queries caused by `--extra-index-url` #12397

Open tgolsson opened 10 months ago

tgolsson commented 10 months ago

Description

Hey!

Posted this as a question on IRC but got no response, and I can't find any existing issues, so I'm proceeding with a bug report. I'm seeing a significant slowdown when using --extra-index-url with PyTorch's repositories. After digging into it, I found a lot of 403 Forbidden errors for individual projects that only exist on PyPI. The 403s themselves make sense, but given that --extra-index-url requires a PEP 503-compliant index, those requests should not be made at all -- the root page lists all available projects and could be consulted first.

To push this to the extreme, I did a very simple test with a small random set of deps from the PyPI top 10 plus some of my actual dependencies. Importantly, there are no large torch wheels involved - the biggest package is numpy. The "base" case takes only ~5 seconds, but adding a single torch index adds >20 seconds to that! With a larger set of dependencies (from our actual codebase), each torch repository adds ~40 seconds, while running without them takes only 15 seconds.

After an hour of trying different combos, I generated the graph below. It only uses torch extra indexes, because they're extraordinarily slow, but I think pip greedily querying every project page on every index is the actual issue here. If pip fetched each index's root page once, it could skip all these extra calls and avoid the slowdown.

[Graph: measured pip download times as torch --extra-index-url entries are added]

Expected behavior

pip does not needlessly query project pages for projects that are not listed on a PEP 503 root page.

pip version

23.3.1

Python version

Multiple

OS

WSL/Linux

How to Reproduce

  1. Run this command: pip download -vv --extra-index-url https://download.pytorch.org/whl/cpu/ attrs
  2. Observe the 403s in the log
  3. Note that attrs is not listed on the root index page, so there should have been no query to that index for it (checkable without a per-project request, as sketched below).
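
A PEP 503 root page is just a list of anchors, one per project, so checking whether an index knows a project, without any per-project request, takes a few lines of stdlib Python. A rough sketch (it only lowercases names; full PEP 503 normalization also collapses runs of ., -, and _):

import html.parser
import urllib.request

class ProjectLister(html.parser.HTMLParser):
    # Collect the text of every <a> anchor on a PEP 503 root page.
    def __init__(self):
        super().__init__()
        self._in_anchor = False
        self.projects = set()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_anchor = True

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_anchor = False

    def handle_data(self, data):
        if self._in_anchor:
            self.projects.add(data.strip().lower())

lister = ProjectLister()
with urllib.request.urlopen("https://download.pytorch.org/whl/cpu/") as resp:
    lister.feed(resp.read().decode("utf-8"))

print("attrs" in lister.projects)  # False: the 403'd request was avoidable
print("torch" in lister.projects)  # True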

Output

$ pip download --extra-index-url https://download.pytorch.org/whl/cpu/ attrs -vv
Created temporary directory: /tmp/pip-build-tracker-7fbaxk8_
Initialized build tracking at /tmp/pip-build-tracker-7fbaxk8_
Created build tracker: /tmp/pip-build-tracker-7fbaxk8_
Entered build tracker: /tmp/pip-build-tracker-7fbaxk8_
Created temporary directory: /tmp/pip-download-6m8lybxt
Looking in indexes: https://pypi.org/simple, https://download.pytorch.org/whl/cpu/
2 location(s) to search for versions of attrs:
* https://pypi.org/simple/attrs/
* https://download.pytorch.org/whl/cpu/attrs/
Fetching project page and analyzing links: https://pypi.org/simple/attrs/
Getting page https://pypi.org/simple/attrs/
Found index url https://pypi.org/simple/
Looking up "https://pypi.org/simple/attrs/" in the cache
Request header has "max_age" as 0, cache bypassed
Starting new HTTPS connection (1): pypi.org:443
https://pypi.org:443 "GET /simple/attrs/ HTTP/1.1" 304 0
Fetched page https://pypi.org/simple/attrs/ as application/vnd.pypi.simple.v1+json
  <SNIP for Found link spam>
Fetching project page and analyzing links: https://download.pytorch.org/whl/cpu/attrs/
Getting page https://download.pytorch.org/whl/cpu/attrs/
Found index url https://download.pytorch.org/whl/cpu/
Looking up "https://download.pytorch.org/whl/cpu/attrs/" in the cache
Request header has "max_age" as 0, cache bypassed
No cache entry available
Starting new HTTPS connection (1): download.pytorch.org:443
https://download.pytorch.org:443 "GET /whl/cpu/attrs/ HTTP/1.1" 403 None
Status code 403 not in (200, 203, 300, 301, 308)
Could not fetch URL https://download.pytorch.org/whl/cpu/attrs/: 403 Client Error: Forbidden for url: https://download.pytorch.org/whl/cpu/attrs/ - skipping
Skipping link: not a file: https://pypi.org/simple/attrs/
Skipping link: not a file: https://download.pytorch.org/whl/cpu/attrs/
Given no hashes to check 51 links for project 'attrs': discarding no candidates
Collecting attrs
  File was already downloaded /home/ts/Repositories/erupt/attrs-23.1.0-py3-none-any.whl
Created temporary directory: /tmp/pip-unpack-wvrr13ns
Created temporary directory: /tmp/pip-unpack-_z8e1xys
Successfully downloaded attrs
Remote version of pip: 23.3.1
Local version of pip:  23.3.1
Was pip installed by pip? True
Removed build tracker: '/tmp/pip-build-tracker-7fbaxk8_'


tgolsson commented 10 months ago

I'd be happy to take a stab at fixing this (i.e., use the root page index to elide queries) if it'd have a meaningful chance of being accepted.

pfmoore commented 10 months ago

We don't use the root page because fetching the root page from PyPI on every run would likely be extremely costly (it's pretty huge, and changes often so caching wouldn't help that much, if at all).
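
For a sense of scale, the "pretty huge" part is easy to verify with a throwaway snippet (the exact size varies over time; uncompressed it is on the order of tens of megabytes of HTML):

import urllib.request

# Fetch PyPI's root index page once and measure it.
with urllib.request.urlopen("https://pypi.org/simple/") as resp:
    size = len(resp.read())
print(f"{size / 1e6:.1f} MB for one listing of every project name on PyPI")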

By all means have a look at doing something like this, but be aware that it wouldn't be acceptable to slow down the normal case (downloads served just from PyPI) in the process.

Also, you don't say how many dependencies are involved, but an extra 20 seconds is pretty bad unless you have thousands of dependencies doing failed lookups. Could it be that there's a performance issue on the pytorch index when you try to fetch an unknown package?

tgolsson commented 10 months ago

We don't use the root page because fetching the root page from PyPI on every run would likely be extremely costly (it's pretty huge, and changes often so caching wouldn't help that much, if at all). By all means have a look at doing something like this, but be aware that it wouldn't be acceptable to slow down the normal case (downloads served just from PyPI) in the process.

Agreed and ack'd. Maybe some form of heuristic would make sense, like not using the root page for the --index-url but doing so for each --extra-index-url? Or checking every URL for whether it's PyPI and hardcoding that the root page is skipped there.

Also, you don't say how many dependencies are involved, but an extra 20 seconds is pretty bad unless you have thousands of dependencies doing failed lookups. Could it be that there's a performance issue on the pytorch index when you try to fetch an unknown package?

It is terribly slow, yes. We have 76 deps in our production lockfile (which takes 2 minutes to pip download). I don't think the bad performance is pip's fault, but it's definitely only visible because pip is overly aggressive in its requests.

pfmoore commented 10 months ago

Maybe some form of heuristic would make sense, like not using the root page for the --index-url but doing so for each --extra-index-url? Or checking every URL for whether it's PyPI and hardcoding that the root page is skipped there.

Personally I'd be against hard coding PyPI, or heuristics like different behaviour for --index-url. It's too likely to cause user confusion when changes that should make no difference cause a significant performance problem.

It is terribly slow, yes. We have 76 deps in our production lockfile (which takes 2 minutes to pip download). I don't think that the bad performance is pips fault, but it's definitely only visible because pip is overly aggressive in its requests.

Yeah, that sounds very much like a problem that PyTorch should solve with their index. Checking that a URL doesn't exist should be quick. The fact that you're getting 403 Client Error: Forbidden for url rather than a 404 suggests that the index is doing something more complex than just checking that the project exists.

Maybe the index should implement a fast check for whether the project exists before doing whatever permission handling is making it so slow.
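
A sketch of that kind of fast path, purely illustrative (the names and the WSGI framing here are assumptions, not anything the PyTorch index actually runs):

from wsgiref.simple_server import make_server

# In practice this set would be loaded from the index's own storage.
KNOWN_PROJECTS = {"torch", "torchvision", "torchaudio"}

def app(environ, start_response):
    project = environ["PATH_INFO"].strip("/").split("/")[0].lower()
    if project not in KNOWN_PROJECTS:
        # Cheap in-memory check first: unknown projects 404 immediately,
        # keeping pip's probes fast.
        start_response("404 Not Found", [("Content-Type", "text/plain")])
        return [b"not found"]
    # Slow path: permission checks, S3 lookups, page rendering, etc.
    start_response("200 OK", [("Content-Type", "text/html")])
    return [b"<!DOCTYPE html><html><body>...project page...</body></html>"]

make_server("", 8000, app).serve_forever()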

The more I hear about this, the more I think it's not something we should be working around in pip. Especially as the pytorch project looks like it has far more resources available than pip does...

tgolsson commented 10 months ago

Personally I'd be against hard coding PyPI, or heuristics like different behaviour for --index-url. It's too likely to cause user confusion when changes that should make no difference cause a significant performance problem.

I'm happy to take other suggestions! I'm not familiar with the codebase; if there's a need to special-case one index, I'm happy to do it. I was just spitballing ways of achieving this without having read a single line of the existing code.

Yeah, that sounds very much like a problem that PyTorch should solve with their index. Checking that a URL doesn't exist should be quick. The fact that you're getting 403 Client Error: Forbidden for url rather than a 404 suggests that the index is doing something more complex than just checking that the project exists.

Actually, I think they're literally serving it from an S3 bucket, which returns 403 by default. The way S3 permissions work, 404 means "this doesn't exist, and you're allowed to know that, since you could list the bucket anyway", while 403 means "I can't tell you whether that exists". While I'm sure they could make the listing permissions wider, nothing in PEP 503 says an index has to return anything in particular for URLs that do not exist.

Maybe the index should implement a fast check for whether the project exists before doing whatever permission handling is making it so slow.

The more I hear about this, the more I think it's not something we should be working around in pip. Especially as the pytorch project looks like it has far more resources available than pip does...

I mean, I'm 100% in favour of torch solving all of their publishing issues, of which a slow index is just one I can complain about. From my perspective, though, I'm able to fix issues in pip, which I'd severely doubt for torch's package index. And torch's is just one of many package indexes that may have whatever latency makes this really slow. I can't even say for sure whether their index is slow or just remote: I see ~600ms per request, and if their server is in the US, ~300ms of that could be just literal physical distance. That would imply at least half the solution is me relocating...

edmorley commented 10 months ago

they're literally serving it from an S3 bucket, which returns 403 by default. The way S3 permissions work, 404 means "this doesn't exist, and you're allowed to know that, since you could list the bucket anyway", while 403 means "I can't tell you whether that exists".

AWS S3 doesn't return a 403 by default; it returns a 403 when the bucket permissions haven't been configured suitably for a public static-assets use case (a bucket is private by default, so whatever permissions they have are ones they set up themselves). To make non-existent assets return 404, the bucket owner needs to allow s3:ListBucket, not just s3:GetObject.
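
For illustration only, that permission change could be expressed as a bucket policy like the one below; the bucket name is a placeholder, and note the trade-off that public s3:ListBucket also lets anyone enumerate every key in the bucket:

import json
import boto3

BUCKET = "example-index-bucket"  # hypothetical bucket name
policy = {
    "Version": "2012-10-17",
    "Statement": [
        # s3:GetObject alone serves existing keys but turns missing keys into 403s.
        {"Effect": "Allow", "Principal": "*",
         "Action": "s3:GetObject",
         "Resource": f"arn:aws:s3:::{BUCKET}/*"},
        # Adding s3:ListBucket is what makes missing keys return 404 instead.
        {"Effect": "Allow", "Principal": "*",
         "Action": "s3:ListBucket",
         "Resource": f"arn:aws:s3:::{BUCKET}"},
    ],
}
boto3.client("s3").put_bucket_policy(Bucket=BUCKET, Policy=json.dumps(policy))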

sbivol commented 6 months ago

I hit this bug with a private package repo: adding --extra-index-url https://private-repo in requirements.txt causes pip to request packages from private-repo, get a 404 response, and only then fall back to PyPI.

My private index advertises one package, and the documentation for --extra-index-url seems to contradict observed behavior:

--extra-index-url <url>
              Extra URLs of package indexes to use in addition to --index-url.

At least to me this reads like --extra-index-url would come after --index-url in the hierarchy, not before it.

uranusjr commented 5 months ago

One small correction: pip does not "fall back" to PyPI; it will query PyPI even if the request to the other index does not return a 404. That's how supplying multiple indexes works. If you want more control over which index serves which package, consider putting an index router in front of the multiple indexes and configuring routing at that level instead.
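
A minimal sketch of the router idea: expose one index URL to pip and redirect each project page to either the private index or PyPI (all names and URLs below are placeholders; a real deployment might use a purpose-built tool such as devpi or simpleindex instead):

from wsgiref.simple_server import make_server

PRIVATE_PROJECTS = {"my-internal-package"}  # hypothetical project name
PRIVATE_INDEX = "https://private-repo/simple"
PUBLIC_INDEX = "https://pypi.org/simple"

def app(environ, start_response):
    project = environ["PATH_INFO"].strip("/").lower()
    base = PRIVATE_INDEX if project in PRIVATE_PROJECTS else PUBLIC_INDEX
    # pip follows redirects, so one 302 per project page is all the routing needed.
    start_response("302 Found", [("Location", f"{base}/{project}/")])
    return [b""]

# Point pip at this router alone: pip install --index-url http://localhost:8080/ ...
make_server("", 8080, app).serve_forever()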

Possibly relevant discussion: https://discuss.python.org/t/18242