Open tgolsson opened 10 months ago
I'd be happy to take a stab at fixing this (i.e., use the root page index to elide queries) if it'd have a meaningful chance of being accepted.
We don't use the root page because fetching the root page from PyPI on every run would likely be extremely costly (it's pretty huge, and changes often so caching wouldn't help that much, if at all).
By all means have a look at doing something like this, but be aware that it wouldn't be acceptable to slow down the normal case (downloads served just from PyPI) in the process.
Also, you don't say how many dependencies are involved, but an extra 20 seconds is pretty bad unless you have thousands of dependencies doing failed lookups. Could it be that there's a performance issue on the pytorch index when you try to fetch an unknown package?
We don't use the root page because fetching the root page from PyPI on every run would likely be extremely costly (it's pretty huge, and changes often so caching wouldn't help that much, if at all). By all means have a look at doing something like this, but be aware that it wouldn't be acceptable to slow down the normal case (downloads served just from PyPI) in the process.
Agreed and ack'd. I think some form of heuristic like "don't use the root page for the --index-url, but do for --extra-index-url" would maybe make sense? Or we check every URL for whether it's PyPI and hardcode no-indexing there.
Also, you don't say how many dependencies are involved, but an extra 20 seconds is pretty bad unless you have thousands of dependencies doing failed lookups. Could it be that there's a performance issue on the pytorch index when you try to fetch an unknown package?
It is terribly slow, yes. We have 76 deps in our production lockfile (which takes 2 minutes to pip download). I don't think the bad performance is pip's fault, but it's definitely only visible because pip is overly aggressive in its requests.
I think some form of heuristic like "don't use the root page for the --index-url, but do for --extra-index-url" would maybe make sense? Or we check every URL for whether it's PyPI and hardcode no-indexing there.
Personally I'd be against hardcoding PyPI, or heuristics like different behaviour for --index-url. It's too likely to cause user confusion when changes that should make no difference cause a significant performance problem.
It is terribly slow, yes. We have 76 deps in our production lockfile (which takes 2 minutes to pip download). I don't think the bad performance is pip's fault, but it's definitely only visible because pip is overly aggressive in its requests.
Yeah, that sounds very much like a problem that PyTorch should solve with their index. Checking that a URL doesn't exist should be quick. The fact that you're getting 403 Client Error: Forbidden for url rather than a 404 suggests that the index is doing something more complex than just checking that the project exists.
Maybe the index should implement a fast check for whether the project exists, before doing whatever permission stuff it is that's so slow.
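A server-side fast path like the one suggested could be as small as a membership check before any expensive storage or permission work. The sketch below is illustrative only: KNOWN_PROJECTS, the handler, and its responses are hypothetical, not PyTorch's actual index setup.

```python
# Hypothetical fast-404 sketch for a simple index server (WSGI).
# Unknown projects are rejected from a cheap in-memory set before any
# storage lookup or permission check runs.
KNOWN_PROJECTS = {"torch", "torchvision", "torchaudio"}


def app(environ, start_response):
    # First path segment of /project/ is the project name.
    project = environ.get("PATH_INFO", "/").strip("/").split("/")[0]
    if project not in KNOWN_PROJECTS:
        # Fast 404: no backend work at all for unknown projects.
        start_response("404 Not Found", [("Content-Type", "text/plain")])
        return [b"not found"]
    # Known project: fall through to the (slow) real handling.
    start_response("200 OK", [("Content-Type", "text/html")])
    return [b"<html>...project page...</html>"]
```

Any WSGI server (or a quick call with a fake environ) exercises the fast path without touching storage.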
The more I hear about this, the more I think it's not something we should be working around in pip. Especially as the pytorch project looks like it has far more resources available than pip does...
Personally I'd be against hard coding PyPI, or heuristics like different behaviour for --index-url. It's too likely to cause user confusion when changes that should make no difference cause a significant performance problem.
I'm happy to take other suggestions! I'm not familiar with the codebase. If there's a need to specialcase one index then I'm happy to do it. I was just spitballing some ways of achieving that without having read a single line of the existing code.
Yeah, that sounds very much like a problem that PyTorch should solve with their index. Checking that a URL doesn't exist should be quick. The fact that you're getting 403 Client Error: Forbidden for url rather than a 404 suggests that the index is doing something more complex than just checking that the project exists.
Actually, I think they're literally serving it from an S3 bucket which returns 403 by default. The way S3 permissions work, 404 means "this doesn't exist, which you know because you could see it otherwise", while 403 means "I can't tell you whether that exists". While I'm sure they could make the listing permissions wider, there is nothing in PEP 503 that says they have to return anything in particular for URLs that do not exist.
Maybe the index should implement a fast check for whether the project exists, before doing whatever permission stuff it is that's so slow.
The more I hear about this, the more I think it's not something we should be working around in pip. Especially as the pytorch project looks like it has far more resources available than pip does...
I mean, I'm 100% in favour of torch solving all of their issues with publishing, of which a slow index is just one thing I can complain about. From my perspective, though, I'm able to fix issues in pip, which I'd severely doubt for torch's package index. And torch's is just one of many package indexes that may have whatever latency makes this really slow. I can't even say for sure whether their index is that slow or just remote. I have a ~600 ms response per request. If their server is in the US, ~300 ms of that would be just literal physical distance. And that implies at least half the solution would be me relocating...
they're literally serving it from an S3 bucket which returns 403 by default. The way S3 permissions work, 404 means "this doesn't exist, which you know because you could see it otherwise", while 403 means "I can't tell you whether that exists".
AWS S3 doesn't return a 403 by default; it returns a 403 if the bucket permissions haven't been configured suitably for a public static assets use case (by default a bucket is private, so the permissions they have are whatever they've set up themselves). To make sure non-existent assets return 404, the bucket owner needs to allow the s3:ListBucket permission, not just s3:GetObject.
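For reference, a bucket policy along these lines would make S3 answer 404 instead of 403 for missing keys. The bucket name is hypothetical, and note the trade-off: allowing s3:ListBucket also makes the bucket's full object listing publicly readable.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": "*",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::example-index-bucket/*"
    },
    {
      "Effect": "Allow",
      "Principal": "*",
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::example-index-bucket"
    }
  ]
}
```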
I hit this bug with a private package repo: adding --extra-index-url https://private-repo in requirements.txt causes pip to request packages from private-repo, get a 404 response, and only then fall back to PyPI. My private index advertises one package, and the documentation for --extra-index-url seems to contradict the observed behavior:
--extra-index-url <url>
Extra URLs of package indexes to use in addition to --index-url.
At least to me this reads like --extra-index-url would come after --index-url in the hierarchy, not before it.
One small correction: pip does not "fall back" to PyPI; it'll query PyPI even if the request to the other index does not return 404. That's how supplying multiple indexes works. If you want more control over which index serves which package, consider putting an index router in front of the multiple indexes, and configure routing at that level instead.
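At its simplest, the index-router idea maps each package name to exactly one index, so clients never fan out requests to every configured index. The routing table and URLs below are purely illustrative:

```python
# Hypothetical routing table: package name -> the one index allowed to
# serve it. Anything unmatched goes to the default index.
ROUTES = {
    "torch": "https://download.pytorch.org/whl/cpu/",
    "torchvision": "https://download.pytorch.org/whl/cpu/",
}
DEFAULT_INDEX = "https://pypi.org/simple/"


def route(package: str) -> str:
    """Pick a single index for a package, instead of querying them all."""
    return ROUTES.get(package.lower(), DEFAULT_INDEX)
```

A real router (a reverse proxy or a devpi-style server) would apply the same lookup before forwarding the request upstream.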
Possibly relevant discussion: https://discuss.python.org/t/18242
Description
Hey!
Posted this as a question on IRC but got no response at all and can't find any existing issues, so proceeding with a bug report. I'm noticing a significant slowdown when using --extra-index-url with PyTorch's repositories. After digging into it I'm noticing a lot of 403 Forbidden errors for individual projects that only exist on PyPI. The 403 makes sense, but given that --extra-index-url requires a PEP 503 compliant index those requests should not be made: the root page lists all available projects and should be preferred.

To push this to the extreme, I did a very simple test where I took a small random set of deps from the PyPI top 10 plus some of my actual dependencies. Importantly, no large torch wheels; the biggest is numpy. The "base" case only takes ~5 seconds, but adding a torch index adds >20 seconds to that! With a larger set of dependencies (from our actual codebase), each torch repository adds ~40 seconds, while running without those only takes 15 seconds.
Spending an hour trying different combos, I generated the below graph. This is only using torch extra indexes, because they're extraordinarily slow, but I think pip greedily querying each package is the actual issue here. If it just fetched the index once, it'd get rid of all these extra calls and have no slowdown.

Expected behavior
pip does not needlessly query project pages for projects that are not listed on a PEP 503 root page.
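The elision being proposed amounts to: fetch the index's simple root page once, collect the (PEP 503 normalized) project names it lists, and skip the per-project request for anything absent. A minimal sketch of that check, using only the standard library; the function names are illustrative, not pip internals:

```python
import re
from html.parser import HTMLParser


def normalize(name: str) -> str:
    """PEP 503 normalization: runs of -, _, . become one -, lowercased."""
    return re.sub(r"[-_.]+", "-", name).lower()


class _SimpleIndexParser(HTMLParser):
    """Collects anchor texts from a PEP 503 simple index root page."""

    def __init__(self):
        super().__init__()
        self.projects = set()
        self._in_anchor = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_anchor = True

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_anchor = False

    def handle_data(self, data):
        if self._in_anchor and data.strip():
            self.projects.add(normalize(data.strip()))


def index_has_project(root_page_html: str, project: str) -> bool:
    """True if the root page lists the project; a per-project request
    could be skipped entirely when this returns False."""
    parser = _SimpleIndexParser()
    parser.feed(root_page_html)
    return normalize(project) in parser.projects
```

In practice the parsed name set would be built once per index and reused across all requirements in the resolve.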
pip version
23.3.1
Python version
Multiple
OS
WSL/Linux
How to Reproduce
pip download -vv --extra-index-url https://download.pytorch.org/whl/cpu/ attrs