Closed ehuss closed 1 year ago
> Another option is to only do up to 3 queries, the original, all underscore, and all dashes.
Personally, I think this is reasonable. I was surprised when I saw what lengths it went to.
Here's a list of names on crates.io that use a mix of dash and underscore. There are 623 entries. That's not as many as I feared.
I did some more analysis on the names on crates.io. The current limit of 1024 is definitely excessive, as nothing uses more than 8 dashes/underscores.
We'd need to set the limit to at least 16 in order to have fewer misses than the suggested algorithm of (original, underscores, dashes).
Request limit | Potentially missed crates |
---|---|
1 | 68282 |
2 | 21214 |
4 | 4437 |
8 | 897 |
16 | 177 |
32 | 37 |
64 | 12 |
128 | 2 |
256 | 0 |
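
For comparison, the (original, underscores, dashes) scheme can be sketched roughly as below. This is a minimal illustration, not cargo's actual code: `candidate_names` is a hypothetical helper, and it assumes duplicates should be dropped so that a name without mixed separators costs fewer than 3 queries.

```rust
/// Hypothetical helper: returns the distinct names to query, in order:
/// the name as written, then all separators as `_`, then all as `-`.
fn candidate_names(name: &str) -> Vec<String> {
    let all_underscores = name.replace('-', "_");
    let all_dashes = name.replace('_', "-");
    let mut out = vec![name.to_string()];
    for candidate in [all_underscores, all_dashes] {
        // Skip variants that are identical to one already queued.
        if !out.contains(&candidate) {
            out.push(candidate);
        }
    }
    out
}
```

A name such as `a-b_foo` yields three candidates, while a name with no separators yields only itself. A crate whose real name mixes `-` and `_` differently from the requested name is the case this scheme can miss.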
What about doing the requests in parallel? This is also how `cargo update` works, and a `cargo update` on a normal-sized repository will create requests for a couple hundred crates, so number-wise it's not that excessive (`cat Cargo.lock | rg "^name =" | sort | uniq | wc -l` gives me 279 for the cargo repo, for example). The largest number of `_`s is 8, which gives 256 requests. The problem might be the high ratio of 404s returned, though; maybe some cloud providers don't like that.
If one wants to do "escalation", then one can first make a batch of requests for all names with 1 bit flipped and wait for the answers. If nothing is found, make another batch for all names with 2 bits flipped, and so on, with increasing numbers of bits flipped. The batch sizes are the binomial coefficients, so they decrease again after you reach the halfway point; at that point one could make one last batch with all the bit flippings exceeding n/2.
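
A rough sketch of how those escalation batches could be generated, assuming each `-` or `_` in the name is one "bit". The function name `escalation_rounds` and the guard of 16 separators are made up for this illustration and are not part of cargo.

```rust
/// Hypothetical helper: groups every `-`/`_` variant of `name` by how many
/// separators differ from the original. `rounds[k]` holds the names with
/// exactly `k + 1` separators flipped; the original name is not included.
fn escalation_rounds(name: &str) -> Vec<Vec<String>> {
    let chars: Vec<char> = name.chars().collect();
    // Positions of the separators that may be flipped.
    let seps: Vec<usize> = chars
        .iter()
        .enumerate()
        .filter(|(_, c)| **c == '-' || **c == '_')
        .map(|(i, _)| i)
        .collect();
    let n = seps.len();
    // Arbitrary guard for this sketch so the enumeration stays small.
    assert!(n <= 16, "too many separators for exhaustive enumeration");
    let mut rounds: Vec<Vec<String>> = vec![Vec::new(); n];
    // Each non-zero mask selects the set of separators to flip.
    for mask in 1u32..(1u32 << n) {
        let mut variant = chars.clone();
        for (bit, &pos) in seps.iter().enumerate() {
            if mask & (1 << bit) != 0 {
                variant[pos] = if variant[pos] == '-' { '_' } else { '-' };
            }
        }
        let flips = mask.count_ones() as usize;
        rounds[flips - 1].push(variant.into_iter().collect());
    }
    rounds
}
```

For a name with 8 separators the batches have sizes 8, 28, 56, 70, 56, 28, 8 and 1, which together with the original name gives the 256 requests mentioned above.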
> What about doing the requests in parallel?
Cargo does issue the requests in parallel. The motivation for investigating this was the high number of 404s that infra was seeing in logs.
@arlosi I see. Is there some public discussion about the 404 issue?
I suppose if only 623 entries are affected then it's okay to cut some corners and emit suggestion-free diagnostics in those cases; in all other cases the all-`-` and all-`_` queries will turn up the right result.
Ideally one would have used the migration to http indices to migrate to using the canonicalized name (as suggested in 2020), but I guess now it's too late for that.
> Is there some public discussion about the 404 issue?
Yes, on the crates-io zulip.
Sparse registries have the same format as a checkout of the git index. You can transition by checking out the git index and running a static file server in the resulting folder. That transition path was important enough to me to justify not fixing known problems with the index format. Feel free to blame me for that decision, especially as the decision was made knowingly.
Another decision that complicates things is that alternative registries are allowed to have multiple packages whose names only differ by `-`/`_`. Thanks to package renaming, you can even use both packages in the same project.
Thinking about it more, the currently stable implementation tries all kinds of files and returns the first one it finds. (All kinds of files include ridiculously incorrect things like `a_/b-/a-b_foo`, as fixed in #11936.) It then reads the available packages from that first file. Rows whose `name` matches the value requested by the dependency get tried by the resolver, and if there are none, the names that do not match are included as a suggestion. If I am correct, a registry could have multiple crates that differ only in `-`/`_` with them all listed in the same file in the index. It would work correctly, ignoring the inefficient implementation, even if that one file was at something hilariously invalid.
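
To make the paths in this discussion concrete, here is a small sketch of how an index file path is derived from a package name, following the layout described in the Cargo registry index documentation (`1/`, `2/`, `3/{first char}/`, otherwise `{first two}/{next two}/`). The helper names `index_path` and `canonical_name` are hypothetical, and treating "all `_`" as the canonical form is an assumption for illustration only.

```rust
/// Hypothetical helper: relative index path for a package name, following
/// the documented registry index layout. Assumes a non-empty ASCII name.
fn index_path(name: &str) -> String {
    assert!(!name.is_empty() && name.is_ascii());
    // Index paths use the lowercased name.
    let name = name.to_lowercase();
    match name.len() {
        1 => format!("1/{name}"),
        2 => format!("2/{name}"),
        3 => format!("3/{}/{name}", &name[..1]),
        _ => format!("{}/{}/{name}", &name[0..2], &name[2..4]),
    }
}

/// Assumption for this sketch: the canonical form replaces every `-` with `_`.
fn canonical_name(name: &str) -> String {
    name.replace('-', "_")
}
```

With these definitions, `a-b_foo` maps to `a-/b_/a-b_foo`, while its assumed canonical form maps to `a_/b_/a_b_foo`. A mismatched path such as `a_/b-/a-b_foo` cannot be produced by this function for any single name.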
Assuming all of this is correct, I suggest that:
- […] `_` […] `._` […]) with rows for all packages whose name canonicalizes to the same thing.
- […] `a-/b_/a-b_foo` and `a_/b_/a_b_foo` files in both the git and sparse index.
- […] `a-/b_/a-b_foo` from the sparse and removing `a_/b_/a_b_foo` from the git.

I am going to close this since #11937 and #12083 significantly reduce network requests. If there is something still missing after #12083, feel free to reopen :)
When cargo can't find a crate name in a registry, it will also query every permutation of `-` and `_` replacements (up to 1024 times) to see if there is a different match. When this code was written, it was assumed it would just be a simple query to git, but now that the sparse index does a network round trip for each one, it probably isn't a good assumption that doing 1024 is OK.

I think at a minimum it should lower the limit significantly.
Another option is to only do up to 3 queries, the original, all underscore, and all dashes.