Open epage opened 2 years ago
Do you have any thoughts on how this would work with an HTTP index? I can't think of a way to support that without some kind of registry API.
Ever since I heard about the HTTP index, I've been concerned about this. I do not think that this and #10656 are isolated cases of needing to query what crate names exist; I think this is identifying a gap in the HTTP index. I suspect we'll need to have another file in the HTTP index that is the list of all of the crate names. Hopefully the frequency of new crate names is at a point where we still get caching benefits on that file. A flat list of crate names is 979 KB. We could either organize it into a trie in a single file or split it across multiple, well-defined files if we can't get partial updates of the file over HTTP.
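As a rough sketch of the "multiple, well-defined files" option: one possibility is to reuse the directory prefix scheme the index already applies to per-crate files (`1/`, `2/`, `3/{first letter}/`, `{first two}/{next two}/`) and keep one names file per prefix. The `shard_names` helper and the idea of a per-prefix names file are assumptions for illustration, not an existing registry feature.

```rust
use std::collections::BTreeMap;

/// Directory prefix the cargo index uses for a crate's file,
/// e.g. `se/rd` for `serde`. Assumes an ASCII crate name.
fn index_prefix(name: &str) -> String {
    let name = name.to_ascii_lowercase();
    match name.len() {
        0 => String::new(),
        1 => "1".to_string(),
        2 => "2".to_string(),
        3 => format!("3/{}", &name[..1]),
        _ => format!("{}/{}", &name[..2], &name[2..4]),
    }
}

/// Hypothetical: group every crate name under its index prefix, so each
/// group could be served (and cached) as its own small names file.
fn shard_names<'a>(names: &[&'a str]) -> BTreeMap<String, Vec<&'a str>> {
    let mut shards: BTreeMap<String, Vec<&'a str>> = BTreeMap::new();
    for &name in names {
        shards.entry(index_prefix(name)).or_default().push(name);
    }
    shards
}
```

With this layout a newly published crate only invalidates the one names file for its prefix, at the cost of a client needing several requests to see names that differ in their first four characters.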
A tree of files that are available in the index would also be important for index signing. With the hash information, it is far too big for crates.io to have a flat file. It is 100% something I think we should do. My feeling is that we should not hold up the performance benefits of HTTP indexes for something we can add backwards compatibly.
This doesn't have to be done client-side at use time. It can be done by crates.io at publication time.
At minimum, crates.io could detect which crates have "confusable" names and add some sort of warning metadata to their (sparse http) index files. Note that cached/stale registry data isn't an issue here, because a typosquatted crate is by definition published later than the popular crate it pretends to be.
Levenshtein distance probably works for 90% of cases, but I expect there will be edge cases and complications, so having the ability to update the detection algorithm and add exceptions independently of Rust releases will be valuable. That makes doing this server-side on crates.io the better option.
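For concreteness, here is a minimal sketch of an edit-distance check with an exception list. `levenshtein` is the textbook dynamic-programming algorithm; `confusable_with`, its parameters, and the shape of the exception list are hypothetical and not part of crates.io or cargo.

```rust
/// Textbook Levenshtein distance, keeping only two rows of the DP table.
fn levenshtein(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    let mut curr = vec![0; b.len() + 1];
    for (i, ca) in a.iter().enumerate() {
        curr[0] = i + 1;
        for (j, cb) in b.iter().enumerate() {
            let substitution = prev[j] + usize::from(ca != cb);
            let deletion = prev[j + 1] + 1;
            let insertion = curr[j] + 1;
            curr[j + 1] = substitution.min(deletion).min(insertion);
        }
        std::mem::swap(&mut prev, &mut curr);
    }
    prev[b.len()]
}

/// Hypothetical check: existing names within `max_distance` edits of
/// `new_name`, skipping pairs that moderators have marked as fine.
fn confusable_with<'a>(
    new_name: &str,
    existing: &[&'a str],
    exceptions: &[(&str, &str)],
    max_distance: usize,
) -> Vec<&'a str> {
    existing
        .iter()
        .copied()
        .filter(|&name| name != new_name)
        .filter(|&name| levenshtein(new_name, name) <= max_distance)
        .filter(|&name| {
            // An exception applies regardless of the order of the pair.
            !exceptions.iter().any(|&(x, y)| {
                (x == new_name && y == name) || (x == name && y == new_name)
            })
        })
        .collect()
}
```

Called as `confusable_with("zerde", &["serde", "git"], &[], 2)`, this flags `serde`, while an exception entry such as `("git", "git2")` keeps a known-good pair from being reported.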
In terms of detection, you may also want to check variations with swapped words (`web-actix`), synonyms and grammatical forms (`logger` vs `logging`), neutral suffixes like `-rs`, etc.
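One way to catch swapped words and neutral suffixes is to reduce each name to a canonical key and compare keys instead of raw names. The sketch below is only an assumption about how that could look: `canonical_key` and the `NEUTRAL_WORDS` list are made up for illustration, `-` and `_` are treated as the same separator, and synonyms or grammatical forms (`logger` vs `logging`) would still need a dictionary or stemmer that is not shown here.

```rust
/// Hypothetical list of words that carry no meaning in a crate name.
const NEUTRAL_WORDS: &[&str] = &["rs", "rust"];

/// Reduce a crate name to a key that ignores case, separator style,
/// word order, and neutral words such as a trailing `-rs`.
fn canonical_key(name: &str) -> String {
    let lowered = name.to_ascii_lowercase().replace('_', "-");
    let mut words: Vec<&str> = lowered
        .split('-')
        .filter(|w| !w.is_empty() && !NEUTRAL_WORDS.contains(w))
        .collect();
    // Sorting makes `web-actix` and `actix-web` produce the same key.
    words.sort_unstable();
    words.join("-")
}
```

Two names are then candidates for a "confusable" warning when their keys are equal, for example `canonical_key("web-actix") == canonical_key("actix-web")`. A name made up only of neutral words yields an empty key and would need special handling.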
It will be necessary to choose which crate is the good one and which is the bad one, because you don't want to sow uncertainty and advertise typosquatters when users ask for the good crate, i.e. `cargo add serde` shouldn't say "did you mean `zerde`?"
This choice is tricky. Usually the older crate is the right one, but there's the exception of the `git` vs `git2` crates (fortunately in this case it's not malicious).
It can't simply be the more popular crate, because download numbers can be quickly and easily inflated. crates.io could make faking downloads harder, but overall it's a losing battle, especially when we're dealing with determined malicious actors here.
So the overall choice could be a mix of the crate's age, the age of its owners' accounts, manual moderation overrides, or maybe something fancy like owner reputation based on a PageRank-like algorithm or the cargo-crev web of trust.
> so having ability to update the detection algorithm and add exceptions independently of Rust releases will be valuable, so doing this server-side on crates.io is better.
I can see doing this, though it does increase the lift necessary to get this going, and I suspect we could initially get away with a simpler, client-side approach and then scale up to the server-side approach. The client-side approach is sufficient for registry squatting (#10656). Independent of squatting, the logic needed for doing this client-side can improve the `cargo search` results and help provide spelling corrections for `cargo add` and `cargo info` when a crate name doesn't exist at all. It could also act as a fallback when a registry doesn't support the full feature, which helps keep the minimum feature set for an independent registry down.
Problem
A user might `cargo add fooo` when they mean `cargo add foo` and get the wrong crate.
Proposed Solution
When adding a new registry dependency, warn of dependencies that are an edit distance of 1-2 away from the specified crate. We should probably report their descriptions to hint to the user if the typo is for a different purpose. If the user didn't pass `--offline`, ideally we'd also report download counts, as a very low download count is a likely smell.
Notes
We might also want this for `cargo search` (and `cargo info` if/when that gets added, https://github.com/rust-lang/cargo/issues/948).
See also https://github.com/killercup/cargo-edit/issues/172