Open jspeed-meyers opened 3 years ago
Note that https://github.com/pypa/warehouse/pull/5001 was reverted in https://github.com/pypa/warehouse/pull/8807, as it was far too noisy for us to even use it as a tool for investigating potential squats, let alone do any automated blocks. Part of the problem is that maintainers prefer short project names, so even an edit distance of 2 produces a ton of noise with legitimate projects.
One option is to use something like `ceil(n / 5)` for the edit distance, where `n` is the length of the package name being uploaded and 5 is a parameter for us to tweak.
This would allow for:
FWIW, we don't even have to use a linear function. Something like `min(ceil(n / 5), 3)` would cap the distance at 3, if that's what we want.
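To make the two formulas concrete, here is a minimal Python sketch of the length-dependent threshold. The function name `max_edit_distance` and its `divisor` and `cap` parameters are hypothetical names chosen for illustration; they are not part of Warehouse.

```python
import math
from typing import Optional


def max_edit_distance(name: str, divisor: int = 5, cap: Optional[int] = 3) -> int:
    """Return the allowed edit distance for a package name.

    Implements min(ceil(n / divisor), cap), where n is the length of the
    name. Passing cap=None gives the uncapped ceil(n / divisor) variant.
    """
    threshold = math.ceil(len(name) / divisor)
    return threshold if cap is None else min(threshold, cap)


# Compare the uncapped and capped variants for names of different lengths.
for candidate in ("six", "numpy", "requests", "beautifulsoup4", "opencv-python-headless"):
    uncapped = max_edit_distance(candidate, cap=None)
    capped = max_edit_distance(candidate)
    print(f"{candidate!r}: n={len(candidate)}, uncapped={uncapped}, capped={capped}")
```

With a divisor of 5, names of up to 5 characters get a distance of 1, so short names generate the fewest matches. The cap only takes effect for names longer than 15 characters.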
Understood about the maintainer preference @di, thanks for highlighting the SpellBound research @Julian, and those are all great ideas, @pradyunsg.
With this feedback in mind, I took a closer look at the known typosquatting attacks dataset I collected with @bentztozer, and found that the 40 attacks identified had the following distribution:
| Length of Original Package Name | Count |
|---|---|
| 4 | 1 |
| 5 | 3 |
| 6 | 10 |
| 7 | 6 |
| 8 | 3 |
| 9 | 2 |
| 10+ | 15 |
This suggests that a minimum package name length of 5 is a sensible threshold for such a rule.
The `min(ceil(n / 5), 3)` cap is also an interesting approach.
What do you think, @pypa/warehouse?
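As a quick check on the distribution above, the following sketch tallies what share of the 40 known attacks would fall under a rule with a given minimum name length. The counts are copied from the table; the dictionary and function names are hypothetical, and the "10+" bucket is stored under the key 10 as a simplification.

```python
# Counts of known typosquatting attacks by length of the original package
# name, copied from the table above. The "10+" bucket is keyed as 10.
attacks_by_length = {4: 1, 5: 3, 6: 10, 7: 6, 8: 3, 9: 2, 10: 15}


def share_covered(min_length: int) -> float:
    """Fraction of known attacks whose target name has at least min_length characters."""
    total = sum(attacks_by_length.values())
    covered = sum(
        count for length, count in attacks_by_length.items() if length >= min_length
    )
    return covered / total


for min_length in (4, 5, 6, 7):
    print(f"min length {min_length}: {share_covered(min_length):.1%} of known attacks")
```

With a minimum length of 5, 39 of the 40 attacks (97.5%) still fall under the rule. Raising the minimum to 6 drops that to 36 of 40 (90%).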
What's the problem this feature will solve?
Reduce the total harm typosquatting causes to PyPI users.
Describe the solution you'd like
Block users from uploading new packages with a similar name to any of the current top packages.
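One way such a rule could be sketched is below. This is only an illustration under stated assumptions, not Warehouse code: "similar" is taken to mean plain Levenshtein distance, `TOP_PACKAGES` is a hypothetical three-item stand-in for the real list of top packages, and `conflicting_top_package` and `max_distance` are made-up names.

```python
from typing import Optional


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting single-character insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(
                min(
                    previous[j] + 1,  # delete char_a
                    current[j - 1] + 1,  # insert char_b
                    previous[j - 1] + (char_a != char_b),  # substitute (free on a match)
                )
            )
        previous = current
    return previous[-1]


# Hypothetical stand-in for the list of top packages (assumption for this sketch).
TOP_PACKAGES = ("django", "numpy", "requests")


def conflicting_top_package(new_name: str, max_distance: int = 1) -> Optional[str]:
    """Return the top package that new_name is too close to, or None if there is none."""
    candidate = new_name.lower()
    for top in TOP_PACKAGES:
        # An exact match is an existing name, which is a separate check.
        if candidate != top and levenshtein(candidate, top) <= max_distance:
            return top
    return None


for name in ("requsts", "numpi", "flask"):
    print(name, "->", conflicting_top_package(name))
```

Note that plain Levenshtein distance counts a swap of two adjacent characters as two edits, so a distance of 1 would not catch transposition typos. A real implementation would also need to decide how name normalization interacts with the comparison.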
Additional context
While similar solutions have been proposed before (see below), this particular solution is not a malware check or a predictive model, per se. It is a proposal for a simple rule that users cannot upload any NEW packages that have a similar name to a top package. Full stop. This solution does not stop all typosquatters, but it will likely reduce the harm because typosquatting the top packages will be harder. Typosquatters can either attempt to typosquat less popular packages and therefore harm fewer users or they can use typosquatting attack strategies involving a greater edit distance and also likely harm fewer users.
I should also mention that this proposed feature is not meant to replace a number of other ongoing and related efforts that try to reduce the harm caused by malware on PyPI. Finally, like all approaches to reducing harm from malware on PyPI, there are pros and cons. All debate and critique and suggested revisions are welcome.
Some parties I know will be interested: @di, @ewdurbin, @xmunoz, @benjaoming, @hannob
Some parties who could be interested: @ewjoachim, @brainwane, @pradyunsg, @ncoghlan, @dstufft
Relevant issues and PRs:
Implement a More Robust Malware Detector - Issue #7748
Detect Packages Being Published with Typo-ish Names - Issue #4998
@brainwane rightly noted that if this approach were part of a malware check, it would be virtually guaranteed to generate many false positives. This proposal is distinct: it simply calls for a rule that restricts package name selection, which might technically be called “preclusive namespacing” and informally “social distancing for top PyPI packages.” PyPI administrators can therefore avoid adjudicating whether a given package is malware. PyPI would simply tell users, ideally with an explanation, that such a package name is not allowed, in the name of reducing aggregate typosquatting harm.
Post-registration Alerts for Packages with Similar Names (Typosquatting) - Issue #2268
Monitor New Packages that Might be Typosquats - PR #5001
PSF Fundables - Productionize Malware Detection - Issue #38