pypi / warehouse

The Python Package Index
https://pypi.org
Apache License 2.0
3.51k stars 942 forks

Reduce Typosquatting Harm via Social Distancing for Top PyPI Packages #9527

Open jspeed-meyers opened 3 years ago

jspeed-meyers commented 3 years ago

What's the problem this feature will solve?

Reduce the total harm typosquatting causes to PyPI users.

Describe the solution you'd like

Block users from uploading new packages with a similar name to any of the current top packages.

Additional context

While similar solutions have been proposed before (see below), this particular solution is not a malware check or a predictive model, per se. It is a proposal for a simple rule that users cannot upload any NEW packages that have a similar name to a top package. Full stop. This solution does not stop all typosquatters, but it will likely reduce the harm because typosquatting the top packages will be harder. Typosquatters can either attempt to typosquat less popular packages and therefore harm fewer users or they can use typosquatting attack strategies involving a greater edit distance and also likely harm fewer users.
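To make the rule concrete, here is a minimal sketch of what such a check could look like. It assumes Levenshtein distance as the similarity measure and a fixed maximum distance of 2; the names `TOP_PACKAGES`, `conflicting_top_package`, and `max_distance` are hypothetical and are not part of Warehouse. Names are normalized the way PEP 503 describes before they are compared.

```python
import re
from typing import Iterable, Optional

# Hypothetical stand-in for "the current top packages"; a real list would
# come from download statistics.
TOP_PACKAGES = ["requests", "numpy", "django"]


def normalize(name: str) -> str:
    """Normalize a project name as described in PEP 503."""
    return re.sub(r"[-_.]+", "-", name).lower()


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def conflicting_top_package(
    new_name: str, top_packages: Iterable[str], max_distance: int = 2
) -> Optional[str]:
    """Return the top package that new_name sits too close to, or None.

    An exact match is not reported here: that is an ordinary name
    collision, not a near miss.
    """
    candidate = normalize(new_name)
    for top in top_packages:
        distance = levenshtein(candidate, normalize(top))
        if 0 < distance <= max_distance:
            return top
    return None


print(conflicting_top_package("reqeusts", TOP_PACKAGES))
print(conflicting_top_package("flask-login", TOP_PACKAGES))
```

Under these assumptions the first call reports `requests` as the conflicting package and the second reports no conflict. The rule never has to decide whether the uploaded package is malicious.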

For reference, past analysis by @bentztozer and me found that eighteen of forty past documented typosquatting attacks on PyPI had an edit distance of two or less. Similarly, the analysis found that twenty-nine of the forty past attacks typosquatted packages that were among the 1000 most downloaded packages.

I should also mention that this proposed feature is not meant to replace a number of other ongoing and related efforts that try to reduce the harm caused by malware on PyPI. Finally, like all approaches to reducing harm from malware on PyPI, there are pros and cons. All debate and critique and suggested revisions are welcome.

Some parties I know will be interested: @di, @ewdurbin, @xmunoz, @benjaoming, @hannob

Some parties who could be interested: @ewjoachim, @brainwane, @pradyunsg, @ncoghlan, @dstufft

Relevant issues and PRs:

Implement a More Robust Malware Detector - Issue #7748

Detect Packages Being Published with Typo-ish Names - Issue #4998

@brainwane rightfully mentioned that if this approach were part of a malware check, it would be virtually guaranteed to generate many false positives. This proposal is distinct because it simply calls for a rule that restricts package name selection, in the name of what might technically be called “preclusive namespacing” but what might informally be called “social distancing for top PyPI packages.” PyPI administrators can therefore avoid adjudicating whether a certain package is malware or not. PyPI would simply prevent users from registering such a package name, ideally with an explanation that the name is not allowed in the interest of reducing aggregate typosquatting harm.

Post-registration Alerts for Packages with Similar Names (Typosquatting) - Issue #2268

Monitor New Packages that Might be Typosquats - PR #5001

PSF Fundables - Productionize Malware Detection - Issue #38

di commented 3 years ago

Note that https://github.com/pypa/warehouse/pull/5001 was reverted in https://github.com/pypa/warehouse/pull/8807, as it was far too noisy for us to even use it as a tool for investigating potential squats, let alone do any automated blocks. Part of the problem is that maintainers prefer short project names, so even an edit distance of 2 produces a ton of noise with legitimate projects.
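A small illustration of the short-name problem, using four well-known projects (`pip`, `six`, `tox`, `nox`) and plain Levenshtein distance. This is only a sketch; it makes no claim about how the reverted check measured similarity.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


short_names = ["pip", "six", "tox", "nox"]

# Every pair of legitimate projects that a "distance <= 2" rule would flag.
close_pairs = [
    (a, b)
    for a, b in combinations(short_names, 2)
    if levenshtein(a, b) <= 2
]

for a, b in close_pairs:
    print(f"{a} vs {b}: distance {levenshtein(a, b)}")
```

Four of the six pairs fall within distance 2, even though none of these projects is a squat of another.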

pradyunsg commented 3 years ago

One option is to use something like ceil(n / 5) for the edit distance, where n is the length of the package name being uploaded and 5 is a parameter for us to tweak.

This would allow for:

pradyunsg commented 3 years ago

FWIW, we don't have to use a linear function even. Something like min(ceil(n / 5), 3) would cap the distance at 3, if that's what we want.
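For anyone who wants to play with the numbers, both variants fit in one small function. The name `allowed_distance` and its `divisor` and `cap` parameters are made up for this sketch; `n` is the length of the package name being uploaded, as above.

```python
import math
from typing import Optional


def allowed_distance(n: int, divisor: int = 5, cap: Optional[int] = None) -> int:
    """Edit-distance threshold for a name of length n.

    With cap=None this is ceil(n / divisor); with a cap it becomes
    min(ceil(n / divisor), cap).
    """
    distance = math.ceil(n / divisor)
    return distance if cap is None else min(distance, cap)


# Compare the linear and capped variants for a range of name lengths.
for n in (3, 5, 6, 10, 11, 16, 20):
    print(n, allowed_distance(n), allowed_distance(n, cap=3))
```

Names of up to 5 characters get a threshold of 1, names of 6 to 10 characters get 2, and the capped variant stops growing at 3 once names reach 11 characters.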

jspeed-meyers commented 3 years ago

Maybe 5 is the magic number (at least to test this hypothesis)

Understood about the maintainer preference @di, thanks for highlighting the SpellBound research @Julian, and those are all great ideas, @pradyunsg.

With this feedback in mind, I took a closer look at the known typosquatting attacks dataset I collected with @bentztozer, and found that the 40 attacks identified had the following distribution:

| Length of Original Package Name | Count |
| --- | --- |
| 4 | 1 |
| 5 | 3 |
| 6 | 10 |
| 7 | 6 |
| 8 | 3 |
| 9 | 2 |
| 10+ | 15 |

Only one of the 40 attacks targeted a package name shorter than 5 characters, so a minimum package name length of 5 for such a rule seems sensible.
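The trade-off can be checked directly against the distribution above. The snippet below is a quick sketch: `share_covered` is a hypothetical helper, and the "10+" bucket is stored under the key 10.

```python
# Known typosquatting attacks, keyed by the length of the original
# (targeted) package name. The key 10 stands for the "10+" bucket.
attacks_by_length = {4: 1, 5: 3, 6: 10, 7: 6, 8: 3, 9: 2, 10: 15}


def share_covered(min_length: int) -> float:
    """Fraction of known attacks whose target name has at least min_length characters."""
    total = sum(attacks_by_length.values())
    covered = sum(
        count
        for length, count in attacks_by_length.items()
        if length >= min_length
    )
    return covered / total


for min_length in (4, 5, 6, 7):
    print(min_length, f"{share_covered(min_length):.1%}")
```

A minimum length of 5 would still cover 39 of the 40 known attacks (97.5%), while a minimum of 6 would cover 36 (90%).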

The min(ceil(n / 5), 3) formula is also an interesting approach.

What do you think, @pypa/warehouse?