pypi / warehouse

The Python Package Index
https://pypi.org
Apache License 2.0
3.59k stars, 966 forks

Ultranormalization encourages name squatting #11139

Open orsinium opened 2 years ago

orsinium commented 2 years ago

Describe the bug

#10498 introduced "ultranormalization" to prevent squatting of package names similar to ones already registered:

requests.exceptions.HTTPError: 400 Client Error: The name 'l10n' is too similar to an existing project. See https://pypi.org/help/#project-name for more information. for url: https://upload.pypi.org/legacy/

While name squatting is, in general, a major concern for PyPI (and any other big package registry), the implementation has a few painful drawbacks:

  1. It simplifies name squatting. For example, registering the name lili (a French name) also squats many similar names, such as 1111, i111, i11l (which could be a good name for an internationalization package), i-11-l, iiii (4 in Roman numerals), and so on. In total, that is a huge number of combinations; the exact number depends on the maximum name length and on whether you count names such as l-------ll--------l.
  2. It complicates name registration. A few months ago, I started my work on an l10n+i18n package. I always start by picking a name. A quick search on PyPI showed that the name l10n is free. A few months later, I have the package ready but PyPI rejects my upload. How could I know that the name is "taken"? Should I register the name before I have any code ready? Then again, that encourages name squatting.
  3. The name rejection reason isn't clear. It just says that the name is similar to another one. Which one? If I knew, I could claim it as per PEP 541. But just showing the name can't solve the problem completely. Before the change was introduced, multiple packages were registered that wouldn't pass the check today. That means that if PyPI rejects the name l10n because there is a package i10n, claiming the i10n name would only reveal that there is also a package lion, which the user would need to claim as well. How many names should one have to claim to register a single package? And if PyPI showed all similar names, would it be reasonable to allow mass name claiming? Then again, that's not much different from mass squatting. And if I could claim just the name itself without claiming all collisions, wouldn't that defeat the point of the change altogether?
  4. It reduces the pool of available names. This one is similar to the previous points but worth covering. As the Python ecosystem grows, so does the list of registered names on PyPI, and the pool of names available for registration shrinks. Fewer free names mean more frustration, harder name picking, and more awful names. You might know this frustration from trying to register a new account on, let's say, Reddit: all the usernames you have ever used are already taken, and you just start slamming the keyboard trying to find any random combination that works. Now imagine that instead of nice readable package names you have such randomness in your dependency file, imports, and tracebacks. It's important to keep as many nice package names available as possible, and the change works against that. Fewer good names available again means more effective name squatting.
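To make the collision space from point 1 concrete, here is a minimal sketch of the normalization. It assumes the rule is roughly "lowercase, drop `.`, `_` and `-`, map `l` and `i` to `1`, and map `o` to `0`". The names `ultranormalize` and `lookalikes` are made up for illustration and this is not Warehouse's actual code.

```python
import itertools
import re


def ultranormalize(name: str) -> str:
    """Approximate the similarity key (an assumption, not Warehouse's code)."""
    name = re.sub(r"[._-]", "", name.lower())  # separators are ignored
    name = re.sub(r"[li]", "1", name)          # l, i and 1 are treated alike
    return name.replace("o", "0")              # o and 0 are treated alike


def lookalikes(name: str) -> set[str]:
    """Enumerate separator-free lowercase names sharing the key of `name`."""
    options = {"1": "1li", "0": "0o"}
    key = ultranormalize(name)
    choices = (options.get(char, char) for char in key)
    return {"".join(combo) for combo in itertools.product(*choices)}


# Every name from point 1 collapses to the same key as lili.
for candidate in ["1111", "i111", "i11l", "i-11-l", "iiii", "l-------ll--------l"]:
    assert ultranormalize(candidate) == ultranormalize("lili")

# l10n, i10n and lion share one key as well.
assert ultranormalize("l10n") == ultranormalize("i10n") == ultranormalize("lion")

# Three choices for each of the four characters, before counting separators.
print(len(lookalikes("lili")))
```

Under these assumptions a single four-character registration already blocks dozens of separator-free names, and the count grows further once `-`, `_` and `.` variants are included.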

Expected behavior

"What I see is what I get". If there is a package with this name, the name is already taken. You might claim it as per PEP 541 or pick another one. If there is no package with such a name (and it's not in the stdlib), you can use it.

To Reproduce

Try to register the l10n package. Or run test_fails_with_ultranormalized_names from the PyPI test suite.

My Platform

Irrelevant.

Additional context

Irrelevant.

Possible solutions

I understand the motivation behind the change but find that it does more harm than good. So as not to be that person who only complains about things, here are some solutions I see for the problem:

  1. The obvious one: just revert the change. It would solve all the issues I outlined, but the name squatting problem would remain open for further discussion (wouldn't it always stay unsolved anyway?).
  2. Reduce the scope of the change by checking collisions with popular names only. It makes sense to prevent registering names such as djang0, but there is no harm in letting some not very popular or nearly abandoned packages collide. However, PyPI doesn't have a reliable metric of package popularity just yet. The download count is stored separately in BigQuery (and querying it for each name registration could be costly), and even then the metric is pretty unreliable. The GitHub star count is an even worse indication of popularity and is not available for all packages.
  3. Allow registering colliding names but provide a warning in the Web UI that there are packages with a similar name.
  4. Write warnings about registering colliding names into an audit log. IDK if PyPI has any internal audits for this purpose but having one regularly might be a good idea if the registry security matters.
  5. Avoid implementing heuristics on the PyPI side and leave PyPI scanning to the bored (or paid) pentest companies.
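Solutions 2 and 3 could be combined roughly as in the sketch below: reject only when the candidate collides with a popular project, otherwise allow it and attach a warning. Everything here is hypothetical: the `ultranormalize` rule is an approximation, `NameCheck` and `check_new_name` are invented names, and the `popular` set stands in for a popularity metric that PyPI doesn't reliably have.

```python
import re
from dataclasses import dataclass, field


def ultranormalize(name: str) -> str:
    """Approximate similarity key (an assumption, not Warehouse's code)."""
    name = re.sub(r"[._-]", "", name.lower())
    name = re.sub(r"[li]", "1", name)
    return name.replace("o", "0")


@dataclass
class NameCheck:
    allowed: bool
    warnings: list[str] = field(default_factory=list)


def check_new_name(candidate: str, existing: set[str], popular: set[str]) -> NameCheck:
    """Block look-alikes of popular projects, only warn about the rest."""
    key = ultranormalize(candidate)
    collisions = sorted(n for n in existing if ultranormalize(n) == key)
    blocking = [n for n in collisions if n in popular]
    if blocking:
        return NameCheck(False, [f"too similar to popular project {n!r}" for n in blocking])
    return NameCheck(True, [f"similar to existing project {n!r}" for n in collisions])


existing = {"django", "i10n"}   # hypothetical registry contents
popular = {"django"}            # hypothetical popularity list

print(check_new_name("djang0", existing, popular))  # rejected
print(check_new_name("l10n", existing, popular))    # allowed, with a warning
```

The warnings from the allowed case are what solution 3 would show in the Web UI or solution 4 would write to an audit log.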

Sorry for the wall of text. I don't want to fight against your vision of what the project should look like, but I find this particular change harmful for both security and user experience.

orsinium commented 2 years ago

Here I use "name squatting" to indicate two slightly different attacks:

  1. Squatting of names similar to existing projects. The end goal usually is to distribute malware.
  2. Squatting of nice-looking names. Usually, dictionary words or popular brands. The end goal usually is to sell the name later.

If there is a term to distinguish these two, it's not known to me. The difference, however, is quite small, and the goals pursued may mix. Both are somewhat of a concern for a package registry, and both should be approached carefully without sacrificing one for the other.

di commented 2 years ago

Hey @orsinium, thanks for the issue. I think we're unlikely to reverse this policy: this may not be apparent to PyPI users but this has significantly cut down on the creation of malicious packages attempting to similar-squat legitimate project names. It's generally made PyPI safer to use but also means we (PyPI maintainers) can spend less time dealing with these types of packages.

I think there's a few things we can do to make this policy easier to deal with, though:

orsinium commented 2 years ago

I like the last 2 points. Even if the presence of the feature isn't something that can be discussed, there are still ways to improve it:

  1. Show exactly which names the requested name conflicts with. I covered in the issue description why that might be a good idea.
  2. Set a threshold for the allowed text distance. For example, 1111 and l111 (distance 1) are similar but 1111 and lili (distance 4) are completely different names.
  3. Make the threshold adaptive based on the project name. For example, python-dateutll is too similar to python-dateutil while ll and li are two completely different names. Both cases have the same distance of 1, but their lengths (and so the relative differences) are different.
  4. Do not apply the rule to abandoned or unpopular projects. The original change targets name squatting for the purpose of malware distribution, so it makes sense only for names similar to ones that people often use (and so may mistype).
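Points 2 and 3 could look roughly like the sketch below. It uses plain Levenshtein distance and a relative threshold; the function names and the 0.2 cut-off are arbitrary choices for illustration, not anything PyPI implements.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via the classic row-by-row dynamic programme."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + (char_a != char_b), # substitute
            ))
        previous = current
    return previous[-1]


def too_similar(a: str, b: str, max_relative: float = 0.2) -> bool:
    """Compare the distance to the length of the longer name (point 3)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return True  # two empty names are identical
    return levenshtein(a, b) / longest <= max_relative


assert levenshtein("1111", "l111") == 1
assert levenshtein("1111", "lili") == 4

# Same absolute distance, very different relative distance.
assert too_similar("python-dateutll", "python-dateutil")  # 1 of 15 characters
assert not too_similar("ll", "li")                        # 1 of 2 characters
```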

When I was working on dephell, I had an idea to warn users if they try to install a package whose name looks like a mistyped version of a more popular project (https://github.com/dephell/dephell/issues/133). To this day, I still think that allowing packages to have similar names but warning users about it could be a good idea, not least because it allows for an even more aggressive similarity search than the currently implemented ultranormalization.

Matthelonianxl commented 2 years ago

# Copyright (c) 2011 Google Inc. All rights reserved.
# Use of this source code is governed by a BSD-style license that can be
# found in the LICENSE file.

"""Applies a fix to CR LF TAB handling in xml.dom.

Fixes this: http://code.google.com/p/chromium/issues/detail?id=76293
Working around this: http://bugs.python.org/issue5752
TODO(bradnelson): Consider dropping this when we drop XP support.
"""

import xml.dom.minidom


def _Replacement_write_data(writer, data, is_attrib=False):
    """Writes datachars to writer."""
    data = data.replace("&", "&amp;").replace("<", "&lt;")
    data = data.replace("\"", "&quot;").replace(">", "&gt;")
    if is_attrib:
        # Escape CR, LF and TAB so they survive attribute normalization.
        data = data.replace(
            "\r", "&#xD;").replace(
            "\n", "&#xA;").replace(
            "\t", "&#x9;")
    writer.write(data)


def _Replacement_writexml(self, writer, indent="", addindent="", newl=""):
    # indent = current indentation
    # addindent = indentation to add to higher levels
    # newl = newline string
    writer.write(indent + "<" + self.tagName)

    attrs = self._get_attributes()
    a_names = sorted(attrs.keys())

    for a_name in a_names:
        writer.write(" %s=\"" % a_name)
        _Replacement_write_data(writer, attrs[a_name].value, is_attrib=True)
        writer.write("\"")
    if self.childNodes:
        writer.write(">%s" % newl)
        for node in self.childNodes:
            node.writexml(writer, indent + addindent, addindent, newl)
        writer.write("%s</%s>%s" % (indent, self.tagName, newl))
    else:
        writer.write("/>%s" % newl)


class XmlFix(object):
    """Object to manage temporary patching of xml.dom.minidom."""

    def __init__(self):
        # Preserve current xml.dom.minidom functions.
        self.write_data = xml.dom.minidom._write_data
        self.writexml = xml.dom.minidom.Element.writexml
        # Inject replacement versions of a function and a method.
        xml.dom.minidom._write_data = _Replacement_write_data
        xml.dom.minidom.Element.writexml = _Replacement_writexml

    def Cleanup(self):
        if self.write_data:
            xml.dom.minidom._write_data = self.write_data
            xml.dom.minidom.Element.writexml = self.writexml
            self.write_data = None

    def __del__(self):
        self.Cleanup()

domdfcoding commented 2 years ago

Having hit the error message myself I am at a loss as to which name my chosen name is too similar to, despite searching the list of all project names. I could spend all day playing "guess a valid name", but I'd rather not.

jedie commented 2 years ago

What about using the Levenshtein distance?

EDIT: Oh, there are a few issues about "Levenshtein distance": https://github.com/pypi/warehouse/issues?q=Levenshtein+distance ;)

di commented 2 years ago

Yes, we tried that in https://github.com/pypi/warehouse/pull/5001. Unfortunately, it was far too noisy to be actually useful.

thatch commented 2 years ago

Which one? If I knew, I could claim it as per PEP 541.

I agree. I tried to register "checkreqs" or so, and it was considered too similar to an unknown existing project.

I don't see anyone against giving the squatted project name, either here or in #11872, so I should probably just send a PR?
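As a rough idea of what such a change might do, the sketch below looks up every existing project that shares the candidate's similarity key and names them in the error. The normalization rule is an approximation, and `build_index`, `rejection_message` and the message wording are all hypothetical, not Warehouse's code.

```python
import re
from collections import defaultdict
from typing import Optional


def ultranormalize(name: str) -> str:
    """Approximate similarity key (an assumption, not Warehouse's code)."""
    name = re.sub(r"[._-]", "", name.lower())
    name = re.sub(r"[li]", "1", name)
    return name.replace("o", "0")


def build_index(existing_names) -> dict:
    """Group existing project names by their similarity key."""
    index = defaultdict(list)
    for name in sorted(existing_names):
        index[ultranormalize(name)].append(name)
    return dict(index)


def rejection_message(candidate: str, index: dict) -> Optional[str]:
    """Return an error naming every conflicting project, or None if free."""
    conflicts = index.get(ultranormalize(candidate), [])
    if not conflicts:
        return None
    return (
        f"The name {candidate!r} is too similar to existing projects: "
        + ", ".join(conflicts)
        + "."
    )


index = build_index(["i10n", "lion", "requests"])  # hypothetical registry
print(rejection_message("l10n", index))
```

Listing all conflicts at once avoids the "claim one name, discover the next" loop described in the issue.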

SnoopJ commented 1 year ago

Which one? If I knew, I could claim it as per PEP 541.

I agree. I tried to register "checkreqs" or so, and it was considered too similar to an unknown existing project.

I don't see anyone against giving the squatted project name, either here or in #11872, so I should probably just send a PR?

My opinion carries no organizational weight, but I think it would be a nice improvement if PyPI could issue a more specific error message than the current one, and a PR would give the maintainers a very actionable decision to make. +1 from me. This may be easier to track if the other issue is re-opened or if a new issue with a suitably narrow scope is opened, since this issue has other things going on.

(For the sake of context: I ended up on this issue after helping a user in #python on Libera.chat navigate the existing error message, which left them perplexed about what they collided with and what to do about it)