rubygems / rubygems.org

The Ruby community's gem hosting service.
https://rubygems.org
MIT License
2.31k stars 916 forks source link

Consider relaxing levenshtein distance rules #2058

Closed NuckChorris closed 4 years ago

NuckChorris commented 5 years ago

I just tried to push a gem named strait (named for a narrow, defensible waterway!), but was told that it is too close to typo-protected gem named train.

This is pretty far away from typo territory. For any short, memorable name (in English especially) it's going to be rather difficult to not be within a close levenshtein distance of an existing gem.

Are typo-squatters an actual real issue? Does anyone accidentally typo gem 'r4ils'? The download counts for all the typo-squatters are... pretty low? I don't have access to the Rubygems database but I could imagine those 200 downloads are somebody downloading all gems in existence or the typo-squatter testing their own typo-squat. If we're absolutely certain this is necessary, then Damerau-Levenshtein might better capture "typos" with a lower threshold, since it supports transpositions.

As it is, it's pretty clear from glancing over the list in #2037 that the current rule is overzealous: it has a minimum 31% false-positive rate! The "legitimate" ones which would be caught by this filter are almost 1% of all total gems. There's clearly-legitimate things like monet vs mongo or samlr vs haml.

But seriously, as somebody who does a lot of similarity-based text processing over larger, messier databases than this: please don't try to automate this. I always come to regret it, every time. Without fail. And adding "popularity checks" has only ever made things worse, not better.

Alternative solutions:

  1. Human reporting! Let them pick a reason from a dropdown, make it two clicks, don't bury it in a link pile like it is right now. I'm down to help on this, it's pretty clear that we need something better than "Post to this feedback forum which is filled with spam and gets 4 posts a month". I know it's hard to do support on a volunteer basis. Let's make it easier!
  2. Publish a feed of recently-created gems with easy reporting links — wikipedia does this and it works great!
indirect commented 5 years ago

@NuckChorris sorry you ran into this! It is, in fact, a serious problem, and we regularly receive emails from developers either saying things like "oh no! I just noticed that rials will root your machine on installation" or things like "oh no! I typoed while installing rails and later discovered my machine has been rooted for a while".

The current rule is overzealous, and we plan to work to tune it. Waiting for people to notice that they have already been rooted by typo-based malware doesn't feel like a viable answer at this point.

indirect commented 5 years ago

(And to be clear, it's less support work for us to whitelist your gem, and everyone else's gem, than it is to keep allowing typo gems.)

NuckChorris commented 5 years ago

Definitely agree that transposition is a reasonable case, but I think perhaps Levenshtein is just a poor fit. It's just a pure mathematical concept of language, not a real understanding of the linguistics at play here

Could we modify Levenshtein-Damerau to weight additions, deletions, transpositions, and substitutions differently? I feel like when you account for English, additions just make new words while transpositions/deletions are the main source of typos, and substitutions are the secondary source

indirect commented 5 years ago

Yes, I think something with higher weight on transpositions, and something with higher weight on similar beginnings would be a better fit!

The paper A Comparison of String Distance Metrics for Name-Matching Tasks seems likely to have useful suggestions on how we could improve our matching; I was planning to try to apply it at some point in the future. If you're a subject-matter expert and interested in helping, you're probably a better person for the job than me. :)

NuckChorris commented 5 years ago

I'm afraid my experience is more in throwing these algorithms at the wall and see what works best for cleaning up data, not really in the academic side of it...

As for the common prefix, Jaro-Winkler might be a good fit since it puts heavier weights on the beginnings and can prefilter in the database

rubyFeedback commented 5 years ago

I sort of agree with NuckChorris in the sense of usability issue versus security/protection issues.

I think typo-squatters is an issue but a somewhat minor one overall; the larger situation is how to deal with malicious folks in general. JavaScript has become quite famous in this regard (as a negative example).

While the above example is indeed problematic (strait to be assumed to be too close to train), I think the general aim to make the ruby ecosystem better is a laudable approach - just the overall usability should also be considered.

To the topic of levensthein distance - I am not really good with algorithms but the levensthein distance is used a lot all over everywhere; e. g. sequence alignment in bioinformatics. Third top downloaded gem is diff-lcs (as can be seen from https://rubygems.org/stats leading to https://rubygems.org/gems/diff-lcs); while this does not directly have a lot to do with levensthein as such, it is ultimately quite a similar task to do: to find subpatterns within two strings (e. g. equality or measuring differences). I assume this is quite common in informatics/programming in general, due to real problems - or causing real problems, too, as can be seen here. :D

yb66 commented 5 years ago

Why not push responsibility onto the installer? (I mean human and Rubygems/Bundler etc) If a check is run on install that says:

Did you know that these gems have names that are very close to others?

strait - did you mean "train"? rials - did you mean "rails"?

If this is what you meant then please run install again with the --ignore-typoes switch

No need for centralised whitelisting, everyone can upload their gems, everyone gets to make sure they're downloading the correct gems. The checker can even be over zealous, or zealosity (heh) could be a setting, even the whitelists could be handled that end.

/2pence

indirect commented 5 years ago

@yb66 that's too late. by the time a malware gem is installed, it's already run its payload and the computer is already compromised. telling someone after the fact that they made a typo and now their computer needs its hard drive wiped is not something we're okay with as a solution, sorry.

yb66 commented 5 years ago

@indirect

Fetching gem metadata from https://rubygems.org/.........

Very first line of every install phase, so the payload doesn't even need to be downloaded, let alone run, it just needs for the metadata to be checked for typos.

I'd appreciate it if you dropped the supercilious attitude from your replies to me in future, especially when you've come up with a horrendously poor rendering of what I proposed. Here's the contributor convenant, if you need a reminder.

https://github.com/rubygems/rubygems.org/blob/master/CODE_OF_CONDUCT.md

yb66 commented 5 years ago

In my annoyance I ran bundler install instead of gem install so that quote doesn't apply but the idea is the same.

I hope you get the consideration your issue deserves and isn't getting, @NuckChorris.

I'm out.

indirect commented 5 years ago

Sorry for making you feel condescended to, that definitely wasn’t my intention.

The phrase “why don’t you just”, combined with a proposal that sounds like it would require a new UX, lots of work, and only help the tiny fraction of users who update their clients quickly, is not a viable solution to this ecosystem problem. If you’d like to offer a more fleshed out proposal, we have an RFC repo for that.

As a side comment telling the maintainers that you know better than us what we should be doing, your suggestion and response to feedback comes across as not actually interested in helping the situation for everyone involved.

yb66 commented 5 years ago

@indirect Firstly, I didn't use the phrase "why don’t you just". Even if I did, it's normal speech, not an invitation for condescension. It's a discussion.

Secondly, I don't accept your apology because your intentions were quite clear, from your first response and with this "As a side comment":

your suggestion and response to feedback comes across as not actually interested in helping the situation for everyone involved.

This is downright disrespectful, again. I presented an idea of what I believe to be a viable solution. You may disagree but I've yet to see anything from you to suggest otherwise that is valid.

NuckChorris commented 5 years ago

Woah, could we all take a step back? I think nobody intended to come across rudely: @yb66 just suggested that maybe a better solution would be clientside, @indirect misunderstood and thought they meant a warning (not an error!), and things escalated from there...

I do think the clientside solution would seem to solve this quite nicely, but the issue of old clients is definitely a problem.

I was thinking "well maybe we just do server-side blocking for old clients" but that presents a whole new set of problems with "disappearing" gems. This isn't an easy problem to solve, and we should all calm down and acknowledge that none of us have the perfect solution and work together to figure out a better long term solution!

(in the nearer term @indirect, I'd love if you'd whitelist strait for me, I never got a response on the rubygems.org support system! Do you need more help there?)

yb66 commented 5 years ago

@NuckChorris That's very diplomatic but, as a professional, do you allow other professionals to defame you in public? There's a code of conduct and this project receives money, it's not somebody's weekend hobby.

I wish you all the best with this but if client-side checking is too radical an idea then I don't hold out much hope.

indirect commented 5 years ago

@NuckChorris thanks for following up. I've opened a PR to create an exception list and add strait to it at #2092.

@yb66 I am sincerely apologetic, both for misunderstanding your suggestion and for making you feel condescended to. If you'd like to follow up on your client-side filtering idea, please open an RFC for the team to consider. This ticket isn't the right place to evaluate that kind of proposal. Thanks.

tcd commented 4 years ago

What is currently the correct way to go about requesting that a name be whitelisted?

NuckChorris commented 4 years ago

@tcd you might open a PR adding your project name to the list ;)

tcd commented 4 years ago

@NuckChorris I don't think there is an actual list in the repo. Looks like as of pull #2116, names are added directly to the database using the gemcutter:typo:exception rake task.

dwradcliffe commented 4 years ago

@tcd please open a help ticket on https://help.rubygems.org

tcd commented 4 years ago

@dwradcliffe Opened one four days six days a week 10 days ago.

tcd commented 4 years ago

Resolved my problem in a separate issue, sorry for cluttering up this issue with that.

rietta commented 4 years ago

This discussion appears to be moot considering that this protection was disabled and will soon be replaced with an alternative approach of PR #2341

sonalkr132 commented 4 years ago

Yes, we are not using Levenshtein distance for typo protection anymore. Sorry about the hassle everyone. It was a very popular request and we had multiple issues (and PRs) for implementing this. We wouldn't have known about its issue unless we tried. Possibly we could have disabled this sooner #2253