RFC: CAPTCHA or similar for package uploads

ocramz commented 6 years ago

Motivated by recent spammy submissions: https://hackage.haskell.org/package/Facebook-Password-Hacker-Online-Latest-Version-1.0.1 (already reported in https://github.com/haskell-infra/hackage-trustees/issues/132 ).

tfausak commented 6 years ago

It might be nice to do this when you register for an account rather than when you upload a package.

hvr commented 6 years ago

We had something different in mind to kill two birds (including issues like #461 and other issues that haven't been publicly documented as they'd represent exploitable security issues) with one stone here: We're about to finally implement #558 (cabal already made the switch to upload as candidates by default in preparation of this) as part of the upcoming GSOC (if we don't get a student, I intended to implement it myself), and that gives us the ability to force package uploads by new accounts to go via a quarantine/vetting phase (giving us some other benefits like additional QA checks on package metadata and all that). It's been generally a known issue that the combination of a persistent index + unmoderated uploads poses a big risk in terms of DoS and spamming and other policy abuses. But we hoped we'd get the solution in place before spammers became aware of us. I guess we need to accelerate this now.

tfausak commented 6 years ago

Great! What did you have in mind?

tfausak commented 6 years ago

I am replying again because @hvr edited his comment drastically after I replied. It originally said:

We had something different in mind to kill two birds with one stone here...

I see how #558 is a nice quality of life improvement. It keeps the candidate index cleaner. However I don't see how it relates to this issue, other than the fact that it happens to touch on package candidates. Can you elaborate?

that gives us the ability to force package uploads by new accounts to go via a quarantine phase

I strongly oppose forcing all new accounts to wait for someone (presumably a trustee) to approve their package before moving it out of the candidate index. It seems clear to me that such a system would be a bottleneck. Perhaps the number of new users is small enough that it wouldn't matter. I don't know.

I do know that I would be peeved if I had to wait for manual intervention after running stack upload .. Also, as far as I know, other package indexes do not require this type of manual intervention. It might be worth researching how they deal with spam, denials of service, and other such attacks.

And for what it's worth, I personally never use the candidate feature. So that might color my opinion of using it as a solution to this problem.

gbaz commented 6 years ago

other than the fact that it happens to touch on package candidates

Well that's precisely the point. A candidate workflow lets us gate new uploads from untrusted parties. I agree that this would be a slight bottleneck. But the alternative is to not gate things, and then we get low quality uploads, like the spam packages under consideration.

The point about improvement to candidates is to make the workflow painless enough that this is an ok solution.

CAPTCHAs won't help -- this was a manual upload process that we're dealing with, not a bot-driven one.

ocramz commented 6 years ago

I agree with @tfausak; I'm not in favour of having to wait for human intervention for a package upload. For example, I often split libraries into multiple parts (foo becomes foo-core, foo-accelerate, etc.) and import the pieces into a central one, and having to wait for all of this would be very tedious.

What if there were some sort of digital signature mechanism at upload time? A GPG key? Or some sort of invitation mechanism, like on lobste.rs ?

What best practices are there from other programming languages/package ecosystems?

gbaz commented 6 years ago

@ocramz the idea floated here is not for a general-purpose intervention on package upload -- just for the first upload from new accounts. It would have no effect at all on existing Hackage users.
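To make the scope of the idea concrete, here is a minimal sketch of the routing decision as described in this thread: only accounts with no previously approved upload go through the candidate index. The `Account` type, the `approved_uploads` counter, and `route_upload` are hypothetical names for illustration, not hackage-server code.

```python
from dataclasses import dataclass


@dataclass
class Account:
    name: str
    # Hypothetical counter: how many of this account's uploads were already vetted.
    approved_uploads: int = 0


def route_upload(account: Account) -> str:
    """Return the index a new upload would land in under the proposal."""
    if account.approved_uploads == 0:
        # First upload from a new account: hold it as a candidate for review.
        return "candidate"
    # Existing users are unaffected and publish directly.
    return "primary"
```

Under these assumptions, `route_upload(Account("newcomer"))` lands in the candidate index, while an account with any approved upload publishes to the primary index as before.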

gbaz commented 6 years ago

An invitation mechanism is an interesting idea but I think it might gate too strongly -- judging by the low traffic of lobste.rs or also the stories about arxiv accounts. (On the latter, I think it works properly for them, because they're not afraid to overfilter to a degree, while we should really seek to avoid that here.)

hvr commented 6 years ago

Really, the simplest way to vet new accounts is to use their first uploads as a "proof of work" to see if they're able to write original Haskell and produce cabal packages (and I'd say, if a spammer goes to the trouble to learn Haskell just in order to spam Hackage... I'd be impressed... then they almost earned it); this also has the benefit that we can reach out to newcomers early on and help them improve (as part of the Hackage Trustees' mission statement) if we spot problems in their packages before they hit the primary index (in order to avoid problems such as #461, or those typical half-a-dozen release uploads within a single hour which end up in the public primary index and that every Hackage client has to pay the cost for...). This would provide a good return on investment in my opinion.

tfausak commented 6 years ago

I see what y'all are saying. New accounts would have their first package upload quarantined in the candidate index until a trustee approves it. That makes sense and would put a human in the mix, so I can see how it might curb spam packages. I still don't like it, so I'm going to try to poke holes in the idea. Please think of this more as me playing devil's advocate than an attack.

- It would make writing introductory material a little more difficult, and following along with such material would also get a little harder. Authors would have to explain the concept of package candidates and the quarantine. Newbies would have to wait to see their results on the non-candidate index.
- Adversaries could overwhelm the approval process itself by making lots of new accounts and uploading lots of new packages into the quarantine. Trustees would have to sift through the accounts and packages to determine which ones are legitimate.
- Trustees in general would have the difficult task of determining if someone is legitimate or not. For example, both of the current spam accounts (hejirumo and bobo8) uploaded test packages first. Would those test packages have escaped the quarantine? It's easy to look back and say "no way, those are bad", but making that judgement call in general will be difficult.
- Adversaries could first upload a legitimate looking package in order to escape the quarantine, then start publishing spam. This is obviously harder than the current situation, so it would hopefully deter enough people to make it worth it.
- Adversaries would be motivated to take over existing accounts that are not quarantined. That means lots of attempted logins or trawling for credentials. This is made worse by the fact that Hackage uses HTTP basic authentication and API keys aren't widely used. There are probably a lot of Hackage credentials out in the world.
- As a consequence of the previous two points, adversaries could simply buy un-quarantined or hacked accounts. I don't know what the economics of spamming look like, so I can't say if this would be worth it or not.

I am very curious about how other more popular indexes solve this problem. Looking at npm as an example, it seems like they handle this on a case by case basis. Their policy is simple: Ban the user; clean up the mess. Maybe doing something similar is enough for now?

gbaz commented 6 years ago

The problem with "ban the user; clean up the mess" is that it means playing whack-a-mole with spammers. They can create new accounts and upload stuff rapidly with automation. Furthermore, package deletion is deliberately difficult because we essentially want the index to be append-only.

This is made worse by the fact that Hackage uses HTTP basic authentication and API keys aren't widely used.

Nope, digest auth only, for some years now.
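For readers unfamiliar with the difference, here is a minimal sketch contrasting the two schemes. It follows the generic RFC 2617 definitions with MD5 and `qop=auth`; the thread does not say which digest options Hackage uses, so treat those parameters as assumptions. With basic auth the password is only base64-encoded; with digest auth the client sends a hash that depends on a server-supplied nonce.

```python
import base64
import hashlib


def basic_credentials(username: str, password: str) -> str:
    """Basic auth: the password is merely base64-encoded, so it is trivially recoverable."""
    return base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")


def md5_hex(text: str) -> str:
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def digest_response(username: str, realm: str, password: str, method: str,
                    uri: str, nonce: str, nc: str, cnonce: str,
                    qop: str = "auth") -> str:
    """Digest auth (RFC 2617, qop=auth): only a nonce-bound hash goes over the wire."""
    ha1 = md5_hex(f"{username}:{realm}:{password}")
    ha2 = md5_hex(f"{method}:{uri}")
    return md5_hex(f"{ha1}:{nonce}:{nc}:{cnonce}:{qop}:{ha2}")
```

The practical difference is that a captured digest response cannot be replayed against a fresh nonce and does not directly reveal the password, whereas captured basic credentials do.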

Adversaries could overwhelm the approval process itself by making lots of new accounts and uploading lots of new packages into the quarantine.

This is true, but it basically just means the DoS threat moves from the current DoS-able situation to a place where it's at least somewhat less dangerous.

jkachmar commented 6 years ago

I feel like this constitutes a UX regression relative to other programming language ecosystems. (EDIT: "this" being the manual vetting process, not the reCAPTCHA proposal)

In my opinion, as a consumer of packages from Hackage with an interest in contributing to the ecosystem somewhere down the line, there should be an extremely compelling reason for Hackage to diverge from the common case here.

What makes Haskell's ecosystem so different from others that this particular plan of action is a better fit?

tfausak commented 6 years ago

Sorry, I didn't mean to misrepresent Hackage's authentication method. I'm not crazy about HTTP digest auth either, but it's not a problem. I should have said that users can interact with Hackage over unsecured HTTP, which means they are susceptible to MITM attacks.

Looking into npm more, it seems like they have an automated system for detecting spam. When that system fails, I can only assume they fall back to the "ban, clean up" policy. This incident report gives a peek behind the curtain:

We have developed systems to analyze package contents as they are published, as well as flag users with problematic posting habits or associations with previously detected spam. These flags are posted in a Slack channel for review by npm support staff. We then take a closer look at the details of why the user or package was flagged, and, when we feel it is appropriate, remove it from the registry.

They go on to say that they have removed "many thousands of [users and packages] in the last few months". Hackage is about 50 times smaller than npm (according to Module Counts). So that might give an estimate for how much spam Hackage may have to deal with.

Perhaps this is something Facebook, Simon Marlow, and Haxl could help with 😄

tfausak commented 6 years ago

So, for context, a temporary measure has been put in place: New Hackage users do not get package uploading rights by default. They must ask a Hackage admin (or trustee?) to give them the rights. This Reddit thread contains links to the relevant stuff on Reddit, GitHub, and the mailing list.

With that temporary measure in place, we are currently in the quarantine situation, more or less. New users can't upload anything. If they want to upload something, a human has to get in the mix. The process and UI for this is subpar, but the overall result is similar to the quarantine proposal.

I want to take a moment to thank the Hackage admins and trustees for responding quickly to this problem, potentially in their off hours.

That being said, I am not happy with the quarantine proposal, nor the temporary solution put in place. What they have shown is that it's very easy to disrupt Hackage. Five spam packages have effectively broken Hackage for new users.

Even with the quarantine proposal in place, eventually some spam packages will slip through. And eventually at least one of those spam packages will contain content that is illegal rather than simply unsavory. In that situation it seems like Hackage will be forced to delete the package. Is that accurate?

Assuming the previous paragraph is accurate, I think it's reasonable to extrapolate that the right course of action in situations like this one is also to delete the offending packages. I understand that Hackage is supposed to deliver an append only log, but that seems to me like a model that doesn't match reality.

To belabor my own point: In some situations, deletion is the only answer. Since deletion will need to be supported anyway, it seems like a better tool for solving this problem than the proposed quarantining of new accounts.

ghuntley commented 6 years ago

I am also strongly opposed to forcing all new accounts to wait for someone to approve. It's the wrong optics and adds burden on folks we should not be adding burden to.

A simple PR to the relevant parsers and templates that adds nofollow to links will eliminate the rewards spammers currently receive from this behaviour. Then energy can be focused on stopping them from getting in (much harder problem).

gbaz commented 6 years ago

"Five spam packages have effectively broken Hackage for new users."

No. They've put a slight kink in the new account registration process. The registration request page has been updated to inform people of the new workflow: http://hackage.haskell.org/users/register-request

This is not new behavior, in fact, but rather the way things were set up when hackage2 was first deployed. (And hackage1 had an entirely manually-driven registration process). I agree it is suboptimal, but I don't know of a better interim solution than the one now in place.

We should work out details on the quarantine proposal and have a separate discussion on it that goes through the pros and cons once it is further fleshed out.

As for how to deal with existing bad packages, that should be discussed elsewhere -- cf.

https://github.com/haskell/hackage-server/issues/112 https://github.com/haskell/hackage-server/issues/201

tfausak commented 6 years ago

You are right that there is a paragraph explaining the new process on the registration page. I wouldn't expect a user to notice it, but perhaps I'm wrong. On the off chance that they don't notice it, when they try to upload a package they'll land on a page that only says:

Forbidden No access for this resource.

At that point, I am presuming that Hackage would seem broken to them. Sure, it's their fault for not reading the entirety of the registration page, but that's a pretty weak cop out in my opinion.

Knowing how Hackage used to work is good for historical context. But we're talking about how it works now and how it should work in the future. I have used many package repositories and none of them required human intervention when creating new accounts or uploading new packages. I feel strongly that Hackage should behave similarly to other repositories in this regard.

I apologize if I am commenting on the wrong issue. I thought this was the right place to discuss how to prevent spam packages from being a problem on Hackage. #112 is for removing packages in general. #201 is for hiding deprecated packages. #461 is about figuring out what to do with valid but potentially useless packages. #558 is about improving the candidate package workflow. haskell-infra/hackage-trustees#132 is about dealing with the specific spam packages we saw recently. My understanding of this issue is that it's a place to talk about ways to prevent spam packages from being uploaded (where one solution is the quarantine proposal) and also what to do when spam packages are uploaded if the prevention fails.

I don't like the statement that we need to "work out details on the quarantine proposal". It assumes that the quarantine proposal is the only or best way forward. Even without knowing all the details, I'm opposed to that.

I can see how adding rel="nofollow" to user generated links might dissuade spammers from targeting Hackage. That seems like a good thing that should be done, but I doubt it would completely prevent spammers.

I apologize for this long comment. I wanted to address everything that was said. However, my main point is still this:

In some situations, deletion is the only answer. Since deletion will need to be supported anyway, it seems like a better tool for solving this problem than the proposed quarantining of new accounts.

In my opinion, a better interim solution is deleting the spam packages. It's obvious that they were all uploaded manually, probably by one person. The volume of spam is likely to remain low for the near future. The spam packages really only show up on the recent additions page. They don't negatively affect Hackage too much if they are left up for a few hours (or even days).

If you only reply to one part of my comment, please reply to this: Do you think that deleting these spam packages is reasonable? If not, why not?

hvr commented 6 years ago

I don't have time to go into details, but I'll quickly say: Having some level of support for package deletion is on the roadmap (we need something like that for exceptional cases, e.g. DMCA), but it cannot and won't be the primary/standard mechanism of dealing with it as it is neither economical nor effective. Moreover, for other related technical reasons (I'll leave it to @gbaz to explain) we won't be able to delete packages for several months from now on; this alone rules this out as an "interim solution" but at the same time makes the other proposed scheme which attacks the problem closer to the origin appear like a more viable path and more likely to be implementable in a reasonable time-frame (unless it gets vetoed or otherwise delayed...).

gbaz commented 6 years ago

I didn't realize you were proposing just deletion as an alternative to other things, and not just a complement to clean things up when all else fails.

No, deletion is insufficient. While the volume of spam was low on the one day it started to occur, we have no guarantee it would remain low, and no mechanism to keep it from increasing. Further, as more packages were deleted, who is to say things would not escalate and spammers would not just start redoubling their efforts -- spam itself can be automated very easily but spam detection can only be semi-automated. That asymmetry has a lot of consequences.

From years of trying to prevent spam on the haskell wiki, I have learned by difficult experience that unless you have an army of volunteers and automated tooling, detection-and-cleanup after the fact just doesn't suffice -- involving a human at some point in account verification is by far the solution with the least overhead. So the question to me is just "what is the lowest impact way to accomplish this."

gbaz commented 6 years ago

(Also, yes, deletion-from-the-index requires breaking the incremental-download chain and releasing a new index, which is signed from-the-start again, which is burdensome to clients. And furthermore: the logic in the hackage-security library for dealing with this was broken, as we discovered recently, and as long as downstream cabal-install clients haven't upgraded to the latest [which is not yet released], index-rewrites [always expensive, always to be avoided] are actually entirely out of the question.)

As for what "this ticket" is meant to discuss: feel free to discuss general approaches here. However, I don't think it is correct to dismiss any sort of "new user quarantine" approach when the details of such an approach haven't been worked out yet -- the pros and cons of proposals are in the concrete.

tfausak commented 6 years ago

Here's how I can be opposed to quarantining without seeing a concrete proposal:

- No other package registry that I'm aware of uses such a system. If other registries can work without a quarantine, it should be possible for Hackage to do so too, at least theoretically.
- Package candidates are listed, just like regular packages, on the package candidates page. Candidates also have description pages, just like regular packages. For example, this garbage package has a page on Hackage. And it's indexed on Google!
  - I feel like I should dig into this point a little more. The fact that package candidates are indexed means that they're just as useful to spammers as normal packages. A spammer could create a bunch of new accounts, upload a bunch of spam packages, and not care at all if the accounts are eventually denied and the candidates deleted. The damage has already been done.
  - It is of course possible to make the candidate packages less appealing, either by using robots.txt or rel="nofollow" or something else. But if that's the solution for candidate packages, why not do that for regular packages?

I also want to clarify my suggested fix to this problem. I think that deleting spam packages and banning users (which I think for Hackage means removing them from the package uploaders group) is an adequate response to spam. Furthermore it's a necessary response to both illegal content and content that's costly to distribute. (Imagine that someone uploads a multi-gigabyte package. Hackage shouldn't distribute that.)

With such a system, finding spam is still a problem. Two solutions come to mind, and I think that both are appropriate:

- Have some code that scans new packages and scores them for how likely they are to be spam. For example, perhaps they contain a lot of outbound links, or they use certain key words like "cheats".
- Allow users to report spam (directly on Hackage, without making an issue on GitHub). This could work like the moderation queue on Reddit: Users flag stuff, super users review the flag.
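As a rough illustration of the first idea, here is a minimal scoring sketch. Everything in it is an assumption made for illustration: the weights, the threshold, and the keyword set (only "cheats" comes from this comment; a real list would need curation). It is not an existing Hackage feature.

```python
import re

# Hypothetical keyword list; only "cheats" is taken from the discussion.
SUSPICIOUS_WORDS = {"cheats"}
URL_PATTERN = re.compile(r"https?://\S+")


def spam_score(description: str) -> int:
    """Score a package description: more outbound links and keywords mean a higher score."""
    links = len(URL_PATTERN.findall(description))
    words = re.findall(r"[a-z]+", description.lower())
    keyword_hits = sum(1 for word in words if word in SUSPICIOUS_WORDS)
    # Arbitrary illustrative weights: 2 per link, 3 per suspicious keyword.
    return 2 * links + 3 * keyword_hits


def needs_review(description: str, threshold: int = 5) -> bool:
    """Flag the package for a human moderator instead of rejecting it outright."""
    return spam_score(description) >= threshold
```

A score like this would only feed a review queue; the final call would still rest with a human, which is the point of the second idea.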

This solution feels similar to your description of the Haskell wiki, so I am interested to hear how that ended up with account verification. Can you elaborate?

gbaz commented 6 years ago

Relatedly, investigation of what other package registries do should involve seeing if their solutions appear to be working: https://thenewstack.io/npm-spam-cleanup-briefly-zaps-legit-software-packages/

Note that the proposed way forward for npm basically involves having a for-profit company pay somebody essentially full-time to just combat spam.

In any case, let me propose a change in how we think about this discussion. We should not be saying "let's do X or let's do Y." We should be looking at a sum of measures, each with costs and benefits, and trying to use a combination of them. Fighting spam and abuse of our systems is not a binary proposition -- we can deploy lots of measures to improve things to some degree, and each has a cost.

So I agree it would be good to have a "moderation queue" only for "suspicious" packages detected by some sort of filter. (Where uploads by newer authors would be inclined to get higher scores, and older, more trusted authors have lower scores). Somebody would need to write the code for spam detection and scoring. And somebody would need to write the code for the queue. But if someone wanted to take that on, I think it would be a welcome PR.

I also agree that it would be good to have nofollow added everywhere possible -- as I mentioned elsewhere that would be a welcome set of PRs. I say set because there are three places to add it -- package description rendering, haddock doc rendering, and markdown rendering.
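To show what such a change amounts to, here is a language-neutral sketch in Python (hackage-server itself is written in Haskell, and this is not its code). The hypothetical `add_nofollow` helper post-processes rendered HTML with a regular expression, which is a simplification: it leaves tags that already carry a `rel` attribute untouched and would mishandle a `>` inside an attribute value.

```python
import re

# Matches an opening <a ...> tag and captures its attributes.
A_TAG = re.compile(r"<a\b([^>]*)>", re.IGNORECASE)
HAS_REL = re.compile(r"\brel\s*=", re.IGNORECASE)


def add_nofollow(html: str) -> str:
    """Add rel="nofollow" to every anchor tag that has no rel attribute yet."""
    def fix(match: re.Match) -> str:
        attrs = match.group(1)
        if HAS_REL.search(attrs):
            return match.group(0)  # simplification: keep an existing rel as-is
        return f'<a{attrs} rel="nofollow">'

    return A_TAG.sub(fix, html)
```

In practice it would be more robust to set the attribute where each renderer generates the link, which is why three separate places are listed above.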

I also agree that it would be good to exclude candidate pages from indexing using robots.txt. There is already a ticket for this in fact: https://github.com/haskell/hackage-server/issues/504. Again, a PR would be welcome. (In that case to move the candidate page locations so robots.txt can exclude them properly).
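The reason the page locations matter is that robots.txt rules match on path prefixes. A minimal sketch, assuming candidate pages were moved under a hypothetical `/candidates/` prefix (the actual layout is up to whoever writes the PR), checked with Python's standard `urllib.robotparser`:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, assuming candidates live under their own prefix.
ROBOTS_TXT = """\
User-agent: *
Disallow: /candidates/
"""


def is_crawlable(path: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if a well-behaved crawler may fetch the given path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("*", path)
```

With a single prefix rule, every candidate page is excluded while regular package pages stay indexable. Note that robots.txt only affects crawlers that choose to honor it.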

So that gives us three PRs that would independently be useful and good (at least). Maybe if we had all these measures, then yes, there could be a discussion if they sufficed. But we do not have these measures yet. We do have in the codebase a relatively straightforward way to manage granting of account permissions. So our approaches to spam mitigation for the time being are necessarily going to be shaped and structured by the tools we have available.

Speaking of which, here is another PR somebody might like to write, that would help make the current situation more explicit -- the welcome email sent to people when they register accounts could remind them that they need to request to be added to package uploaders. (And also the landing page when they click the link to confirm their email). Here is a way to make that even better: add an admin-configurable flag that changes all the documentation at once and also changes if people are auto-added to uploaders at once. That way, if we do decide we want to turn this off, we have a quick way to go into "lockdown" again if need be, in the face of an incoming spam attack.
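A minimal sketch of how such a flag could tie the two behaviors together, so that the permissions and the wording never drift apart. The `Registry` class, the `auto_add_uploaders` flag, and the message texts are all hypothetical and only illustrate the shape of the idea.

```python
from dataclasses import dataclass, field


@dataclass
class Registry:
    # Hypothetical admin-configurable flag; False corresponds to "lockdown".
    auto_add_uploaders: bool = True
    uploaders: set = field(default_factory=set)

    def register(self, username: str) -> str:
        """Register a user and return the welcome text matching the current mode."""
        if self.auto_add_uploaders:
            self.uploaders.add(username)
            return f"Welcome, {username}. You can upload packages right away."
        return (f"Welcome, {username}. Please ask an admin to add you to the "
                "package uploaders group before uploading.")
```

Flipping the one flag changes both who gets upload rights and what new users are told, which is the "quick way to go into lockdown" described above.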

(regarding "Imagine that someone uploads a multi-gigabyte package. Hackage shouldn't distribute that." -- by the way, we do have size limits on package uploads. They are rather generous, but they stop well short of that :-))

tfausak commented 6 years ago

Great! I feel like we've landed in a spot where we all more or less agree. Just to make sure we're all on the same page, I'm going to say things back to you in my own words to make sure it sounds good to you:

- Hackage already has the candidate workflow, and plans are already in place to further develop that workflow, so any "official" work in the near future is likely to focus on that.
- PRs are of course always welcome. In particular, four have been proposed that would likely be merged quickly:
  1. Add some way to rate the suspicion level of packages. And presumably some way to make admins/trustees aware of that rating.
  2. Add rel="nofollow" to all user-generated links in package descriptions and module documentation.
  3. Move all candidate stuff into its own top-level namespace like /candidates. Then add a robots.txt rule that excludes that namespace from indexing.
  4. Add wording about the quarantine situation to the welcome emails.
- Dealing with (that is, deleting) illegal or gigantic packages is a lower priority and will be addressed after the candidate feature is fully baked. This comment notes that deleting packages from Hackage currently seems possible, so I'm still a little unclear on this one. I'm assuming that deleting packages is either tediously manual or not well supported by tooling. I would appreciate some further clarity on this, in particular why the spam packages were not deleted.

Thanks for taking the time to hash this out with me! I know that fielding suggestions from "armchair developers" is difficult. I recognize that I don't have any particular expertise with the Hackage codebase, so I was trying to share my opinions about how a package registry should behave in the abstract. I don't mean to throw a big queue of work your way, and I hope I haven't come off like that.

One last question I have is this (and I apologize if this isn't the right venue for it): How long will the temporary quarantine situation be in place? In other words, when will Hackage return to the free for all account registration process?

gbaz commented 6 years ago

This comment notes that deleting packages from Hackage currently seems possible

No. It notes that we can delete the tarball. That does not remove the package from the index, which is distinct from the physical storage of individual tarballs.

We cannot delete packages from the index at this time. There is a ticket for making deprecated packages more fully hidden. (I linked to it upthread). To delete a package means rewriting the index. To rewrite the index means that you need a full instead of incremental download. Currently, released versions of cabal will not fall back properly to a full download, and they will instead break. That bug has been fixed, but it is not yet in a widely released version of cabal-install. As such, deleting packages from the index is at least a year out.
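A toy model may help explain why deletion forces a full download. It treats the index as a plain list of entries and lets a client update incrementally only if its local copy is still a prefix of the server's index. This is a deliberate simplification: the real mechanism in hackage-security involves signed metadata and an index tarball, and the names below are hypothetical.

```python
import hashlib


def prefix_digest(entries: list, count: int) -> str:
    """Hash the first `count` entries, standing in for a checksum of the index prefix."""
    digest = hashlib.sha256()
    for entry in entries[:count]:
        digest.update(entry.encode("utf-8") + b"\n")
    return digest.hexdigest()


def update(local: list, remote: list) -> tuple:
    """Return (new_local, mode), where mode is "incremental" or "full"."""
    known = len(local)
    still_a_prefix = (known <= len(remote)
                      and prefix_digest(remote, known) == prefix_digest(local, known))
    if still_a_prefix:
        # Append-only case: fetch only the entries added since the last update.
        return local + remote[known:], "incremental"
    # The index was rewritten (for example, an entry was deleted): start over.
    return list(remote), "full"
```

In this model, appending `c-1.0` to an index of `a-1.0` and `b-1.0` is an incremental update, while deleting `b-1.0` makes the local prefix mismatch and requires the full fallback. That fallback path is the one described above as broken in released versions of cabal-install.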

How long will the temporary quarantine situation be in place?

Until we have a sufficient collection of other mechanisms to replace it. I have no estimate on how long that will be.

haskell / hackage-server

RFC: CAPTCHA or similar for package uploads #685