jupyterhub / mybinder.org-deploy

Deployment config files for mybinder.org
https://mybinder-sre.readthedocs.io/en/latest/index.html
BSD 3-Clause "New" or "Revised" License

Blocking ingress from datacenters #1828

Open minrk opened 3 years ago

minrk commented 3 years ago

We've seen a recent increase in abuse activity originating from datacenters (mostly AWS). We can use network policies and/or the nginx block-cidrs configuration. This is fairly easy because we can get datacenter CIDRs from the various hosting providers, as done here.
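For illustration, the providers' published range files are enough to build such a list. Here is a minimal sketch (the AWS and GCP URLs are their real published range files, but this is not the exact script used in this repo, and the one-CIDR-per-line output format is just an example):

```python
# Sketch: build a block list of datacenter CIDRs from the JSON range files
# that AWS and GCP publish. Output format is illustrative only.
import requests

def aws_cidrs():
    data = requests.get("https://ip-ranges.amazonaws.com/ip-ranges.json").json()
    return sorted({p["ip_prefix"] for p in data["prefixes"]})

def gcp_cidrs():
    data = requests.get("https://www.gstatic.com/ipranges/cloud.json").json()
    return sorted(p["ipv4Prefix"] for p in data["prefixes"] if "ipv4Prefix" in p)

if __name__ == "__main__":
    for cidr in aws_cidrs() + gcp_cidrs():
        print(cidr)
```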

The catch

So we can pretty easily block ingress from data centers, but this will have the following effects:

Possible ways around

Short-term partial solutions

We can actually block AWS right now, since it happens to be the biggest source of bad actors and also we don't happen to have a federation member on AWS. This may be the best short-term balance of fixing things vs spending time figuring stuff out.

manics commented 3 years ago

How easy is it to get the percentage of all launches originating from datacenters?

Instead of a hard block in Nginx could we put the IP blocking in BinderHub, and give the user a nice 403 error message? Optionally with support for a header or request parameter to identify traffic from the federator or other allowed source (security by obscurity since someone could figure it out, but might be enough?).
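A rough sketch of what such a BinderHub-side check could look like, using the standard-library ipaddress module (the handler wiring, header name, and variable names are assumptions for illustration, not BinderHub's actual API):

```python
# Illustrative application-level block: reject launches whose client IP falls
# inside a blocked datacenter CIDR, unless an allow-listed token is presented.
from ipaddress import ip_address, ip_network
from typing import Optional

blocked_networks = [ip_network(c) for c in ("3.0.0.0/8", "35.180.0.0/16")]  # example CIDRs
allowed_tokens = {"token-shared-with-federation-and-ci"}  # example token

def should_block(remote_ip: str, token: Optional[str] = None) -> bool:
    if token in allowed_tokens:
        return False
    addr = ip_address(remote_ip)
    return any(addr in net for net in blocked_networks)

# In a request handler this could turn into a friendly 403, e.g.:
#     if should_block(self.request.remote_ip, self.request.headers.get("X-Binder-Token")):
#         raise HTTPError(403, "Launches from datacenter IPs are not allowed")
```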

minrk commented 3 years ago

Excellent point re: implementing it in BinderHub. That also makes the federation problem go away because we can simply skip the check in the health endpoint.

Similarly, if we find we need to, we can use an auth token in secrets for our deployment tests.

We can't as easily block traffic to the Hub as we can with BinderHub, but since we are more concerned with folks abusing free compute resources as opposed to attacking other users' sessions, cutting them off at BinderHub should be good enough.

It's certainly more costly to execute the checks at that level, but we aren't talking about DoS-scale attacks, just a few disrespectful individuals taking advantage of a free service.

minrk commented 3 years ago

> How easy is it to get the percentage of all launches originating from datacenters?

Not super easy, since we don't store IP addresses. We could load the datacenter map and record the datacenter name in the binder analytics archive, but that would only apply going forward.

minrk commented 3 years ago

https://github.com/jupyterhub/binderhub/pull/1262 implements this at the BinderHub level, which I think is going to work better for us.

I think we still want most of #1829 in this repo for fetching the datacenter CIDRs, but the shape of loading the CIDR list will be a little different if we apply them at the BinderHub level.

minrk commented 3 years ago

With #1829 and #1848, we should now be blocking requests from AWS and GCP. Azure isn't blocked because our own tests run on GitHub Actions, which in turn runs on Azure and would otherwise be blocked. At least we got confirmation that blocking works!

I think we need some kind of token auth to allow specific requests from blocked IP ranges, which we could put in GitHub secrets.

ellisonbg commented 3 years ago

We have started to get reports from JupyterLab/binder users who are unable to use binder when 1) running VMs on AWS or 2) running VPNs that are deployed on AWS. Here is one such report:

https://github.com/jupyterlab/jupyterlab/pull/9622#issuecomment-800617800

I also ran into this myself last week as I am employed by AWS and my laptop has a VPN that is (no surprise) deployed on AWS.

betatim commented 3 years ago

For the moment the fact that VPNs with endpoints at one of the big cloud providers are blocked is a feature, same for not allowing people to connect from cloud VMs (this was the abuse scenario which triggered the blocking of cloud provider IPs).

We discussed the "but my VPN ends at AWS" scenario and concluded that most commercial VPNs (i.e. the ones with the largest number of users) probably work hard to have exit IPs that are not associated with cloud providers, since that would make them easy to block (a main use case of these VPNs seems to be circumventing geo restrictions of streaming providers). This does mean we deliberately decided to block those who use corporate VPNs (I think many of those end at a cloud provider), those who run their own VPNs (mine ends at a cloud provider), as well as people who use cloud VMs as "remote desktops". Now we are waiting to see how many people complain, so it is good to get these reports.

I think we need a way to make exceptions, or to let people do something to unblock themselves, as well as a way to keep our CI services from getting blocked. But it isn't clear to me what to do (accounts? proof-of-humanness? white lists?) that is both effective at blocking abusers and doesn't overwhelm the operations team :-/

PS: I tried to find our discussion about who'd be the casualties of blocking, but I couldn't. It is somewhere in one of our repos though :-/

manics commented 3 years ago

Could we run a second GitHub authenticated BinderHub alongside the main one, with any GitHub account allowed to login? Other than the authentication it would be identical to the other deployment. Requiring a GitHub login will hopefully be a barrier to anyone abusing mybinder, and if they do those users could be blocked.

manfromjupyter commented 3 years ago

The bad actors are just going to move to other providers that aren't AWS; it's just a matter of time. I don't know much about the problem binder is facing, but could you make it just painfully slow for AWS users, require a password for users coming from one of the unsupported datacenters, add an allow/white list, or do anything else you can think of, if a long-term solution is not coming to mind or already in the works?

My bias is that I use AWS to test all of the accessibility task force's features and fixes, and I was going to develop things for that accessibility initiative as well, but I am blocked from doing all of this until then. I can't not use AWS: it's the only thing my employer supports, all of my paid-for testing tools only have licenses on my AWS instance, and they will not pay thousands more for the next environment and licenses that will just get blocked next. I just can't move to another system, especially if the largest/most-widely-adopted provider, at 31% of the entire cloud market, isn't supported. It's much easier to be a member of the mob than to be the one fixing things, and for that I'm sorry; I just wanted to "bump" without it being seen as annoying.

minrk commented 3 years ago

@manfromjupyter I appreciate the frustration, and thanks for describing your use case. I don't fully understand how mybinder.org is critical to working on JupyterLab things, though. That seems like a problem in workflow, and perhaps a mis-use of Binder. If you are already running on a VM, why not run JupyterLab there? I imagine that would be much more efficient, and give you control.

If nothing else, you should be able to run the same repo2docker command that Binder uses for any given PR ref to get the same environment without any restrictions.

manfromjupyter commented 3 years ago

@minrk, currently I'm not using it to develop; I'm using it to test other people's work. If that's not the purpose of binder, forgive me, I just started doing all of this late last year. When I do start developing, the other people who need to test it will be on AWS. I want to have a blind person test it, which should just require them to click a button to launch it, not go through the laborious process of standing up an environment (especially for a non-technical person). I acknowledge there are bigger problems you are dealing with at the moment; I just request a workaround for AWS, if it's at all possible.

Also, as a side note, JupyterLab's documentation website lists binder as the preferred way to contribute (source), or at least that's my takeaway, because it's the very first thing/section mentioned, before downloading packages and such to set up your environment. If this is not the preferred way, maybe we should tell somebody to put the preferred way first.

minrk commented 3 years ago

OK, thanks for clarifying. Making an exception for AWS is tricky, since it is by far the biggest source of all recent abuse of mybinder.org. Unblocking AWS might as well be unblocking everything.

> could you make it just painfully slow for AWS users

We already do! Any miners who put just about any time into circumventing our silly whack-a-mole are working for less than US minimum wage, but they are costing us and other Binder users a lot more than they are earning.

As @manics said, an authenticated binder would ~eliminate this problem as we'd have a much better mechanism for banning bad actors. mybinder.org doesn't have this at the moment, but GESIS does, though it requires registration. Keeping miners out of free, anonymous compute is a pain, and legitimate use always gets caught up with blunt tools.

In the short-term, if you have docker, you can run a 'local binder' to do the same thing Binder does without any restrictions with:

```
python3 -m pip install jupyter-repo2docker
python3 -m repo2docker --ref add-roles https://github.com/marthacryan/jupyterlab
```

for https://github.com/jupyterlab/jupyterlab/pull/9622 (--ref is the branch name, and https://github.com/marthacryan/jupyterlab is the URL of the fork)

I'm not sure if that suits your workflow or not (if you have users forced to be on AWS, presumably that means you have some control over the environment?). A 'local binder' launcher that looks up a PR and runs that one command might be useful. Something like this:

```python
#!/usr/bin/env python3
import sys
from subprocess import check_call

import click
import requests

@click.command()
@click.argument("pr")
@click.option(
    "--repo",
    default="jupyterlab/jupyterlab",
    help="The repo against which the PR is made.",
)
def repo2docker_pr(pr, repo):
    # fetch info for the pr
    r = requests.get(f"https://api.github.com/repos/{repo}/pulls/{pr}")
    head = r.json()["head"]
    # get the git url of the fork
    repo_url = head["repo"]["git_url"]
    # and the current commit of the pr
    ref = head["sha"]
    command = [sys.executable, "-m", "repo2docker", "--ref", ref, repo_url]
    print(" ".join(command))
    check_call(command)

if __name__ == "__main__":
    repo2docker_pr()
```
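Saved as, say, `repo2docker_pr.py` (the filename is just an example), the sketch above could be invoked as `python3 repo2docker_pr.py 9622` to build and launch the environment for that PR locally.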

> JupyterLab's documentation website lists binder as the preferred way to contribute

I don't know that it's preferred, but I can see how you'd get that impression, since it comes first. I think implicit in the "from within the browser" is "if you can't set up a local development environment," but that should probably be made explicit, especially since it's such a slow iteration process. I wouldn't recommend doing that unless it's really not technically feasible to install jupyterlab locally following the directions immediately below.

betatim commented 3 years ago

Thanks @manfromjupyter for taking the time to tell us! We need these kinds of reports/bumps because otherwise we don't have a good way of knowing what collateral damage we are causing.

Being able to run repo2docker locally (on your AWS instance) and the snippet Min posted are nice. Also take a look at the GESIS instance. Besides being auth'ed it also gives you more RAM which might be useful when dealing with jupyterlab.


Pondering the idea of an authenticated mybinder.org, and the fact that we had to turn off/whitelist all of Azure(?) to allow our automated tests to work: could we add a "token based auth" with less effort than setting up a parallel authenticated hub? I was thinking of a token derived from some shared secret, or an explicit list of valid tokens, that is either sent in the Authorization: header (when used by CLI/tests) or in a special cookie for human users like @manfromjupyter. Having an explicit list of tokens isn't mega scalable, and some kind of shared-secret-derived value is prone to (eventually) getting hacked by abusers. But it would solve our CI issue and help out people who explicitly come to us asking for an exemption. What do people think?
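A minimal sketch of the shared-secret variant, purely for illustration (the secret handling and function names are assumptions, not anything mybinder.org actually does):

```python
# Illustrative only: derive and verify per-user tokens from one shared secret,
# so individual tokens don't have to be stored server-side.
import hashlib
import hmac

SHARED_SECRET = b"put-a-real-secret-in-the-deployment-config"  # assumption

def issue_token(user_id: str) -> str:
    """Derive a token for a user (e.g. a CI job or an exempted person)."""
    return hmac.new(SHARED_SECRET, user_id.encode(), hashlib.sha256).hexdigest()

def verify_token(user_id: str, token: str) -> bool:
    """Check a token sent in the Authorization header or a cookie."""
    return hmac.compare_digest(issue_token(user_id), token)
```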

minrk commented 3 years ago

> Could we add a "token based auth" with less effort

A simple dict of hashed tokens that bypasses certain checks, like quotas or ingress-source blocks, should be pretty easy to add. It's not the most general/sustainable solution, but it solves a real problem we have today, so it is perhaps worthwhile in the short term.
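Something along these lines, where the names and config shape are made up for illustration and are not BinderHub configuration:

```python
# Sketch: a dict of hashed tokens that lets specific callers bypass checks
# such as the datacenter-CIDR block or per-IP quotas.
import hashlib
from typing import Optional

# store only hashes in config/secrets, never the raw tokens
bypass_token_hashes = {
    "sha256:08c2348e...": "github-actions-tests",  # truncated example hash
}

def bypass_checks(token: Optional[str]) -> bool:
    if not token:
        return False
    digest = "sha256:" + hashlib.sha256(token.encode()).hexdigest()
    return digest in bypass_token_hashes
```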

isabela-pf commented 3 years ago

Further work at #1309.

betatim commented 3 years ago

Should we remove the blanket ban of cloud provider IPs now that we have the build token? It would let us find out if the token by itself is good enough to deal with abuse. This would be great news as it unblocks people who use cloud instances as proxies.

choldgraf commented 3 years ago

+1 from me since a significant-enough group of folks said this blanket ban is a problem for them

betatim commented 3 years ago

With #1985 we should now no longer block ingress from data centers.

If someone has the ability to test this and help debug it if it doesn't work, that would be great. (The deploy will probably take a few more minutes from now.)

choldgraf commented 3 years ago

Maybe @ellisonbg can report whether he's seeing a drop in the kinds of reports he noted above?