jupyterhub / mybinder.org-deploy

Deployment config files for mybinder.org
https://mybinder-sre.readthedocs.io/en/latest/index.html
BSD 3-Clause "New" or "Revised" License

Turing joining the Binder Federation: Part 2! #1154

Closed sgibson91 closed 4 years ago

sgibson91 commented 5 years ago

The proposal we wrote in #1124 was accepted! We now have an Azure subscription with $10k to deploy a cluster on to :tada: So this issue is documenting the next steps we'll be taking.

TODOs

I'm going to try and keep the naming conventions similar between the Azure and GKE clusters where possible.

Open Questions

I'll keep this updated as more things occur to me 😄

cc: @KirstieJane

betatim commented 5 years ago

For the subdomain: create a new issue in the team-compass repo (like https://github.com/jupyterhub/team-compass/issues/203, we don't have a template/procedure for this yet :-/). To actually execute the change we need Chris or Min. With the issue we can create a paper trail and officially decide to add the subdomain.

You will also need a domain for the jupyterhub, do you want that as hub.turing.mybinder.org (GKE style) or will the Turing hub have its own domain (OVH style)? Something to discuss in the subdomain issue.

Deployment: mirroring the OVH setup would be the way I'd go. So a new turing.yml plus some secrets and maybe some azure specific additions to deploy.py. Can you create an account on KeyBase (and verify it with at least your GitHub account) then I can send you the keys for the secret content.

sgibson91 commented 5 years ago

You will also need a domain for the jupyterhub, do you want that as hub.turing.mybinder.org (GKE style) or will the Turing hub have its own domain (OVH style)? Something to discuss in the subdomain issue.

Cool, will open the issue. I think hub.turing.mybinder.org will be fine.

Done in https://github.com/jupyterhub/team-compass/issues/205

Can you create an account on KeyBase (and verify it with at least your GitHub account) then I can send you the keys for the secret content.

Sure, I'll try and do that at some point today.

betatim commented 5 years ago

If you need a domain to test the setup with before we have the "final" details for the cluster, let me know and I can assign a throwaway subdomain from wtte.ch. It can be convenient to have a domain that can be updated more quickly than mybinder.org, which requires someone in a different timezone. Or you register your own domain to host throwaway stuff :D

sgibson91 commented 5 years ago

Keybase account created and verified with GitHub 👍

sgibson91 commented 5 years ago

Update:

sgibson91 commented 5 years ago

Service Principal received! Will deploy the cluster soon.

betatim commented 5 years ago

What is a resource group? Is it an Azure name for a Kubernetes concept (namespaces)? Or an Azure cloud thing?

Completely selfish suggestion: do you have time for a tour of (very!) basic Azure stuff during the team meeting? I'd reciprocate with a tour of the Google cloud UI, buttons and CLI commands.

manics commented 5 years ago

It's an Azure cloud thing: a way of grouping resources (compute, storage, network, etc).

sgibson91 commented 5 years ago

Yes, a Resource Group is just a label. Computationally it means nothing, but it allows you to group together resources that are related. (Here, "related" means that I, as a human, know that these things are being used for the same conceptual project.)

Yes, I'm happy to give a tour during the team meeting, we could maybe do a specific zoom call for this too so there's more time for questions?

KirstieJane commented 5 years ago

(Just a small note to say 😻 😻 😻)

Should we add notes about comms etc to this issue? Or keep this technical and make a new issue to drum up lots of excitement 😉 ?

sgibson91 commented 5 years ago

Thank you! 💖 I think keep this one technical and a second one for comms 😄

betatim commented 5 years ago

A new meeting would be nice but also tricky because we'd have to find a timeslot for it.

Depending on how much is on the agenda for the next meeting I'd be happy to spend 20-30min of the meeting to listen and ask a few questions. When I wrote my earlier comment I was thinking of watching you set up a Kubernetes cluster, install something on it, look at the logs, do something else, done. Something to take away the feeling of "oh wow, so many buttons and it all has different names to Google cloud. Ok maybe I need to block off a few hours to just figure out where I am."


New issue for comms sounds good.

manics commented 5 years ago

It's a lot of work and requires self-confidence, but if you're up for it you could record a screencast on your own and upload it to e.g. youtube? Could also be linked from the docs.

sgibson91 commented 5 years ago

We could have a 1-to-1 zoom call if you wanted, I may also be able to do a screencast at some point. But tbh, my usual Azure workflow is having the Azure CLI installed locally and running stuff from my terminal. Deploying the k8s cluster will be very similar to the JupyterHub docs, but I'll probably do it with autoscaling. There's also the docs I (try to) keep updated in the hub23-deploy repo.

I spend more time looking at kubectl logs than I do anything on the Portal.
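
As a rough illustration of that CLI-first workflow, the sketch below assembles the two `az` commands involved: one that creates a resource group and one that creates an autoscaling AKS cluster inside it. The group name, cluster name, region and node counts are placeholders invented for this example, not the values used for the Turing cluster, and the helper functions are hypothetical.

```python
import subprocess


def resource_group_cmd(group, location):
    """Command that creates an (empty) resource group in a region."""
    return ["az", "group", "create", "--name", group, "--location", location]


def aks_cluster_cmd(group, cluster, min_nodes, max_nodes):
    """Command that creates an AKS cluster with the cluster autoscaler on."""
    return [
        "az", "aks", "create",
        "--resource-group", group,
        "--name", cluster,
        "--node-count", str(min_nodes),
        "--enable-cluster-autoscaler",
        "--min-count", str(min_nodes),
        "--max-count", str(max_nodes),
    ]


def run_all(commands, dry_run=True):
    """Print each command; only execute it when dry_run is False."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.check_call(cmd)


# Placeholder names, purely for illustration.
run_all([
    resource_group_cmd("example-group", "westeurope"),
    aks_cluster_cmd("example-group", "example-cluster", 1, 3),
])
```

With `dry_run=True` nothing is created; the commands are only printed so they can be reviewed first.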

betatim commented 5 years ago

Ok, that is already super useful ("using CLI most of the time, hardly ever click"). Let's see how we are doing for time at the meeting and if there is interest but I'd be happy to show people around https://console.cloud.google.com/.

Back to discussing "Turing joins the federation" :D

sgibson91 commented 5 years ago

Last comment on this topic is that I added it to the agenda for the meeting :tada:

Back to the proper topic! Turing is switching its subscription backend (which is more important if you're interested in billing than interacting with resources), so I think I will migrate the subscription before deploying the cluster. It's quite a lengthy process (it took 6 hours to migrate a single VM 😱), so doing that before we get a load of resources set up will probably be easier.

sgibson91 commented 5 years ago

I finally managed to deploy a cluster! I'm going to do some tests with a basic BinderHub set-up before I properly integrate it. Lots of stuff has been migrating on Azure recently so I want to check it all still works.

sgibson91 commented 5 years ago

I'm experimenting with multiple nodepools: https://docs.microsoft.com/en-gb/azure/aks/use-multiple-node-pools

Other useful docs: https://docs.microsoft.com/en-us/cli/azure/ext/aks-preview/aks/nodepool?view=azure-cli-latest

sgibson91 commented 4 years ago

Where I'm currently at with the Turing federation cluster.

Running deploy.py turing turing locally produces this helm chart templating error:

Error: render error in "mybinder/templates/matomo/secret.yaml": template: mybinder/templates/matomo/secret.yaml:11:61: executing "mybinder/templates/matomo/secret.yaml" at <b64enc>: wrong number of args for b64enc: want 1 got 0

Which means it's looking for:

matomo:
  db:
    serviceAccountKey:

in secrets/config/turing.yaml.

What is that and how do I get one?
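
One way to catch this before helm does would be to check the config for the keys the templates expect. A minimal sketch, assuming the config files have already been loaded and merged into a plain dict (for example with a YAML parser); `missing_keys` and the list of required paths are hypothetical and not part of deploy.py.

```python
def missing_keys(config, required_paths):
    """Return the dotted paths that are absent from a nested dict."""
    missing = []
    for path in required_paths:
        node = config
        for part in path.split("."):
            # Stop as soon as one level of the path does not exist.
            if not isinstance(node, dict) or part not in node:
                missing.append(path)
                break
            node = node[part]
    return missing


# A config where matomo.db exists but serviceAccountKey does not.
config = {"matomo": {"db": {}}}
print(missing_keys(config, ["matomo.db.serviceAccountKey"]))
```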

sgibson91 commented 4 years ago

For that matter, how come we have matomo as a top level key but it's not listed in the chart requirements? Where does this dependency come from?

choldgraf commented 4 years ago

Hmmm, I believe that Matomo was planned to be used instead of Google Analytics (maybe @yuvipanda set it up?) but I don't believe we are actively deploying it...somebody correct me if I'm wrong!

yuvipanda commented 4 years ago

It comes from https://github.com/jupyterhub/mybinder.org-deploy/tree/master/mybinder/templates/matomo. Along with all the custom stuff in https://github.com/jupyterhub/mybinder.org-deploy/tree/master/mybinder/templates.

We do have it deployed (https://mybinder.org/matomo/index.php) and collecting data. I was hoping to remove Google Analytics to give our users more privacy (See https://github.com/jupyterhub/mybinder.org-deploy/issues/725 for more info). I'm not super involved anymore, so I understand if folks wanna remove it and keep a hard dependency on Google Analytics instead.

sgibson91 commented 4 years ago

Thanks everyone! I don't mind if we keep it or scrap it, but I need to know how to set it up for the Turing cluster so I can remove it as a blocker. I'm going to try generating an auth_token here and see if that's enough.

sgibson91 commented 4 years ago

So one thing that seems to work was just leaving the serviceAccountKey field for matomo blank.

I'm now very close to having BinderHub installed on the Turing cluster, except deploy.py keeps timing out during the helm upgrade --install command 😫 (related issue: https://github.com/helm/charts/issues/11904). So I may try just running the commands in deploy.py manually and doing helm install instead.

sgibson91 commented 4 years ago

Actually, all the pods are running except for the binder pod itself. kubectl describe output below the fold - basically having problems mounting volumes.

Binder pod

```
Name:           binder-8478b6b6c5-x8n45
Namespace:      turing
Priority:       0
Node:           aks-default-14930255-vmss000000/10.240.0.4
Start Time:     Tue, 26 Nov 2019 14:06:35 +0000
Labels:         app=binder
                component=binder
                heritage=Tiller
                name=binder
                pod-template-hash=8478b6b6c5
                release=turing
Annotations:    checksum/config-map: 3b98386cb77627ae3a7d9990babb531d2f458ca96bc0bf260982d33d4ed09058
                checksum/secret: c1c9e90aae368e4904d41f8208532e5a76fefa4bd265245f618fc79f8653ba39
Status:         Pending
IP:
IPs:
Controlled By:  ReplicaSet/binder-8478b6b6c5
Containers:
  binder:
    Container ID:
    Image:          jupyterhub/k8s-binderhub:0.1.0-456.7e32ac0
    Image ID:
    Port:           8585/TCP
    Host Port:      0/TCP
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     2
      memory:  1Gi
    Requests:
      cpu:     250m
      memory:  1Gi
    Liveness:  http-get http://:binder/about delay=10s timeout=10s period=5s #success=1 #failure=3
    Environment:
      BUILD_NAMESPACE:                 turing (v1:metadata.namespace)
      JUPYTERHUB_API_TOKEN:            Optional: false
      GOOGLE_APPLICATION_CREDENTIALS:  /secrets/service-account.json
    Mounts:
      /etc/binderhub/config/ from config (rw)
      /etc/binderhub/secret/ from secret-config (rw)
      /root/.docker from docker-secret (ro)
      /secrets from secrets (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from binderhub-token-zps84 (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      binder-config
    Optional:  false
  secret-config:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  binder-secret
    Optional:    false
  docker-secret:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  binder-push-secret
    Optional:    false
  secrets:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  events-archiver-secrets
    Optional:    false
  binderhub-token-zps84:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  binderhub-token-zps84
    Optional:    false
QoS Class:       Burstable
Node-Selectors:
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason       Age                   From                                      Message
  ----     ------       ----                  ----                                      -------
  Normal   Scheduled    53m                   default-scheduler                         Successfully assigned turing/binder-8478b6b6c5-x8n45 to aks-default-14930255-vmss000000
  Warning  FailedMount  8m56s (x20 over 51m)  kubelet, aks-default-14930255-vmss000000  Unable to mount volumes for pod "binder-8478b6b6c5-x8n45_turing(f49d4d3e-1055-11ea-a113-4eb282213ae6)": timeout expired waiting for volumes to attach or mount for pod "turing"/"binder-8478b6b6c5-x8n45". list of unmounted volumes=[secrets]. list of unattached volumes=[config secret-config docker-secret secrets binderhub-token-zps84]
  Warning  FailedMount  3m (x33 over 53m)     kubelet, aks-default-14930255-vmss000000  MountVolume.SetUp failed for volume "secrets" : secret "events-archiver-secrets" not found
```

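
The last event names the culprit: the `secrets` volume points at a secret called `events-archiver-secrets`, which kubelet cannot find. A small sketch for pulling such names out of `kubectl describe` output; the helper name is made up for illustration.

```python
import re


def missing_secrets(describe_output):
    """Return the names of secrets that kubelet reported as not found."""
    names = re.findall(r'secret "([^"]+)" not found', describe_output)
    return sorted(set(names))


# The final event line from the pod description above.
event = (
    'Warning FailedMount 3m (x33 over 53m) kubelet, '
    'aks-default-14930255-vmss000000 MountVolume.SetUp failed for volume '
    '"secrets" : secret "events-archiver-secrets" not found'
)
print(missing_secrets(event))
```
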
yuvipanda commented 4 years ago

I think if we have matomo, we can just run it on the main cluster instead of doing that per cluster. Similar to our analytics stuff. How does that feel?

sgibson91 commented 4 years ago

@yuvipanda This sounds perfect! I do think we need to have a refactor of the configs (as per the discussion here) so that GKE-specific stuff doesn't present a blocker to other new federation members. I'd like someone who's a bit more familiar with what's what in all the various yaml files to help me on that though. So I don't break anything! 😂

betatim commented 4 years ago

@sgibson91 can you take a look at how OVH is set up? IIRC there is no Matomo running there either, so "Matomo only on GKE" is already achievable.

sgibson91 commented 4 years ago

@betatim OVH has matomo.db.serviceAccountKey set in secrets/config/ovh.yaml - plus the following (which I copied across for turing):

https://github.com/jupyterhub/mybinder.org-deploy/blob/ea18a85e0c046c246e29efb51b5f9f1b72598836/config/ovh.yaml#L176-L184

Whether it's "running" or not I don't know, but it's required in the config otherwise the helm chart fails during installation.

betatim commented 4 years ago

Ah ok. I think the matomo.enabled: false means the pods aren't running. My guess would be that the actual value in the configs doesn't matter much in that case. Maybe all of the configs and such need to be surrounded by if statements so that they don't even get looked at if matomo is disabled.
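
The guard idea, sketched in Python terms. The real change would live in the helm templates, so this is only an analogy: the function name and the `service-account.json` data key are hypothetical. When `matomo.enabled` is false the secret is never rendered, so a missing `serviceAccountKey` is never looked at.

```python
import base64


def render_matomo_secret(values):
    """Return the secret data for matomo, or None when matomo is disabled."""
    matomo = values.get("matomo", {})
    if not matomo.get("enabled", False):
        # Disabled: skip rendering, so absent keys cannot cause an error.
        return None
    key = matomo["db"]["serviceAccountKey"]
    encoded = base64.b64encode(key.encode()).decode()
    return {"service-account.json": encoded}


# No serviceAccountKey is configured here, and none is needed.
print(render_matomo_secret({"matomo": {"enabled": False}}))
```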

sgibson91 commented 4 years ago

That would be ideal. I'd like to get this process to the point where I'm not including random config just to make the helm chart happy.

My current issue is that deploy.py is timing out because the binder pod never gets beyond the ContainerCreating stage, as it is unable to mount the secrets volume. See this comment: https://github.com/jupyterhub/mybinder.org-deploy/issues/1154#issuecomment-558671163

sgibson91 commented 4 years ago

Thanks for the PR @betatim - just checking the next set of TODO's for this:

sgibson91 commented 4 years ago

Remaining TODOs:

betatim commented 4 years ago

The plan looks good. Agree that we want to keep the domains separate. I'd get the PR merged and cluster running, then slowly step up the quota and see what happens. For this we need a working grafana that shows the launch success rate. Do you have the admin PW for grafana.mybinder.org? Then we could add the turing prometheus as a datasource there and get all the panels for free.

The thing I'd look out for is errors related to the container registry as the traffic increases.

manics commented 4 years ago

What version of BinderHub is running on https://turing.mybinder.org/? It doesn't look like the latest.

manics commented 4 years ago

Looks like outbound egress isn't restricted to these ports: https://github.com/jupyterhub/mybinder.org-deploy/blob/105c474d1a704701daa7fe8aa9cba55e1e46b2bf/mybinder/values.yaml#L30-L40

betatim commented 4 years ago

1203 is the PR with config from which turing is deployed (manually).

sgibson91 commented 4 years ago

Looks like outbound egress isn't restricted to these ports:

https://github.com/jupyterhub/mybinder.org-deploy/blob/105c474d1a704701daa7fe8aa9cba55e1e46b2bf/mybinder/values.yaml#L30-L40

@manics I think this deserves its own issue as that wasn't part of the config that I edited and is, therefore, perhaps a problem across all clusters?

sgibson91 commented 4 years ago

What version of BinderHub is running on https://turing.mybinder.org/? It doesn't look like the latest.

I'm just attempting git pull master && cd mybinder && helm dep up && cd .. && python deploy.py turing turing, but once again, helm is not playing nicely! (I wish it gave more useful error messages :( )

```
$ python deploy.py turing turing
The behavior of this command has been altered by the following extension: aks-preview
Merged "turing" as current context in ~/.kube/config

$HELM_HOME has been configured at ~/.helm.

Tiller (the Helm server-side component) has been upgraded to the current version.
Happy Helming!
deployment "tiller-deploy" successfully rolled out
Updating network-bans for turing
Starting helm upgrade for turing
Error: UPGRADE FAILED: a released named turing is in use, cannot re-use a name that is still in use
Traceback (most recent call last):
  File "deploy.py", line 233, in <module>
    main()
  File "deploy.py", line 227, in main
    deploy(args.release, "turing")
  File "deploy.py", line 176, in deploy
    subprocess.check_call(helm)
  File "/Users/sgibson/anaconda3/envs/mybinder/lib/python3.7/subprocess.py", line 347, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['helm', 'upgrade', '--install', '--namespace', 'turing', 'turing', 'mybinder', '--force', '--wait', '--timeout', '600', '-f', 'config/turing.yaml', '-f', 'secrets/config/common.yaml', '-f', 'secrets/config/turing.yaml']' returned non-zero exit status 1.
```

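
For reference, the failing command in the traceback can be reconstructed as below, which makes it easier to re-run by hand. This is a sketch inferred from the traceback alone, not a copy of deploy.py; the function name and its parameters are assumptions.

```python
def helm_upgrade_cmd(release, namespace, chart="mybinder", timeout=600):
    """Build the `helm upgrade --install` argument list seen in the traceback."""
    cmd = [
        "helm", "upgrade", "--install",
        "--namespace", namespace,
        release, chart,
        "--force", "--wait",
        "--timeout", str(timeout),
    ]
    # Later -f files override earlier ones, so secrets come last.
    for path in (
        f"config/{release}.yaml",
        "secrets/config/common.yaml",
        f"secrets/config/{release}.yaml",
    ):
        cmd += ["-f", path]
    return cmd


print(" ".join(helm_upgrade_cmd("turing", "turing")))
```
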
manics commented 4 years ago

The NetworkPolicy was added last year https://github.com/jupyterhub/mybinder.org-deploy/pull/699 so it should be included.

sgibson91 commented 4 years ago

Exactly, if it was added last year and is not effective then that's a separate issue to me incorporating the Turing into the federation? Or are you saying that they're not working on the Turing cluster but are on others? (I've just managed to set the Turing cluster on fire so can't test this right now.)

sgibson91 commented 4 years ago

Do you have the admin PW for grafana.mybinder.org?

@betatim no I've never set up grafana before, how do I go about retrieving it?

sgibson91 commented 4 years ago

Current bugs:

sgibson91 commented 4 years ago
  • testhub.hub23.turing.ac.uk has fake certificates whereas testbinder.hub23.turing.ac.uk has real ones, pretty sure I'm using letsencrypt-prod cluster issuer in both cases so I have no idea what's going on there
    • @consideRatio do you have any advice here? How do I find out if I was banned from Let's Encrypt?

I solved this by re-deploying with new A records and new secrets.

  • grafana pods are complaining about shared volume mounts, they can take a really long time to finally initialise (sometimes I have to manually delete them) - see pod description

I'm not sure if this is happening because the WiFi at the Turing is terrible this week (we're running a data study group and have a lot of people here using interwebs). It seems pretty variable as to whether the grafana pods switching over causes deploy.py to time out or not. I might try tonight on my own connection.

  • network policy and the egress ports aren't actually restricted (confirmed this was turing only on gitter)

@manics The hub is now at newhub.hub23.turing.ac.uk and the certificates should now be real. Can we check again if this is still an issue? If so, what do we need to do to solve this?

manics commented 4 years ago

I can still ssh out of https://newbinder.hub23.turing.ac.uk/

kubectl -n NAMESPACE describe netpol should list the currently deployed network policies
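
A quick way to repeat the "can I still get out?" check from inside a launched pod without an ssh client is a plain TCP probe. A minimal sketch; `can_connect` is a hypothetical helper, and which host and port to probe is left to the caller.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS failures alike.
        return False
```

Run in a notebook on the cluster, `can_connect("github.com", 22)` returning `True` would suggest that port 22 egress is not being blocked (the host here is just an example). A `False` is less conclusive, since the remote host could simply be down.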

sgibson91 commented 4 years ago

https://gist.github.com/sgibson91/ffe78df174bf1f3344ef5a86d47b6996

manics commented 4 years ago

Looks like the policies are created, next thing is to verify that the cluster implements them. https://docs.microsoft.com/en-us/azure/aks/use-network-policies#create-an-aks-cluster-and-enable-network-policy suggests it's optional, is there anything in Azure that tells you whether they're active on your cluster?
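
One possible check, sketched under the assumption that the JSON printed by `az aks show` for the cluster contains a `networkProfile.networkPolicy` field, as the linked AKS docs describe. The helper name and the sample JSON are made up for illustration.

```python
import json


def network_policy_plugin(cluster_json):
    """Return the network policy plugin named in `az aks show` JSON, or None."""
    cluster = json.loads(cluster_json)
    profile = cluster.get("networkProfile") or {}
    return profile.get("networkPolicy")


# Made-up sample: a cluster created without a network policy plugin.
sample = '{"networkProfile": {"networkPlugin": "kubenet", "networkPolicy": null}}'
print(network_policy_plugin(sample))
```

If this comes back as `None`, the NetworkPolicy objects can exist in the cluster while nothing enforces them.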

sgibson91 commented 4 years ago

I will try and get hold of @trallard today

sgibson91 commented 4 years ago

image

My guess would be this is where we can edit the network policies - but annoying that it's not automatically applied.

manics commented 4 years ago

I think those are security rules which are independent from the K8s rules. It's the equivalent of a "physical" firewall operating at the network level. Then the K8s network policies are in addition to these, and they're implemented at the software level inside each Kubernetes VM. Either can be used to restrict network traffic, but obviously only the K8s network policies will be managed through the helm chart deployment.