For the subdomain: create a new issue in the team-compass repo (like https://github.com/jupyterhub/team-compass/issues/203, we don't have a template/procedure for this yet :-/). To actually execute the change we need Chris or Min. With the issue we can create a paper trail and officially decide to add the subdomain.
You will also need a domain for the jupyterhub, do you want that as hub.turing.mybinder.org (GKE style) or will the Turing hub have its own domain (OVH style)? Something to discuss in the subdomain issue.
Deployment: mirroring the OVH setup would be the way I'd go. So a new `turing.yml` plus some secrets and maybe some Azure-specific additions to `deploy.py`.
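For concreteness, a sketch of how a new cluster's config plugs into the same deploy flow the other clusters use (file names follow the `turing.yaml` convention discussed here; the flags mirror the helm command that `deploy.py` runs, quoted later in this thread):

```bash
# Hedged sketch: deploy.py essentially layers the public and secret config
# files for a cluster onto the shared mybinder chart like this.
helm upgrade --install --namespace turing turing mybinder \
    -f config/turing.yaml \
    -f secrets/config/common.yaml \
    -f secrets/config/turing.yaml \
    --wait --timeout 600
```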
Can you create an account on Keybase (and verify it with at least your GitHub account)? Then I can send you the keys for the secret content.
You will also need a domain for the jupyterhub, do you want that as hub.turing.mybinder.org (GKE style) or will the Turing hub have its own domain (OVH style)? Something to discuss in the subdomain issue.
Cool, will open the issue. I think hub.turing.mybinder.org will be fine.
Done in https://github.com/jupyterhub/team-compass/issues/205
Can you create an account on Keybase (and verify it with at least your GitHub account)? Then I can send you the keys for the secret content.
Sure, I'll try and do that at some point today.
If you need a domain to test the setup with before we have the "final" details for the cluster, let me know and I can assign a throwaway subdomain from wtte.ch. It might be convenient to have a domain that can be updated more quickly than mybinder.org, which requires someone in a different timezone. Or you could register your own domain to host throwaway stuff :D
Keybase account created and verified with GitHub 👍
Update:

- Created a resource group called `binder-prod` (equivalent to a GKE project). The location of this group (and hence all resources within it) is `westeurope`.
- Created a container registry called `turingmybinderregistry` for image storage.
- Service Principal received! Will deploy the cluster soon.
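For reference, a hedged sketch of the Azure CLI equivalents of the steps above (the names `binder-prod`, `westeurope` and `turingmybinderregistry` come from the update itself; the exact flags and SKU actually used may have differed):

```bash
# Create the resource group that everything else lives in.
az group create --name binder-prod --location westeurope

# Create the container registry for built images.
az acr create --resource-group binder-prod --name turingmybinderregistry --sku Basic

# Create a Service Principal that the deployment can authenticate with.
az ad sp create-for-rbac --name <sp-name>
```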
What is a resource group? Is it an Azure name for a Kubernetes concept (namespaces)? Or an Azure cloud thing?
Completely selfish suggestion: do you have time for a tour of (very!) basic Azure stuff during the team meeting? I'd reciprocate with a tour of the Google cloud UI, buttons and CLI commands.
It's an Azure cloud thing - a way of grouping resources (compute, storage, network, etc).
Yes, a Resource Group is just a label. Computationally means nothing, but allows you to group together resources that are related. (Here, "related" means that I, as a human, know that these things are being used for the same conceptual project.)
Yes, I'm happy to give a tour during the team meeting, we could maybe do a specific zoom call for this too so there's more time for questions?
(Just a small note to say 😻 😻 😻)
Should we add notes about comms etc to this issue? Or keep this technical and make a new issue to drum up lots of excitement 😉 ?
Thank you! 💖 I think keep this one technical and a second one for comms 😄
A new meeting would be nice but also tricky because we'd have to find a timeslot for it.
Depending on how much is on the agenda for the next meeting I'd be happy to spend 20-30min of the meeting to listen and ask a few questions. When I wrote my earlier comment I was thinking of watching you setup a kubernetes cluster, install something on it, look at the logs, do something else, done. Something to take away the feeling of "oh wow, so many buttons and it all has different names to Google cloud. Ok maybe I need to block off a few hours to just figure out where I am."
New issue for comms sounds good.
It's a lot of work and requires self-confidence, but if you're up for it you could record a screencast on your own and upload it to e.g. youtube? Could also be linked from the docs.
We could have a 1-to-1 zoom call if you wanted, I may also be able to do a screencast at some point. But tbh, my usual Azure workflow is having the Azure CLI installed locally and running stuff from my terminal. Deploying the k8s cluster will be very similar to the JupyterHub docs, but I'll probably do it with autoscaling. There's also the docs I (try to) keep updated in the hub23-deploy repo.
I spend more time looking at `kubectl logs` than I do anything on the Portal.
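For concreteness, a generic illustration of that CLI-first loop (not from the thread; the resource group is the one mentioned earlier, the cluster, namespace and pod names are placeholders):

```bash
# Authenticate and fetch cluster credentials with the Azure CLI, then do the
# day-to-day inspection with kubectl rather than the Portal.
az login
az aks get-credentials --resource-group binder-prod --name <cluster-name>
kubectl --namespace <namespace> get pods
kubectl --namespace <namespace> logs <pod-name>
```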
Ok, that is already super useful ("using CLI most of the time, hardly ever click"). Let's see how we are doing for time at the meeting and if there is interest but I'd be happy to show people around https://console.cloud.google.com/.
Back to discussing "Turing joins the federation" :D
Last comment on this topic is that I added it to the agenda for the meeting :tada:
Back to the proper topic! Turing is switching its subscription backend (which matters more for billing than for interacting with resources), so I think I will migrate the subscription before deploying the cluster. It's quite a lengthy process - it took 6 hours to migrate a single VM 😱 - so doing that before we get a load of resources set up will probably be easier.
I finally managed to deploy a cluster! I'm going to do some tests with a basic BinderHub set-up before I properly integrate it. Lots of stuff has been migrating on Azure recently so I want to check it all still works.
I'm experimenting with multiple nodepools: https://docs.microsoft.com/en-gb/azure/aks/use-multiple-node-pools
Other useful docs: https://docs.microsoft.com/en-us/cli/azure/ext/aks-preview/aks/nodepool?view=azure-cli-latest
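Based on those docs, a hedged example of what adding a second node pool looks like (resource group from earlier in the thread; cluster name, pool name and sizes are placeholders, not the values actually used):

```bash
# Add a user node pool with the cluster autoscaler enabled (requires the
# aks-preview CLI extension at the time of writing).
az aks nodepool add \
    --resource-group binder-prod \
    --cluster-name <cluster-name> \
    --name userpool \
    --node-count 1 \
    --enable-cluster-autoscaler \
    --min-count 1 \
    --max-count 6
```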
Where I'm currently at with the Turing federation cluster.
Running `deploy.py turing turing` locally produces this helm chart templating error:
Error: render error in "mybinder/templates/matomo/secret.yaml": template: mybinder/templates/matomo/secret.yaml:11:61: executing "mybinder/templates/matomo/secret.yaml" at <b64enc>: wrong number of args for b64enc: want 1 got 0
Which means it's looking for:

    matomo:
      db:
        serviceAccountKey:

in `secrets/config/turing.yaml`. What is that and how do I get one?
For that matter, how come we have matomo as a top level key but it's not listed in the chart requirements? Where does this dependency come from?
Hmmm, I believe that Matomo was planned to be used instead of Google Analytics (maybe @yuvipanda set it up?) but I don't believe we are actively deploying it...somebody correct me if I'm wrong!
It comes from https://github.com/jupyterhub/mybinder.org-deploy/tree/master/mybinder/templates/matomo. Along with all the custom stuff in https://github.com/jupyterhub/mybinder.org-deploy/tree/master/mybinder/templates.
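One way to see what that template actually requires without running a full deploy is just to inspect it in the checkout (a small generic sketch, nothing specific to this thread):

```bash
# List every value the failing template reads; the b64enc error above comes
# from one of these being undefined.
grep -n '.Values' mybinder/templates/matomo/secret.yaml
```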
We do have it deployed (https://mybinder.org/matomo/index.php) and collecting data. I was hoping to remove Google Analytics to give our users more privacy (See https://github.com/jupyterhub/mybinder.org-deploy/issues/725 for more info). I'm not super involved anymore, so I understand if folks wanna remove it and keep a hard dependency on Google Analytics instead.
Thanks everyone! I don't mind if we keep it or scrap it, but I need to know how to set it up for the Turing cluster so I can remove it as a blocker. I'm going to try generating an auth_token here and see if that's enough.
So one thing that seems to work is just leaving the `serviceAccountKey` field for matomo blank.
I'm now very close to having BinderHub installed on the Turing cluster, except `deploy.py` keeps timing out during the `helm upgrade --install` command 😫 (related issue: https://github.com/helm/charts/issues/11904). So I may try just running the commands in `deploy.py` manually and doing `helm install` instead.
Actually, all the pods are running except for the binder pod itself. `kubectl describe` output below the fold - basically having problems mounting volumes.
I think if we have matomo, we can just run it on the main cluster instead of doing that per cluster. Similar to our analytics stuff. How does that feel?
@yuvipanda This sounds perfect! I do think we need to have a refactor of the configs (as per the discussion here) so that GKE-specific stuff doesn't present a blocker to other new federation members. I'd like someone who's a bit more familiar with what's what in all the various yaml files to help me on that though. So I don't break anything! 😂
@sgibson91 can you take a look at how OVH is set up? Because IIRC there is no matomo running there either, so "Matomo only on GKE" is already achievable.
@betatim OVH has `matomo.db.serviceAccountKey` set in `secrets/config/ovh.yaml` - plus the following (which I copied across for turing):

Whether it's "running" or not I don't know, but it's required in the config otherwise the helm chart fails during installation.
Ah ok. I think the `matomo.enabled: false` means the pods aren't running. My guess would be that the actual value in the configs doesn't matter much in that case. Maybe all of the configs and such need to be surrounded by `if` statements so that they don't even get looked at if matomo is disabled.
That would be ideal. I'd like to get this process to the point where I'm not including random config just to make the helm chart happy.
My current issue is that `deploy.py` is timing out because the binder pod never gets beyond the ContainerCreating stage due to it being unable to mount the secrets volume. See this comment: https://github.com/jupyterhub/mybinder.org-deploy/issues/1154#issuecomment-558671163
Thanks for the PR @betatim - just checking the next set of TODOs for this:
Remaining TODOs:
The plan looks good. Agree that we want to keep the domains separate. I'd get the PR merged and cluster running, then slowly step up the `quota` and see what happens. For this we need a working grafana that shows the launch success rate. Do you have the admin PW for grafana.mybinder.org? Then we could add the turing prometheus as a datasource there and get all the panels for free.
The thing I'd look out for is errors related to the container registry as the traffic increases.
What version of BinderHub is running on https://turing.mybinder.org/? It doesn't look like the latest.
Looks like outbound egress isn't restricted to these ports: https://github.com/jupyterhub/mybinder.org-deploy/blob/105c474d1a704701daa7fe8aa9cba55e1e46b2bf/mybinder/values.yaml#L30-L40
Looks like outbound egress isn't restricted to these ports:
@manics I think this deserves its own issue as that wasn't part of the config that I edited and is, therefore, perhaps a problem across all clusters?
What version of BinderHub is running on https://turing.mybinder.org/? It doesn't look like the latest.
I'm just attempting `git pull master && cd mybinder && helm dep up && cd .. && python deploy.py turing turing`, but once again, helm is not playing nicely! (I wish it gave more useful error messages :( )
$ python deploy.py turing turing
The behavior of this command has been altered by the following extension: aks-preview
Merged "turing" as current context in ~/.kube/config
$HELM_HOME has been configured at ~/.helm.
Tiller (the Helm server-side component) has been upgraded to the current version.
Happy Helming!
deployment "tiller-deploy" successfully rolled out
Updating network-bans for turing
Starting helm upgrade for turing
Error: UPGRADE FAILED: a released named turing is in use, cannot re-use a name that is still in use
Traceback (most recent call last):
  File "deploy.py", line 233, in <module>
    main()
  File "deploy.py", line 227, in main
    deploy(args.release, "turing")
  File "deploy.py", line 176, in deploy
    subprocess.check_call(helm)
  File "/Users/sgibson/anaconda3/envs/mybinder/lib/python3.7/subprocess.py", line 347, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['helm', 'upgrade', '--install', '--namespace', 'turing', 'turing', 'mybinder', '--force', '--wait', '--timeout', '600', '-f', 'config/turing.yaml', '-f', 'secrets/config/common.yaml', '-f', 'secrets/config/turing.yaml']' returned non-zero exit status 1.
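Not from the thread, but the usual Helm 2 recovery for that "name that is still in use" error is to inspect and purge the failed release before retrying (note that purging deletes whatever that release already installed):

```bash
# Check the status of the stuck release (FAILED/PENDING etc.).
helm list --all --namespace turing

# Remove it completely so the name can be reused, then re-run deploy.py.
helm delete turing --purge
```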
The NetworkPolicy was added last year https://github.com/jupyterhub/mybinder.org-deploy/pull/699 so it should be included.
Exactly - if it was added last year and is not effective, then that's a separate issue from me incorporating the Turing into the federation? Or are you saying that they're not working on the Turing cluster but are on others? (I've just managed to set the Turing cluster on fire so can't test this right now.)
Do you have the admin PW for grafana.mybinder.org?
@betatim no I've never set up grafana before, how do I go about retrieving it?
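For reference (not from the thread): if grafana on the prod cluster was deployed with the stock grafana helm chart, the admin password usually ends up in a Kubernetes secret, so something along these lines could recover it - the release/secret names and the namespace here are guesses:

```bash
# Hedged sketch: the secret is typically named <release>-grafana with the key
# admin-password; adjust to whatever the prod deployment actually uses.
kubectl --namespace prod get secret prod-grafana \
    -o jsonpath='{.data.admin-password}' | base64 --decode
```

It may equally just live in `secrets/config/common.yaml`, in which case anyone with the repo secrets already has it.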
Current bugs:
- testhub.hub23.turing.ac.uk has fake certificates whereas testbinder.hub23.turing.ac.uk has real ones. I'm pretty sure I'm using the letsencrypt-prod cluster issuer in both cases, so I have no idea what's going on there.
- @consideRatio do you have any advice here? How do I find out if I was banned from Let's Encrypt? (Some basic cert-manager debugging commands are sketched after this list.)
I solved this by re-deploying with new A records and new secrets.
- grafana pods are complaining about shared volume mounts, they can take a really long time to finally initialise (sometimes I have to manually delete them) - see pod description
I'm not sure if this is happening because the WiFi at the Turing is terrible this week (we're running a data study group and have a lot of people here using interwebs). It seems pretty variable as to whether the grafana pods switching over causes `deploy.py` to time out or not. I might try tonight on my own connection.
- network policy and the egress ports aren't actually restricted (confirmed this was turing only on gitter)
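As mentioned above, a hedged set of cert-manager debugging commands for the certificate problem (resource names depend on how cert-manager and the chart were installed, so treat them as placeholders; Let's Encrypt rate-limit or ban errors usually show up in the certificate's events or the cert-manager logs):

```bash
# See which certificates exist and whether they are marked Ready.
kubectl --namespace turing get certificates

# The events on a certificate usually explain why issuance failed,
# including Let's Encrypt rate-limit errors.
kubectl --namespace turing describe certificate <certificate-name>

# cert-manager's own logs (namespace/deployment name depend on the install).
kubectl --namespace cert-manager logs deploy/cert-manager
```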
@manics The hub is now at newhub.hub23.turing.ac.uk and the certificates should now be real. Can we check again if this is still an issue? If so, what do we need to do to solve this?
I can still ssh out of https://newbinder.hub23.turing.ac.uk/
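A hedged way to double-check that from inside the cluster rather than from a user session (the pod name is a placeholder; this just tries to open a connection to port 22, which should fail if egress really were limited to 80/443):

```bash
# Exec into a running user pod and try to reach an SSH port on an external host.
kubectl --namespace turing exec -it <user-pod-name> -- \
    python -c "import socket; socket.create_connection(('github.com', 22), timeout=5); print('port 22 reachable')"
```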
`kubectl -n NAMESPACE describe netpol` should list the currently deployed network policies
Looks like the policies are created, next thing is to verify that the cluster implements them. https://docs.microsoft.com/en-us/azure/aks/use-network-policies#create-an-aks-cluster-and-enable-network-policy suggests it's optional, is there anything in Azure that tells you whether they're active on your cluster?
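One way to answer that from the CLI (a hedged sketch; per those docs the network policy option can only be set when the cluster is created, so this just reports what the cluster was created with):

```bash
# Returns "azure", "calico", or null if no network policy plugin is enabled.
az aks show --resource-group binder-prod --name <cluster-name> \
    --query networkProfile.networkPolicy --output tsv
```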
I will try and get hold of @trallard today
My guess would be this is where we can edit the network policies - but annoying that it's not automatically applied.
I think those are security rules which are independent from the K8s rules. It's the equivalent of a "physical" firewall operating at the network level. Then the K8s network policies are in addition to these, and they're implemented at the software level inside each Kubernetes VM. Either can be used to restrict network traffic, but obviously only the K8s network policies will be managed through the helm chart deployment.
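If it helps to compare the two layers, the Azure-level security rules can also be listed from the CLI (hedged sketch; AKS puts its network security group in the auto-generated node resource group, whose name is a placeholder here):

```bash
# Find the network security group AKS created for the nodes, then list its rules.
az network nsg list --resource-group <node-resource-group> --output table
az network nsg rule list --resource-group <node-resource-group> \
    --nsg-name <nsg-name> --output table
```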
The proposal we wrote in #1124 was accepted! We now have an Azure subscription with $10k to deploy a cluster on to :tada: So this issue is documenting the next steps we'll be taking.
TODOs
I'm going to try and keep the naming conventions similar between the Azure and GKE clusters where possible.
Open Questions
I'll keep this updated as more things occur to me 😄
cc: @KirstieJane