jupyterhub / mybinder.org-deploy

Deployment config files for mybinder.org
https://mybinder-sre.readthedocs.io/en/latest/index.html
BSD 3-Clause "New" or "Revised" License

We would like to join the federation #1772

Open Hassan-Alzahrani opened 3 years ago

Hassan-Alzahrani commented 3 years ago

Hi, we are KAUST, a private research university, and we would like to join the Binder Federation. Currently, we can offer two servers, each with two sockets and 512 GB of memory. Since we are overwhelmed by day-to-day tasks, we would prefer that you take full control of them. Let me know how we can move forward.

Regards, Hassan Alzahrani

Office: +96628081041 | Mobile: +966544701104 | Email: Hassan.Alzahrani@kaust.edu.sa
Automation & Workflows Specialist, IT - Research Computing Department, King Abdullah University of Science and Technology

welcome[bot] commented 3 years ago

Thank you for opening your first issue in this project! Engagement like this is essential for open source projects! :hugs:
If you haven't done so already, check out Jupyter's Code of Conduct. Also, please try to follow the issue template, as it helps other community members contribute more effectively. You can meet the other Jovyans by joining our Discourse forum. There is also an intro thread there where you can stop by and say Hi! :wave:
Welcome to the Jupyter community! :tada:

betatim commented 3 years ago

Hi 👋 !

That's great news. Because you are quite busy and the request is a bit unusual (taking over a full machine, not just a k8s cluster): do you have time to join our team meeting this Thursday (21 January, more details here)?

If not, don't worry. From our side, we will discuss your offer. In particular, we don't normally operate a Kubernetes cluster on bare metal ourselves, so it needs a bit of discussion to work out whether we can take on such a task, what it would mean, etc.

willingc commented 3 years ago

Now 4 machines. We need a project plan with the steps to move forward. @betatim @minrk @sgibson91 Can we put together a checklist of items to get started?

willingc commented 3 years ago

@MridulS fyi

He has good insights re: on-prem k8s.

willingc commented 3 years ago

From monthly chat:

Discussion included:

Thank you Hassan.

sgibson91 commented 3 years ago

Thank you Hassan!

I think the next steps are for you to try and get Kubernetes running on the machines and then we can manage BinderHub on top of that. @MridulS and possibly @manics may be able to help with advice on how to set up Kubernetes on-prem.

davidrpugh commented 3 years ago

@Hassan-Alzahrani Hurray! So happy to see this moving forward. Please let me know if there is anything that I can do to help move this along.

Hassan-Alzahrani commented 3 years ago

Thank you all for your help and support. Currently, we are working internally to satisfy KAUST policies. From there, we will be able to provide k8s to the BinderHub team so we can move forward to the next step. I will let you know as soon as we get k8s up and running.

Hassan-Alzahrani commented 2 years ago

Hi there, I hope you're doing well. Finally, after a long time, we have Kubernetes up and running with the specs below:

Total resources: 7 nodes (3 master nodes, 4 worker nodes)

Capacity:

- vCPU: 69.86 / 328 reserved
- Memory: 0 / 2.71 TiB reserved
- Storage: 1.65 TiB

manics commented 2 years ago

That's great! I've added this to the agenda for Thursday's JupyterHub team meeting

davidrpugh commented 2 years ago

@Hassan-Alzahrani This is great news indeed! I will try to join the JupyterHub team meeting tomorrow but can only stay for 30 minutes or so.

minrk commented 2 years ago

@Hassan-Alzahrani @davidrpugh Thanks for the great discussion at the meeting! The main things we need to get started are:

  1. a kubeconfig file with credentials for the cluster
  2. configuration/credentials for a docker registry (The OVH deployment uses Harbor, which I think you said you'd deploy as well)
  3. what kind of service and/or domain shall we use? Typically, we prefer services with `type: LoadBalancer` (sketched below) and would use e.g. kaust.mybinder.org for the domain. But if you'd like to control it, that's fine, too.
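
For illustration, here is a minimal sketch of the kind of `type: LoadBalancer` service meant in point 3. This is hypothetical, not the actual mybinder.org chart output; the name, labels, and ports are illustrative:

```yaml
# Hypothetical sketch only: the kind of Service we would point
# kaust.mybinder.org at. Name, labels, and ports are illustrative.
apiVersion: v1
kind: Service
metadata:
  name: proxy-public
spec:
  type: LoadBalancer      # the cluster assigns an external IP that goes into DNS
  selector:
    app: jupyterhub
    component: proxy
  ports:
    - name: http
      port: 80
      targetPort: 8000
```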

The secrets can be transferred via any secure mechanism you prefer (ssh-vault, keybase, etc.) and then they will be added to this repo in an encrypted form.

Then we can get started trying to deploy and see what the next steps will be.

minrk commented 2 years ago

@Hassan-Alzahrani @davidrpugh is there anything you need to help move this forward?

davidrpugh commented 2 years ago

@minrk Sorry for the tardy reply. No, there isn't anything that we need from you. We have done everything technical on our end and are just waiting for the final approval from the KAUST InfoSec team before we deploy. Will update you next week.

betatim commented 2 years ago

I am looking forward to this and would like to help out with making this happen when the security team has given their thumbs up.

Hassan-Alzahrani commented 2 years ago

Hi, @minrk @betatim
So far we have the InfoSec green light, and I hope we don't face any obstacles down the road with them. We now have the Rancher portal published; if you send me an email, I will make sure to create your account and share the details.

minrk commented 2 years ago

Great, thanks! I was able to connect and poke around. I have a grant deadline on April 20, so I need to be head-down on that for the next couple weeks, but if @betatim can take the lead, hopefully we can get up and running!

davidrpugh commented 2 years ago

@Hassan-Alzahrani this is great news! Not sure if I can do much except provide moral support at this point. If there is anything specific that I can help with please let me know.

betatim commented 2 years ago

@Hassan-Alzahrani my email is betatim@gmail.com

minrk commented 2 years ago

I'm out from under a pile of writing, and happy to help out with this, @betatim. Let me know what would be helpful!

betatim commented 2 years ago

I can sign in to the rancher UI now and see the cluster 🎉

One thing I am not sure about: how do we create a "robot account" that GH Actions can use to deploy as? From my tour around Rancher I think that Rancher itself does not support "service accounts" with limited permissions. On a different Rancher based k8s cluster I ended up creating a kubernetes ServiceAccount in the namespace that it should operate in and used RoleBindings to give it permission to do things inside that namespace. Does someone else have experience with this? Do you think this is a sensible way to go about this?
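
For concreteness, a hedged sketch of that ServiceAccount + RoleBinding pattern (the `kaust-binder` namespace and `ci-deployer` name are made up for illustration):

```yaml
# Hypothetical sketch of the pattern described above; namespace and names
# are illustrative. Referencing the built-in "admin" ClusterRole from a
# RoleBinding scopes its permissions to that one namespace.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: ci-deployer
  namespace: kaust-binder
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: ci-deployer-admin
  namespace: kaust-binder
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: admin
subjects:
  - kind: ServiceAccount
    name: ci-deployer
    namespace: kaust-binder
```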

In terms of next steps I am thinking:

wdyt?

betatim commented 2 years ago

Maybe it's just that I don't fully understand how ingress will/should work: should we try to get that working and understood with some dummy pods first, before trying to deploy the whole mybinder.org chart, which brings all sorts of complexity?

minrk commented 2 years ago

I think creating an SA via Rancher, and generating a kubectl config for it, ought to work.
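
Roughly, the resulting kubeconfig would look like the sketch below; every value is a placeholder, not the real cluster endpoint or credentials:

```yaml
# Placeholder kubeconfig sketch for a ServiceAccount token; all values
# are illustrative.
apiVersion: v1
kind: Config
clusters:
  - name: kaust
    cluster:
      server: https://<rancher-host>/k8s/clusters/<cluster-id>
      certificate-authority-data: <base64-encoded-CA>
contexts:
  - name: kaust-ci
    context:
      cluster: kaust
      user: ci-deployer
      namespace: kaust-binder
current-context: kaust-ci
users:
  - name: ci-deployer
    user:
      token: <service-account-token>
```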

Exploring the nginx-ingress situation should also be one of the first things, though it doesn't need to block anything. The easiest option is certainly to deploy our own ingress controller like we usually do, but that may not work with how the cluster is set up.

MridulS commented 2 years ago

How about first setting up the BinderHub instance manually on the KAUST cluster and just adding it to the redirector (like GESIS is run right now)? We could make it part of the mybinder-deploy GitHub Actions after the first test run.
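
For context, standalone members are listed in the redirector's config, something along these lines (an unverified sketch from memory; check the actual schema in this repo's values files before relying on the field names):

```yaml
# Unverified sketch of a federation-redirector entry for a standalone
# member; field names should be checked against this repo's config.
federationRedirect:
  hosts:
    kaust:
      url: https://kaust.mybinder.org
      weight: 1
      health: https://kaust.mybinder.org/health
```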

betatim commented 2 years ago

@Hassan-Alzahrani do you know what the external IP of the cluster is? The IP that we should set up as an A record for the hostname that users will use to reach the cluster?

minrk commented 2 years ago

A CNAME record also works, if there's already DNS.

General question: how is access to the cluster expected to work? It seems like an ingress controller is deployed somewhere, but we don't have permission to see its status, IPs, etc.

Also permissions: while we appear to be assigned an admin ClusterRole, we don't have permission to do things like listing namespaces, nodes, etc. This will make administration a bit tricky. What level of permissions should we expect to have?
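
To make the ask concrete, the missing cluster-scoped read access is roughly the following. This is a hypothetical sketch: the role name, group name, and resource list are illustrative, not a vetted requirements list:

```yaml
# Hypothetical sketch of cluster-scoped read access; all names are illustrative.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: binder-team-read
rules:
  - apiGroups: [""]
    resources: ["nodes", "namespaces"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: binder-team-read
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: binder-team-read
subjects:
  - kind: Group
    name: binder-admins   # hypothetical Rancher-managed group
    apiGroup: rbac.authorization.k8s.io
```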

minrk commented 2 years ago

The permissions are also proving to be a deployment challenge. Quite a few things are failing with permission errors, so it's hard to proceed. I managed to turn lots of things off, but the deployment finally failed with an uninformative:

Request unsuccessful. Incapsula incident ID: 765000760013204688-21617223285936908

Hassan-Alzahrani commented 2 years ago

Hi All,

betatim commented 2 years ago

Could you share the hostname that we should point kaust.mybinder.org to?

Hassan-Alzahrani commented 2 years ago

Point it to 45.223.20.138, and I'll check with the security team to allow it on their side.

minrk commented 2 years ago

Thanks for the help! Getting a little further. Attempts to create network policies are failing with an opaque Request unsuccessful. Incapsula incident ID: 1099000710047462809-107588359606178252

We still appear to lack sufficient permissions to install the scheduling-priority ClusterRole.
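
Scheduling priority involves cluster-scoped objects; for example, a PriorityClass is the kind of resource that needs these permissions. A hypothetical sketch, not the chart's actual definition:

```yaml
# Hypothetical PriorityClass sketch: an example of the kind of cluster-scoped
# scheduling object involved; name and value are illustrative.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: binder-user-placeholder
value: -10
globalDefault: false
description: Low priority so placeholder pods yield to real user pods
```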

I've created an ingress and can see from events that it was picked up by cert-manager and the ingress-controller, but it's not available at 45.223.20.138.

Mounting volumes also appears to have an issue:

```
  Warning  FailedAttachVolume      9m43s                 attachdetach-controller  Multi-Attach error for volume "pvc-18d5105f-278e-46bb-8604-103fc1fed815" Volume is already exclusively attached to one node and can't be attached to another
  Normal   SuccessfulAttachVolume  9m24s                 attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-18d5105f-278e-46bb-8604-103fc1fed815"
  Warning  FailedMount             3m6s                  kubelet                  Unable to attach or mount volumes: unmounted volumes=[pvc], unattached volumes=[secret pvc kube-api-access-djdz5 config]: timed out waiting for the condition
  Warning  FailedMount             56s (x12 over 9m11s)  kubelet                  MountVolume.MountDevice failed for volume "pvc-18d5105f-278e-46bb-8604-103fc1fed815" : rpc error: code = Internal desc = mount failed: exit status 32
Mounting command: mount
Mounting arguments: -t ext4 -o defaults /dev/longhorn/pvc-18d5105f-278e-46bb-8604-103fc1fed815 /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-18d5105f-278e-46bb-8604-103fc1fed815/globalmount
Output: mount: /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-18d5105f-278e-46bb-8604-103fc1fed815/globalmount: /dev/longhorn/pvc-18d5105f-278e-46bb-8604-103fc1fed815 already mounted or mount point busy.
  Warning  FailedMount  51s (x3 over 7m40s)  kubelet  Unable to attach or mount volumes: unmounted volumes=[pvc], unattached volumes=[config secret pvc kube-api-access-djdz5]: timed out waiting for the condition
```

minrk commented 2 years ago

I also see that prometheus is already deployed on the cluster. Is it possible to have access to that?

minrk commented 2 years ago

I've pointed *.kaust.mybinder.org to 45.223.20.138, but this simple ingress/service/pod doesn't get exposed:

debug-ingress.yaml:

```yaml
apiVersion: v1
kind: Service
metadata:
  labels:
    app: debug
  name: debug
spec:
  ports:
    - name: http
      port: 80
      protocol: TCP
      targetPort: 80
  selector:
    app: debug
  type: ClusterIP
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  annotations:
    # ingress.kubernetes.io/proxy-body-size: 64m
    kubernetes.io/ingress.class: nginx
    kubernetes.io/tls-acme: "true"
  labels:
    app: debug
  name: debug
spec:
  rules:
    - host: debug2.kaust.mybinder.org
      http:
        paths:
          - backend:
              service:
                name: debug
                port:
                  name: http
            path: /
            pathType: Prefix
  tls:
    - hosts:
        - debug2.kaust.mybinder.org
      secretName: tls-debug
---
apiVersion: v1
kind: Pod
metadata:
  name: debug
  labels:
    app: debug
spec:
  containers:
    - name: basic
      image: nginx
      ports:
        - containerPort: 80
          name: http
          protocol: TCP
```

The same content (with updated host) gets exposed with https on other clusters.

If you can share what an ingress is expected to look like for this cluster, to work with the chosen ingress controller and cert-manager for letsencrypt, I think we can make progress.

The apparent lack of network policy support may be an issue, though. Do you know what's happening there?

Hassan-Alzahrani commented 2 years ago

@minrk @betatim I have granted you admin privileges. I suggest we have a 1-2 hour deployment session over Zoom and try to solve all the issues. wdyt?

minrk commented 2 years ago

I can do any time 10-3 CEST tomorrow.

davidrpugh commented 2 years ago

Go team! Moral support is all I have to provide for the moment. 🤣


minrk commented 2 years ago

Some updates: I had a chat with @Hassan-Alzahrani yesterday and we were able to make some progress, but it's probably going to be a bit before this is fully ready.

We faced some issues:

Because of the Eid holiday, turnaround may be slow for the next week or so.

davidrpugh commented 2 years ago

@Hassan-Alzahrani Is it possible to have a public-private split of the Harbor container registry? Or is this too much overhead to manage?

@minrk I assume that other federation members run their container registries in the cloud. Any idea how much other Binder federation members are paying to run a cloud-hosted container registry?

minrk commented 2 years ago

FWIW, I don't think it's important for us that the registry be publicly accessible; a public URL is just the easiest way to get a trusted SSL cert via ACME. (Note: accessible from outside doesn't mean the images can be pulled. Our Google image registry, for instance, is totally private, but not walled off from the Internet.) If your internal infrastructure makes a trusted connection to an internal service easy, that's fine, too. All we really care about is a trusted, secure connection for our internal components, and whatever's the easiest path to that.

> I assume that other federation members run their container registries in the cloud

OVH runs Harbor internally. GKE uses Google's own GCR (Google Container Registry), and it costs somewhere around $1k/month, depending on how routinely we scrub old images (Harbor has better delete-stale-images options than GCR). Turing uses Docker Hub. I don't know what they pay, or even whether they are able to use the free tier (@callummole or @sgibson91?).

sgibson91 commented 2 years ago

Turing uses the free tier on DockerHub :)

NasrHassanein commented 2 years ago

@minrk We are in the process of getting Harbor exposed to the public. We are still waiting for the InfoSec vulnerability scan.

davidrpugh commented 2 years ago

@Hassan-Alzahrani would it be possible for us to use the free tier on Docker Hub for this task (and just use Rancher for our internal purposes), similar to what @sgibson91 mentioned is being done by the Turing Institute?

I have no idea about the technical details; I'm just trying to make sure that we are considering all the options.

MridulS commented 2 years ago

Just to give another data point: we use Docker Hub at GESIS too, but we pay for the "pro" plan, I think, so we get enough image pulls.

minrk commented 2 years ago

@NasrHassanein any update here? Using Docker Hub is fine for us, since it is working for others. The mysterious firewall filtering issues preventing API requests with certain strings also need to be resolved.

davidrpugh commented 2 years ago

@minrk Suppose we use a public Docker Hub free tier for now while we wait for the longer term solution using our internal registry. What is the impact on BinderHub if we hit our free tier quota for pulls?

MridulS commented 2 years ago

When the free-tier quota (100 pulls per 6 hours) is hit, users who are redirected to the KAUST cluster will see an error message about it in their logs, and they can't really do much unless they know how to forcibly redirect themselves to another cluster. Just by creating an account on Docker Hub you can get more pulls (200 per 6 hours), and it may even just work most of the time.

sgibson91 commented 2 years ago

> it may even just work most of the time

I believe this is the experience of the Turing cluster. I created a Docker Hub organisation where the images get pushed, so that multiple users can have access to the org for cleaning purposes if necessary. There is a turingmybinder user account, and its username and password are what we give the Turing BinderHub to log in and push with. I haven't heard any recent reports of Docker Hub on Turing being an issue.

MridulS commented 2 years ago

> I created a Docker Hub organisation where the images get pushed, so that multiple users can have access to the org for cleaning purposes if necessary. There is a turingmybinder user account, and its username and password are what we give the Turing BinderHub to log in and push with.

+1 on this setup, this is exactly what GESIS does too.

minrk commented 2 years ago

Yup, a kaustmybinder docker hub org should work fine. The firewall/filtering issues are perhaps the most pressing since we can't really interact with the cluster reliably right now.
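
For reference, wiring a Docker Hub org into BinderHub is a small config change; a hedged sketch using the binderhub chart's values (the kaustmybinder org/user are the hypothetical names from above):

```yaml
# Hedged sketch of binderhub chart values for pushing to a Docker Hub org;
# the kaustmybinder org/user names are hypothetical.
config:
  BinderHub:
    use_registry: true
    image_prefix: kaustmybinder/binder-   # pushed as kaustmybinder/binder-<repo>:<ref>
registry:
  username: kaustmybinder
  password: "<docker-hub-password-or-access-token>"
```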

davidrpugh commented 2 years ago

@Hassan-Alzahrani I think we should move forward with a free-tier account on Docker Hub as described above. The issues raised by the InfoSec team will not be solved anytime soon, and the value added by exposing our internal container registry seems marginal. What can I do to move this process along?