Open · Hassan-Alzahrani opened 3 years ago
Hi 👋 !
That's great news. Because you are quite busy and the request is a bit unusual (take over a full machine, not just a k8s cluster): do you have time to join our team meeting this Thursday (21 January, more details here)?
If not, don't worry. I think from our side we will discuss your offer. In particular, we normally don't operate a Kubernetes cluster on bare metal ourselves, so it needs a bit of discussion to see whether we can take on such a task, what it would mean, etc.
Now 4 machines. We need a project plan with steps to move forward. @betatim @minrk @sgibson91, can we put together a checklist of items to get started?
@MridulS fyi (has good insights re: on-prem Kubernetes)
From monthly chat:
Discussion included:
Thank you Hassan.
Thank you Hassan!
I think the next steps are for you to try and get Kubernetes running on the machines and then we can manage BinderHub on top of that. @MridulS and possibly @manics may be able to help with advice on how to set up Kubernetes on-prem.
@Hassan-Alzahrani Hurray! So happy to see this moving forward. Please let me know if there is anything that I can do to help move this along.
Thank you all for your help and support. Currently, we are working internally to apply KAUST policies. From there we will be able to provide k8s to the BinderHub team so we can move forward to the next step. I will let you know as soon as we get k8s up and running.
Hi there, I hope you’re doing well. Finally, after a long time, we have Kubernetes up and running with the below specs:

Total resources: 7 nodes
- 3 master nodes
- 4 worker nodes

Capacity:
- vCPU: 69.86 / 328 reserved
- Memory: 0 / 2.71 TiB reserved
- Storage: 1.65 TiB
That's great! I've added this to the agenda for Thursday's JupyterHub team meeting
@Hassan-Alzahrani This is great news indeed! I will try to join the JupyterHub team meeting tomorrow but can only stay for 30 minutes or so.
@Hassan-Alzahrani @davidrpugh Thanks for the great discussion at the meeting! The main things we need to get started are:
- a `kubeconfig` file with credentials for the cluster
- `kaust.mybinder.org` for the domain (but if you'd like to control it, that's fine, too)

The secrets can be transferred via any secure mechanism you prefer (ssh-vault, keybase, etc.) and then they will be added to this repo in an encrypted form.
Then we can get started trying to deploy and see what the next steps will be.
@Hassan-Alzahrani @davidrpugh is there anything you need to help move this forward?
@minrk Sorry for the tardy reply. No, there isn't anything that we need from you. We have done everything technical on our end and are just waiting for the final approval from the KAUST InfoSec team before we deploy. Will update you next week.
I am looking forward to this and would like to help out with making this happen when the security team has given their thumbs up.
Hi, @minrk @betatim
So far we have the InfoSec green light, and I hope we won't face any obstacles down the road with them. Now that we have the Rancher portal published, if you can send me an email, I will make sure to create your account and share the details.
Great, thanks! I was able to connect and poke around. I have a grant deadline on April 20, so I need to be head-down on that for the next couple weeks, but if @betatim can take the lead, hopefully we can get up and running!
@Hassan-Alzahrani this is great news! Not sure if I can do much except provide moral support at this point. If there is anything specific that I can help with please let me know.
@Hassan-Alzahrani my email is betatim@gmail.com
I'm out from under a pile of writing, and happy to help out with this, @betatim. Let me know what would be helpful!
I can sign in to the rancher UI now and see the cluster 🎉
One thing I am not sure about: how do we create a "robot account" that GH Actions can use to deploy as? From my tour around Rancher I think that Rancher itself does not support "service accounts" with limited permissions. On a different Rancher-based k8s cluster I ended up creating a Kubernetes `ServiceAccount` in the namespace that it should operate in and used `RoleBinding`s to give it permission to do things inside that namespace. Does someone else have experience with this? Do you think this is a sensible way to go about this?
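For what it's worth, a minimal sketch of that pattern might look like the following (the `binder` namespace and `deployer` names here are hypothetical, not the actual mybinder.org config):

```yaml
# Hypothetical sketch: a namespaced "robot account" for CI deploys.
# Namespace and names are assumptions, not the real config.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: deployer
  namespace: binder
---
# Grant the built-in "edit" ClusterRole, but only inside this namespace,
# by using a RoleBinding rather than a ClusterRoleBinding.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: deployer-edit
  namespace: binder
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: edit
subjects:
  - kind: ServiceAccount
    name: deployer
    namespace: binder
```

Because the binding is a namespaced `RoleBinding`, the built-in `edit` ClusterRole only applies inside that one namespace.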
In terms of next steps I am thinking:
- point `kaust.mybinder.org` to the external IP of the cluster

wdyt?
Maybe it's the fact that I don't fully understand how ingress will/should work, but: should we try to get that working and understood with some dummy pods first, before trying to deploy all of the mybinder.org chart, which brings all sorts of complexity?
I think creating a ServiceAccount via Rancher and generating a kubectl config for it ought to work.
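Assuming such a ServiceAccount exists, the generated kubectl config might be sketched like this (server URL, CA data, and token are all placeholders, not real values):

```yaml
# Hypothetical kubeconfig for a ServiceAccount token; all values are placeholders.
apiVersion: v1
kind: Config
clusters:
  - name: kaust
    cluster:
      server: https://<rancher-or-apiserver-url>
      certificate-authority-data: <base64-ca-cert>
contexts:
  - name: kaust-deployer
    context:
      cluster: kaust
      user: deployer
      namespace: binder
current-context: kaust-deployer
users:
  - name: deployer
    user:
      token: <serviceaccount-token>
```

On Kubernetes 1.24+ a short-lived token can be minted with `kubectl create token <sa-name>`; on older clusters the long-lived token is stored in the Secret associated with the ServiceAccount.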
Exploring the nginx-ingress situation should also be one of the first things, and doesn't need to block. Certainly the easiest is to deploy our own ingress-controller like we usually do, but this may not work with how the cluster's set up.
How about first setting up the BinderHub instance manually on the KAUST cluster and just adding it to the redirector (like GESIS is run right now)? We could make this part of the mybinder-deploy GitHub Actions after the first test run.
@Hassan-Alzahrani do you know what the external IP of the cluster is? I.e. the IP that we should set up as an A record for the hostname that users will use to reach the cluster?
A CNAME record also works, if there's already DNS.
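As a sketch in BIND zone-file syntax, using the cluster IP mentioned elsewhere in this thread (the TTL and the CNAME target are hypothetical):

```
kaust.mybinder.org.    300  IN  A      45.223.20.138
*.kaust.mybinder.org.  300  IN  A      45.223.20.138
; or, if the cluster already has a DNS name (hypothetical target):
; kaust.mybinder.org.  300  IN  CNAME  cluster.kaust.example.
```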
General question: how is access to the cluster expected to work (i.e. it seems like an ingress controller is deployed somewhere, but we don't have permission to see it status, ips, etc.)?
Also permissions: while we appear to be assigned an admin ClusterRole, we don't have permission to do things like list namespaces, nodes, etc. This will make admin a bit tricky. What level of permissions should we expect to have?
The permissions are also proving to be a deployment challenge. Quite a few things are failing with permission errors, so it's hard to proceed. I managed to turn lots of things off, but deployment finally failed with an uninformative:

```
Request unsuccessful. Incapsula incident ID: 765000760013204688-21617223285936908
```
Hi All,
Could you share the hostname that we should point kaust.mybinder.org to?
Point to 45.223.20.138, and I'll check with the security team to allow it on their side.
Thanks for the help! Getting a little further. Attempts to create network policies are failing with an opaque `Request unsuccessful. Incapsula incident ID: 1099000710047462809-107588359606178252`.
We still appear to lack sufficient permissions to install the scheduling priority ClusterRole.
I've created an ingress and can see from events that it was picked up by cert-manager and the ingress-controller, but it's not available at 45.223.20.138.
Mounting volumes also appears to have an issue:
```
Warning  FailedAttachVolume      9m43s                 attachdetach-controller  Multi-Attach error for volume "pvc-18d5105f-278e-46bb-8604-103fc1fed815" Volume is already exclusively attached to one node and can't be attached to another
Normal   SuccessfulAttachVolume  9m24s                 attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-18d5105f-278e-46bb-8604-103fc1fed815"
Warning  FailedMount             3m6s                  kubelet                  Unable to attach or mount volumes: unmounted volumes=[pvc], unattached volumes=[secret pvc kube-api-access-djdz5 config]: timed out waiting for the condition
Warning  FailedMount             56s (x12 over 9m11s)  kubelet                  MountVolume.MountDevice failed for volume "pvc-18d5105f-278e-46bb-8604-103fc1fed815" : rpc error: code = Internal desc = mount failed: exit status 32
Mounting command: mount
Mounting arguments: -t ext4 -o defaults /dev/longhorn/pvc-18d5105f-278e-46bb-8604-103fc1fed815 /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-18d5105f-278e-46bb-8604-103fc1fed815/globalmount
Output: mount: /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-18d5105f-278e-46bb-8604-103fc1fed815/globalmount: /dev/longhorn/pvc-18d5105f-278e-46bb-8604-103fc1fed815 already mounted or mount point busy.
Warning  FailedMount             51s (x3 over 7m40s)   kubelet                  Unable to attach or mount volumes: unmounted volumes=[pvc], unattached volumes=[config secret pvc kube-api-access-djdz5]: timed out waiting for the condition
```
I also see that prometheus is already deployed on the cluster. Is it possible to have access to that?
I've pointed *.kaust.mybinder.org to 45.223.20.138, but this simple ingress/service/pod doesn't get exposed:
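The manifest originally posted here didn't survive; as a sketch, the kind of pod/service/ingress being tested might look like this (the names, host, and cert-manager issuer name are all assumptions):

```yaml
# Hypothetical test manifest: one pod, a service in front of it, and an
# ingress exposing it at a *.kaust.mybinder.org host with a letsencrypt cert.
apiVersion: v1
kind: Pod
metadata:
  name: hello
  labels: {app: hello}
spec:
  containers:
    - name: hello
      image: nginx
      ports: [{containerPort: 80}]
---
apiVersion: v1
kind: Service
metadata:
  name: hello
spec:
  selector: {app: hello}
  ports: [{port: 80, targetPort: 80}]
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: hello
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod  # assumed issuer name
spec:
  rules:
    - host: test.kaust.mybinder.org
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service: {name: hello, port: {number: 80}}
  tls:
    - hosts: [test.kaust.mybinder.org]
      secretName: hello-tls
```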
The same content (with updated host) gets exposed with https on other clusters.
If you can share what an ingress is expected to look like for this cluster, to work with the chosen ingress controller and cert-manager for letsencrypt, I think we can make progress.
The apparent lack of network policy support may be an issue, though. Do you know what's happening there?
@minrk @betatim I have granted you admin privileges. I suggest that we have a 1 - 2 hours deployment session where we can meet over zoom and try to solve all issues wdyt?
I can do any time 10-3 CEST tomorrow.
Go team! Moral support is all I have to provide for the moment. 🤣
Some updates: I had a chat with @Hassan-Alzahrani yesterday and we were able to make some progress, but it's probably going to be a bit before this is fully ready.
We faced some issues:
Because of the Eid holiday, turnaround may be slow for the next week or so.
@Hassan-Alzahrani Is it possible to have a public-private split of the Harbor container registry? Or is this too much overhead to manage?
@minrk I assume that other federation members run their container registries in the cloud. Any idea how much other Binder federation members are paying to run a cloud-hosted container registry?
FWIW, I don't think it's important for us that the registry be publicly accessible. That's just the easiest way to get a trusted cert for SSL via ACME, if it has a public URL. (Note: accessible from outside doesn't mean the images can be pulled; our Google image registry, for instance, is totally private, but not walled off from the Internet.) If your internal infrastructure makes a trusted connection for an internal service easy, that's fine, too. All we really care about is the trusted, secure connection for our internal components, and whatever is the easiest path to that.
> I assume that other federation members run their container registries in the cloud

OVH runs Harbor internally. GKE uses Google's own GCR (Google Container Registry), and it costs somewhere around $1k/month, depending on how routinely we scrub old images (Harbor has better delete-stale-images options than GCR). Turing uses DockerHub. I don't know what they pay, or even if they are able to use the free tier (@callummole or @sgibson91?).
Turing uses the free tier on DockerHub :)
@minrk We are in the process of getting Harbor exposed to the public. We are still waiting for InfoSec vulnerability scan.
@Hassan-Alzahrani would it be possible for us to use the free tier on DockerHub for this task (and just use Rancher for our internal purposes), similar to what @sgibson91 mentioned is being done by The Turing Institute?
I have no idea about the technical details just trying to make sure that we are considering all options.
Just to give another data point, we use dockerhub at GESIS too but we pay for the "pro" plan I think so we get enough image pulls.
@NasrHassanein any update here? Using Docker Hub is fine for us, since it is working for others. The mysterious firewall filtering issues preventing API requests with certain strings also need to be resolved.
@minrk Suppose we use a public Docker Hub free tier for now while we wait for the longer term solution using our internal registry. What is the impact on BinderHub if we hit our free tier quota for pulls?
When the free tier quota (100 pulls per 6 hours) is hit, the users who are redirected to the KAUST cluster will see an error message about this in their logs, and they can't really do much unless they know how to forcefully redirect themselves to other clusters. Just by creating an account on Docker Hub you can get more pulls (200 over 6 hours), and it may even just work most of the time.
> it may even just work most of the time
I believe this is the experience of the Turing cluster. I created a Docker Hub organisation where the images get pushed to, so that multiple users could have access to the org for cleaning purposes if necessary. There is a `turingmybinder` user account, and its username and password are what we give to the Turing BinderHub to log in and push with. I haven't heard any recent reports of Docker Hub on Turing being an issue.
> I created a Docker Hub organisation where the images get pushed to, so that multiple users could have access to the org for cleaning purposes if necessary. There is a turingmybinder user account and its username and password is what we give to the Turing BinderHub to login and push with.
+1 on this setup, this is exactly what GESIS does too.
Yup, a `kaustmybinder` Docker Hub org should work fine. The firewall/filtering issues are perhaps the most pressing, since we can't really interact with the cluster reliably right now.
@Hassan-Alzahrani I think we should move forward with a free tier account on DockerHub as described above. The issues raised by the InfoSec team will not be solved anytime soon and the value add from exposing our internal container registry seems marginal. What can I do to move this process along?
Hi, we are at KAUST, a private research university that would like to join the Binder Federation. Currently, we can offer two servers, each with two sockets and 512 GB of memory, and since we are overwhelmed by day-to-day tasks we would prefer that you take full control of them. Let me know how we can move forward.
Regards, Hassan Alzahrani
Office: +96628081041 | Mobile: +966544701104 | Email: Hassan.Alzahrani@kaust.edu.sa
Automation & Workflows Specialist, IT - Research Computing Department
King Abdullah University of Science and Technology