jupyterhub / team-compass

A repository for team interaction, syncing, and handling meeting notes across the JupyterHub ecosystem.
http://jupyterhub-team-compass.readthedocs.io

Set up a JupyterHub that can run CUDA code for NeurIPS #92

Closed · jzf2101 closed this issue 5 years ago

jzf2101 commented 5 years ago

We need to set up a JupyterHub with CUDA support and then turn it into a BinderHub.

See https://github.com/jupyterhub/team-compass/issues/52 and https://github.com/jupyterhub/zero-to-jupyterhub-k8s/issues/994#issue-373992464

@choldgraf @betatim @aculich @consideRatio @cmd-ntrf

Based on #81 we could use the binder account?

choldgraf commented 5 years ago

yes, feel free to use the binder account, but please be mindful of the $$$ you burn through :-)

consideRatio commented 5 years ago

I'm happy to do the initial setup of the GKE cluster, node pools, etc., but I would need to be added to the Google Cloud project it should be done in, if that is alright.

cmd-ntrf commented 5 years ago

Correct me if I am wrong, but unless pods have the ability to share a GPU in GKE, we will face issues when running more containers than we have GPUs. This potentially limits the scalability of Binder in these circumstances.

I guess as long as it is only for a demo this is fine, but if we want people to try it on their own, we might need to set up a generous cluster.

jzf2101 commented 5 years ago

@cmd-ntrf based on #81 we are planning on limiting the scaling to put a ceiling on the number of concurrent users. @yuvipanda mentioned we could possibly configure the spawner not to scale automatically.
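As a rough sketch, in Zero to JupyterHub terms such a ceiling could be set through the chart's hub limits (assuming the hub.activeServerLimit and hub.concurrentSpawnLimit options; the numbers below are placeholders, not agreed values):

  # config.yaml (illustrative values only)
  hub:
    # hard ceiling on simultaneously running user servers
    activeServerLimit: 50
    # limit how many servers may be spawning at the same time
    concurrentSpawnLimit: 10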

consideRatio commented 5 years ago

@jzf2101, the cheapest GPUs will do, right? They are still VERY powerful GPUs (NVIDIA Tesla K80).

I need to tie the GKE cluster to a region and, somewhat more loosely, to a zone/datacenter, and each zone has only certain GPUs available. I'll choose a zone where K80s and P100s are available, as compared to one where only V100s are available - ok?

Note that K80 < P100 < V100 < TPUs in price and performance, but a K80 is still a 1000+ USD graphics card, or thereabouts.
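For reference, a sketch of how the per-zone availability can be checked with gcloud (assuming the compute accelerator-types command; the zone filter is illustrative):

  # list which GPU types are offered in the us-east1 zones
  gcloud compute accelerator-types list --filter="zone:(us-east1-c us-east1-d)"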

jzf2101 commented 5 years ago

Yeah, I figured that ideally we could give people 1 K80 each. Since the demo is in eastern Canada, somewhere in eastern Canada or the eastern US would work? Perhaps we should give them the option to use a K80 or not, so we don't have to use a K80 every time.

choldgraf commented 5 years ago

For GPU fanciness, a minimum viable product is totally fine in this case, IMO. We're just giving people a proof-of-concept here; we don't need to mine bitcoin for the whole conference :-)

betatim commented 5 years ago

@consideRatio nice to have you helping out :) Let me know if you need to know something about the current setup in the GKE project.

The cost calculation we did was based on having one n1-standard-2 instance per pod and giving each instance a K80. As we will not auto-scale the cluster I think having one node per pod (or user) with one GPU attached to it will simplify things for the demo.

(I think in the long term (and "at scale") we'd want a setup where the node your notebook runs on doesn't actually have all that many resources, and instead you use a library/tool to create worker Job(s) that run on a GPU-enabled instance, similar to how Kubeflow and Dask work. But this is something to discuss after the demo.)
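A rough sketch of the kind of node pool the per-user setup above implies (assuming the standard gcloud flags; pool name, node count, and region are placeholders):

  # GPU node pool: n1-standard-2 nodes with one K80 each, no autoscaling
  gcloud container node-pools create gpu-pool \
    --cluster neurips --region us-east1 \
    --machine-type n1-standard-2 \
    --accelerator type=nvidia-tesla-k80,count=1 \
    --num-nodes 5
  # note: GKE GPU nodes also need the NVIDIA driver installer DaemonSet applied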

consideRatio commented 5 years ago

@betatim ah, I appreciate the insight about Kubeflow; I really wanted to learn more but have not done it yet. I know there is work being done on making requests for fractions of a GPU, and I've heard timeframes of 6 months, but I'm thinking 12 months in practice. I think it will be best to use scheduled jobs for the GPU no matter what, though, as the workload tends to be bursty for ML purposes, and the GPU memory usage may peak hard during use, so requests/limits may be too simple a solution to a complex problem anyhow.

@jzf2101: K80 GPU, eastern US, profile_list with a CPU and a GPU option provided.
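A minimal sketch of what that profile_list could look like in the chart config (assuming Zero to JupyterHub's singleuser.profileList with kubespawner_override; names and limits are illustrative):

  singleuser:
    profileList:
      - display_name: "CPU only"
        description: "Standard environment, no GPU attached"
      - display_name: "One NVIDIA Tesla K80 GPU"
        kubespawner_override:
          extra_resource_limits:
            nvidia.com/gpu: "1"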

Who points what domain to the IP later btw?

choldgraf commented 5 years ago

we can point a sub-domain of mybinder.org to the deployment (e.g. neurips.mybinder.org and hub.neurips.mybinder.org)

jzf2101 commented 5 years ago

We don't need that for the initial setup though, correct?

consideRatio commented 5 years ago

@jzf2101 nope no need for that during the initial setup

consideRatio commented 5 years ago

Creating a regional cluster named neurips in the binder-prod Google project. It will have nodes only in the zone us-east1-d. The only other option would have been us-east1-c, assuming we want us-east and a zone with NVIDIA Tesla K80 GPU nodes.

(It is good practice to always go for regional clusters and VPC-native networking, according to Google k8s experts.)
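For context, a sketch of roughly what that cluster creation looks like (assuming the usual gcloud flags; --enable-ip-alias is what makes the cluster VPC-native):

  gcloud container clusters create neurips \
    --project binder-prod \
    --region us-east1 \
    --node-locations us-east1-d \
    --enable-ip-alias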

jzf2101 commented 5 years ago

So wait us-east1-d doesn't have K80s?

consideRatio commented 5 years ago

Oops, I meant "the only other option" - we do have K80s on the current cluster. Will try to get time to install BinderHub as well tonight.

betatim commented 5 years ago

Is there a repository already that has the configuration/deploy/helm chart in it?

jzf2101 commented 5 years ago

Status update - we have a BinderHub up now and I can start putting images on it, but if we don't have K80s attached, how do we know that the repos I put in with CUDA work?
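One quick way to check, once a GPU node is attached, is from inside a launched container (a sketch; it assumes the image ships the NVIDIA tools and/or a framework such as PyTorch):

  # inside a running user container on a GPU node
  nvidia-smi
  python -c "import torch; print(torch.cuda.is_available())"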

choldgraf commented 5 years ago

Hey all - a quick ping on this one. Do we have something working at mybinder.org that can run CUDA? I believe that @consideRatio got this to work, but please let us know if there are any blockers on this one that we can work through!

jzf2101 commented 5 years ago

@choldgraf I think there's been chatter on the Gitter channel about what's going on. From what I recall, I think we're waiting for an increase in quota from GCP. @consideRatio has requested it but we haven't gotten it yet. We have 1 K80 and 100 preemptible ones - I'm assuming they're also K80s?

Also I think @cmd-ntrf and @choldgraf need to be added to the repo.

consideRatio commented 5 years ago

We now have a quota of 20 standard K80 GPUs and 100 preemptible ones. Work still remains, and I'll spend time on it between 19:00 and 01:00 tonight Swedish time, when I'm off work.

betatim commented 5 years ago

This comment serves as a todo list for the deployment (for repositories to run on the deployment, use a different issue/comment). The list is sorted (top to bottom) from "absolutely needed, there will be no demo without this" to "nice to have" to "let's try this out for extra credit". If you have a moment to contribute, please work on things that are at the top of the list; when you check them off, edit this post to add a link to where you performed that change. Don't tackle things further down if there are open items towards the top.


Currently deployed: https://github.com/consideRatio/neurips.mybinder.org-deploy


To do list (please edit it if you think of things that need to get done, keeping the sort order in mind, so insert them at a good place, not at the end):

minrk commented 5 years ago

CORS should be disabled by default, but it's worth verifying.

minrk commented 5 years ago

Five minutes might be too aggressive a cull timeout. The activity metric is only updated on the Hub periodically when checking in with the proxy, so culling needs to be less frequent than the proxy-check-routes interval.
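As a sketch, in Zero to JupyterHub terms this means keeping the cull timeout comfortably above the activity-update interval (keys as in the chart's cull section; the numbers are placeholders):

  cull:
    enabled: true
    timeout: 3600   # seconds of inactivity before a server is culled
    every: 600      # how often the culling check runs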

Utilize JupyterLab by default with the BinderHub

Setting c.Spawner.default_url = '/lab' should be all that's needed for this, I think.
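In Helm chart terms, a minimal sketch (assuming the chart's singleuser.defaultUrl option, which sets c.Spawner.default_url under the hood):

  singleuser:
    defaultUrl: "/lab"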

consideRatio commented 5 years ago

Would it still function with the provided "start this file on startup" option as well?

minrk commented 5 years ago

Ah, no. You'd still need to use 'url to open' instead of 'file to open'.

minrk commented 5 years ago

We can teach BinderHub about JupyterLab's file-open URLs now that they exist. Then we can have an option on BinderHub to build /lab/tree/... URLs instead of /notebooks/... ones.
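For illustration, the two URL shapes in question (host, path, and file name are hypothetical):

  # classic notebook interface
  https://<binder-host>/.../notebooks/index.ipynb
  # JupyterLab file-open URL
  https://<binder-host>/.../lab/tree/index.ipynb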

consideRatio commented 5 years ago

@betatim @minrk should we go for Lab or not, even without the file-open support?

minrk commented 5 years ago

I wouldn't do it today, no.

consideRatio commented 5 years ago

Deployment oversight guide

The conference is in Montreal (GMT-5), and the deployment may already be needed in the morning.

Technical knowledge to understand things better

Below are the planned operations on the deployment during the conference.

Monday night, current

We now have 5 GPUs potentially available, and a single user-placeholder ensures we have at least one quickly accessible.

Tuesday morning

At 6 AM (12:00 Swedish time) @consideRatio does the following:

  # Scale placeholders to 15
  kubectl scale -n neurips sts/user-placeholder --replicas 15

  # Update JupyterHub's upper limit of users to 100 (config change to be deployed)
  hub:
    activeServerLimit: 100
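To apply the config change above, something along the lines of the usual Helm workflow (release, chart, and file names are placeholders for whatever the neurips deploy repo uses; in a BinderHub deployment the hub snippet may need to sit under a top-level jupyterhub: key):

  helm upgrade neurips jupyterhub/binderhub --version <chart-version> -f config.yaml -f secret.yaml
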
betatim commented 5 years ago

This is a great plan/schedule!

For the point on pre-pulling images: I think you mean the right thing, but the way it is written is ambiguous, so here are my two cents. You need to look at the name of the image that repo2docker creates for that particular repository; this is different from the name of the image used by any running pod. There are two ways to see the image name:

consideRatio commented 5 years ago

Ah yep, I was thinking of the -o yaml approach as well; in practice:

kubectl get pod -o yaml <podname> | grep "image: gcr"

And as you suggest, I can also look here; this is the easiest way to be confident about it: https://console.cloud.google.com/gcr/images/binder-prod?project=binder-prod
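If we then want a built image pre-pulled onto the GPU nodes, one option is the chart's pre-puller with extra images (a sketch assuming Zero to JupyterHub's prePuller.extraImages; the image name and tag are placeholders):

  prePuller:
    extraImages:
      neurips-demo:
        name: gcr.io/binder-prod/<r2d-image-name>
        tag: <tag>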