jupyterhub / team-compass

A repository for team interaction, syncing, and handling meeting notes across the JupyterHub ecosystem.
http://jupyterhub-team-compass.readthedocs.io

Set up a JupyterHub that can run CUDA code for NeurIPS #92

Closed · jzf2101 closed this issue 5 years ago

jzf2101 commented 5 years ago

We need to set up a JupyterHub with CUDA support and then turn it into a BinderHub.

See https://github.com/jupyterhub/team-compass/issues/52 and https://github.com/jupyterhub/zero-to-jupyterhub-k8s/issues/994#issue-373992464

@choldgraf @betatim @aculich @consideRatio @cmd-ntrf

Based on #81 we could use the binder account?

choldgraf commented 5 years ago

yes, feel free to use the binder account, but please be mindful of the $$$ you burn through :-)

consideRatio commented 5 years ago

I'm happy to do the initial setup of the GKE cluster, node pools, etc., but I would need to be added to the Google Cloud project it should be done in, if that is alright.

cmd-ntrf commented 5 years ago

Correct me if I am wrong, but unless pods have the ability to share a GPU in GKE, we will face issues when running more containers than we have GPUs. This potentially limits the scalability of Binder in these circumstances.

I guess as long as it is only for a demo this is fine, but if we want people to try it on their own, we might need to set up a generous cluster.

jzf2101 commented 5 years ago

@cmd-ntrf based on #81 we are planning on limiting the scaling to put a ceiling on the number of concurrent users. @yuvipanda mentioned we could possibly configure the spawner not to scale automatically.
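As a rough sketch, in Zero to JupyterHub terms such a ceiling could be set through the chart's hub limits (assuming the hub.activeServerLimit and hub.concurrentSpawnLimit options; the numbers below are placeholders, not agreed values):

  # config.yaml (illustrative values only)
  hub:
    # hard ceiling on simultaneously running user servers
    activeServerLimit: 50
    # limit how many servers may be spawning at the same time
    concurrentSpawnLimit: 10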

consideRatio commented 5 years ago

@jzf2101, the cheapest GPUs will do, right? They are still VERY powerful GPUs (NVIDIA Tesla K80).

I need to tie the GKE cluster to a region and, somewhat more loosely, to a zone/datacenter, and each zone has only certain GPUs available. I'll choose a zone where K80s and P100s are available, as compared to one where only V100s are available - ok?

Note that K80 < P100 < V100 < TPUs in price and performance, but a K80 is still a 1000+ USD graphics card, or thereabouts.
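For reference, a sketch of how the per-zone availability can be checked with gcloud (assuming the compute accelerator-types command; the zone filter is illustrative):

  # list which GPU types are offered in the us-east1 zones
  gcloud compute accelerator-types list --filter="zone:(us-east1-c us-east1-d)"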

jzf2101 commented 5 years ago

Yeah, I figured that ideally we could give people 1 K80 each. Since the demo is in eastern Canada, somewhere in eastern Canada or the eastern US would work? Perhaps we should give them the option to use a K80 or not, so we don't have to use a K80 every time.

choldgraf commented 5 years ago

For GPU fanciness, a minimum viable product is totally fine in this case, IMO. We're just giving people a proof-of-concept here; we don't need to mine bitcoin for the whole conference :-)

betatim commented 5 years ago

@consideRatio nice to have you helping out :) Let me know if you need to know something about the current setup in the GKE project.

The cost calculation we did was based on having one n1-standard-2 instance per pod and giving each instance a K80. As we will not auto-scale the cluster I think having one node per pod (or user) with one GPU attached to it will simplify things for the demo.

(I think in the long term (and "at scale") we'd want a setup where the node your notebook runs on doesn't actually have all that many resources, and instead you use a library/tool to create worker Job(s) that run on a GPU-enabled instance, similar to how Kubeflow and Dask work. But this is something to discuss after the demo.)
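A rough sketch of the kind of node pool the per-user setup above implies (assuming the standard gcloud flags; pool name, node count, and region are placeholders):

  # GPU node pool: n1-standard-2 nodes with one K80 each, no autoscaling
  gcloud container node-pools create gpu-pool \
    --cluster neurips --region us-east1 \
    --machine-type n1-standard-2 \
    --accelerator type=nvidia-tesla-k80,count=1 \
    --num-nodes 5
  # note: GKE GPU nodes also need the NVIDIA driver installer DaemonSet applied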

consideRatio commented 5 years ago

@betatim ah, I appreciate the insight about Kubeflow; I really wanted to learn more but have not done it yet. I know there is work being done on making requests for fractions of a GPU, and I've heard timeframes of 6 months, but I'm thinking 12 months in practice. I think it will be best to use scheduled jobs for the GPU no matter what, though, as the workload tends to be bursty for ML purposes, and the GPU memory usage may peak hard during use, so requests/limits may be too simple a solution to a complex problem anyhow.

@jzf2101: K80 GPU, eastern US, profile_list with a CPU and a GPU option provided.
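A minimal sketch of what that profile_list could look like in the chart config (assuming Zero to JupyterHub's singleuser.profileList with kubespawner_override; names and limits are illustrative):

  singleuser:
    profileList:
      - display_name: "CPU only"
        description: "Standard environment, no GPU attached"
      - display_name: "One NVIDIA Tesla K80 GPU"
        kubespawner_override:
          extra_resource_limits:
            nvidia.com/gpu: "1"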

Who points what domain to the IP later btw?

choldgraf commented 5 years ago

we can point a sub-domain of mybinder.org to the deployment (e.g. neurips.mybinder.org and hub.neurips.mybinder.org)

jzf2101 commented 5 years ago

We don't need that for the initial setup though, correct?

consideRatio commented 5 years ago

@jzf2101 nope no need for that during the initial setup

consideRatio commented 5 years ago

Creating a regional cluster named neurips in the binder-prod Google project. It will have nodes only in the zone us-east1-d. The only other option would have been us-east1-c, assuming we want us-east and a zone with NVIDIA Tesla K80 GPU nodes.

(It is good practice to always go for regional clusters and VPC-native networking, according to Google k8s experts.)
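For context, a sketch of roughly what that cluster creation looks like (assuming the usual gcloud flags; --enable-ip-alias is what makes the cluster VPC-native):

  gcloud container clusters create neurips \
    --project binder-prod \
    --region us-east1 \
    --node-locations us-east1-d \
    --enable-ip-alias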

jzf2101 commented 5 years ago

So wait us-east1-d doesn't have K80s?

consideRatio commented 5 years ago

Oops, I meant "the only other option" - we do have K80s on the current cluster. Will try to get time to install BinderHub as well tonight.

betatim commented 5 years ago

Is there a repository already that has the configuration/deploy/helm chart in it?

jzf2101 commented 5 years ago

Status update - we have a BinderHub up now and I can start putting images on it, but if we don't have K80s attached, how do we know that the repos I put in with CUDA work?
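One quick way to check, once a GPU node is attached, is from inside a launched container (a sketch; it assumes the image ships the NVIDIA tools and/or a framework such as PyTorch):

  # inside a running user container on a GPU node
  nvidia-smi
  python -c "import torch; print(torch.cuda.is_available())"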

choldgraf commented 5 years ago

Hey all - a quick ping on this one. Do we have something working at mybinder.org that can run CUDA? I believe that @consideRatio got this to work, but please let us know if there are any blockers on this one that we can work through!

jzf2101 commented 5 years ago

@choldgraf I think there's been chatter on the Gitter channel about what's going on. From what I recall, I think we're waiting for an increase in quota from GCP. @consideRatio has requested it but we haven't gotten it yet. We have 1 K80 and 100 preemptible ones - I'm assuming they're also K80s?

Also I think @cmd-ntrf and @choldgraf need to be added to the repo.

consideRatio commented 5 years ago

We now have a quota of 20 standard K80 GPUs and 100 preemptible ones. Work still remains, and I'll spend time on it between 19:00 and 01:00 tonight Swedish time, when I'm off work.

betatim commented 5 years ago

This comment serves as a todo list for the deployment (for repositories to run on the deployment, use a different issue/comment). The list is sorted (top to bottom) from "absolutely needed, there will be no demo without this" to "nice to have" to "let's try this out for extra credit". If you have a moment to contribute, please work on things that are at the top of the list; when you check them off, edit this post to add a link to where you performed that change. Don't tackle things further down if there are open items towards the top.


Currently deployed: https://github.com/consideRatio/neurips.mybinder.org-deploy


To do list (please edit it if you think of things that need to get done, keeping the sort order in mind, so insert them at a good place, not at the end):

minrk commented 5 years ago

CORS should be disabled by default, but it's worth verifying.

minrk commented 5 years ago

Five minutes might be too aggressive a cull timeout. The activity metric is only updated on the Hub periodically when checking in with the proxy, so culling needs to be less frequent than the proxy-check-routes interval.
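As a sketch, in Zero to JupyterHub terms this means keeping the cull timeout comfortably above the activity-update interval (keys as in the chart's cull section; the numbers are placeholders):

  cull:
    enabled: true
    timeout: 3600   # seconds of inactivity before a server is culled
    every: 600      # how often the culling check runs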

Utilize JupyterLab by default with the BinderHub

Setting c.Spawner.default_url = '/lab' should be all that's needed for this, I think.
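In Helm chart terms, a minimal sketch (assuming the chart's singleuser.defaultUrl option, which sets c.Spawner.default_url under the hood):

  singleuser:
    defaultUrl: "/lab"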

consideRatio commented 5 years ago

Would it still function with the provided "start this file on startup" option as well?

minrk commented 5 years ago

Ah, no. You'd still need to use 'url to open' instead of 'file to open'.

minrk commented 5 years ago

We can teach BinderHub about JupyterLab's file-open URLs now that they exist. Then we can have an option on BinderHub to build /lab/tree/... URLs instead of /notebooks/... ones.
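For illustration, the two URL shapes in question (host, path, and file name are hypothetical):

  # classic notebook interface
  https://<binder-host>/.../notebooks/index.ipynb
  # JupyterLab file-open URL
  https://<binder-host>/.../lab/tree/index.ipynb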

consideRatio commented 5 years ago

@betatim @minrk should we go for Lab or not, even without the file-open support?

minrk commented 5 years ago

I wouldn't do it today, no.

consideRatio commented 5 years ago

Deployment oversight guide

The conference is in Montreal (GMT-5), and the deployment may already be needed in the morning.

Technical knowledge to understand things better

Below are the planned operations on the deployment during the conference.

Monday night, current

We now have 5 GPUs potentially available, and a single user-placeholder ensures we have at least one quickly accessible.

Tuesday morning

At 6 AM (12:00 Swedish time) @consideRatio does the following:

  # Scale placeholders to 15
  kubectl scale -n neurips sts/user-placeholder --replicas 15

  # Update JupyterHub's upper limit of users to 100 (config change to be deployed)
  hub:
    activeServerLimit: 100
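To apply the config change above, something along the lines of the usual Helm workflow (release, chart, and file names are placeholders for whatever the neurips deploy repo uses; in a BinderHub deployment the hub snippet may need to sit under a top-level jupyterhub: key):

  helm upgrade neurips jupyterhub/binderhub --version <chart-version> -f config.yaml -f secret.yaml
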
betatim commented 5 years ago

This is a great plan/schedule!

For the point on pre-pulling images: I think you mean the right thing, but the way it is written is ambiguous, so here are my two cents. You need to look at the name of the image that repo2docker creates for that particular repository; this is different from the name of the image used by any running pod. There are two ways to see the image name:

consideRatio commented 5 years ago

Ah yep, I was thinking of the -o yaml approach as well; in practice:

kubectl get pod -o yaml <podname> | grep "image: gcr"

And as you suggest, I can also look here; this is the easiest way to be confident about it: https://console.cloud.google.com/gcr/images/binder-prod?project=binder-prod
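If we then want a built image pre-pulled onto the GPU nodes, one option is the chart's pre-puller with extra images (a sketch assuming Zero to JupyterHub's prePuller.extraImages; the image name and tag are placeholders):

  prePuller:
    extraImages:
      neurips-demo:
        name: gcr.io/binder-prod/<r2d-image-name>
        tag: <tag>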