yes, feel free to use the binder account, but please be mindful of the $$$ you burn through :-)
I'm happy to do the initial setup of the GKE cluster and node pools etc, but I would need to be added to the google cloud project it should be done in if that is alright.
Correct me if I am wrong, but unless pods can share a GPU in GKE, we will face issues when running more containers than we have GPUs. This potentially limits the scalability of Binder in these circumstances.
I guess as long as it is only for a demo this is fine, but if we want people to try it on their own, we might need to set up a generous cluster.
@cmd-ntrf based on #81 we are planning on limiting the scaling to put a ceiling on the number of concurrent users. @yuvipanda mentioned we could possibly configure the spawner not to scale automatically.
@jzf2101, the cheapest GPUs will do, right? They are still VERY powerful GPUs (NVIDIA Tesla K80).
I need to tie the GKE cluster to a region and, somewhat more loosely, to a zone/datacenter, and each zone has only certain GPUs available. I'll choose a zone where K80s and P100s are available, as compared to one where only V100s are available - ok?
Note that K80 < P100 < V100 < TPUs in price and performance, but a K80 is still a 1000+ USD graphics card, or similar to this.
Yeah I figured that we could give people 1 K80 ideally. Since the demo is in eastern Canada, somewhere in eastern Canada or the eastern US would work? Perhaps we should give them the option to use a K80 or not, so we don't have to use a K80 every time.
For GPU fanciness, Minimal Viable Product is totally fine in this case, IMO. We're just giving people a proof-of-concept here, don't need to mine bitcoin for the whole conference :-)
@consideRatio nice to have you helping out :) Let me know if you need to know something about the current setup in the GKE project.
The cost calculation we did was based on having one n1-standard-2 instance per pod and giving each instance a K80. As we will not auto-scale the cluster, I think having one node per pod (or user) with one GPU attached to it will simplify things for the demo.
(I think in the long term (and "at scale") we'd want a setup where the node your notebook runs on doesn't actually have all that many resources, and instead you use a library/tool to create worker Job(s) that run on a GPU-enabled instance. Similar to how kubeflow and dask work. But this is something to discuss after the demo.)
@betatim ah, I appreciate the insight about kubeflow; I really wanted to learn more but have not gotten to it yet. I know there is work being done on making requests for fractions of a GPU, and I've heard timeframes of 6 months, but I'm thinking 12 months in practice. I think it will be best to use scheduled jobs for the GPU no matter what, though, as the workload tends to be bursty for ML purposes, and GPU memory usage may peak hard during use, so requests/limits may be too simple a solution to a complex problem anyhow.
@jzf2101: k80 gpu, eastern us, profile_list with cpu and gpu option provided.
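For reference, a minimal sketch of what such a CPU/GPU choice could look like via z2jh's `singleuser.profileList` (values here are illustrative, not the actual deployed config):

```yaml
singleuser:
  profileList:
    - display_name: "CPU only"
      description: "Standard environment, no accelerator"
      default: true
    - display_name: "GPU (1x NVIDIA Tesla K80)"
      description: "Schedules onto a GPU node"
      kubespawner_override:
        extra_resource_limits:
          nvidia.com/gpu: "1"
```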
Who points what domain to the IP later btw?
we can point a sub-domain of mybinder.org to the deployment (e.g. neurips.mybinder.org and hub.neurips.mybinder.org)
We don't need that for the initial setup though, correct?
@jzf2101 nope no need for that during the initial setup
Creating a regional cluster named neurips in the binder-prod google project. It will have nodes only in the zone us-east1-d. The only other option would have been us-east1-c, assuming we want us-east and a zone with NVIDIA Tesla K80 GPU nodes.
(It is a good practice to always go for regional clusters and native VPC according to google k8s experts.)
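Roughly, that setup could be done along these lines (a sketch with illustrative flags; the actual commands used may have differed):

```sh
# Regional, VPC-native cluster with nodes pinned to us-east1-d
gcloud container clusters create neurips \
  --project binder-prod --region us-east1 \
  --node-locations us-east1-d --enable-ip-alias

# Node pool with one K80 per n1-standard-2 node
gcloud container node-pools create gpu-pool \
  --project binder-prod --cluster neurips --region us-east1 \
  --machine-type n1-standard-2 \
  --accelerator type=nvidia-tesla-k80,count=1

# GKE needs the NVIDIA driver DaemonSet installed once per cluster
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml
```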
So wait, us-east1-d doesn't have K80s?
Ooops, I meant "the only other option"; we do have K80s on the current cluster. Will try to get time to install binderhub as well tonight.
Is there a repository already that has the configuration/deploy/helm chart in it?
Status update - we have a BinderHub up now and I can start putting images on it, but if we don't have K80s attached, how do we know that the repos I put in with CUDA work?
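(One quick way to check, assuming the image ships the usual NVIDIA tooling - a sketch, with a hypothetical pod name:)

```sh
# Confirm the driver and GPU are visible from inside a launched user pod
kubectl exec -n neurips jupyter-demo-pod -- nvidia-smi
```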
Hey all - a quick ping on this one. Do we have something working at mybinder.org that can run CUDA? I believe that @consideRatio got this to work, but please let us know if there are any blockers on this one that we can work through!
@choldgraf I think there's been chatter on the gitter channel about what's going on. From what I recall, I think we're waiting for an increase in quota from GCP. @consideRatio has requested it but we haven't gotten it yet. We have 1 K80 and 100 preemptible ones - I'm assuming they're also K80s?
Also I think @cmd-ntrf and @choldgraf need to be added to the repo.
We now have a quota of 20 standard K80 GPUs, and 100 preemptible ones. Work still remains, and I'll spend time on it between 19:00 and 01:00 tonight Swedish time, when I'm off work.
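(For the record, one way to verify the granted quota - illustrative command:)

```sh
# List the region's quotas; look for NVIDIA_K80_GPUS and the preemptible variant
gcloud compute regions describe us-east1 \
  --project binder-prod --format="yaml(quotas)" | grep -i -B1 -A1 k80
```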
This comment serves as a todo list for the deployment (for repositories to run on the deployment, use a different issue/comment). The list is sorted (top to bottom) from "absolutely needed, there will be no demo without this" to "nice to have" to "let's try this out for extra credit". If you have a moment to contribute, please work on things that are at the top of the list; when you check them off, edit this post to add a link to where you performed that change. Don't tackle things further down if there are open items towards the top.
Currently deployed: https://github.com/consideRatio/neurips.mybinder.org-deploy
To do list (please edit it if you think of things that need to get done, keeping the sort order in mind, so insert them at a good place, not at the end):
BONUS:
CORS should be disabled by default, but it's worth verifying.
5 minutes might be too aggressive to cull. The activity metric is only updated on the Hub periodically when checking in with the proxy, so this needs to be less frequent than the proxy's check-routes interval.
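A sketch of the relevant z2jh culler settings (values illustrative, not what we deployed):

```yaml
cull:
  enabled: true
  timeout: 600  # seconds of inactivity before a server is culled; 300 may be too aggressive
  every: 120    # how often the culler runs; keep coarser than the proxy check-routes interval
```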
Utilize JupyterLab by default with the BinderHub
Setting `c.Spawner.default_url = '/lab'` should be all that's needed for this, I think.
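(In chart config that could be expressed as, roughly - assuming a plain z2jh deployment; in a BinderHub chart this would nest under `jupyterhub:`:)

```yaml
singleuser:
  defaultUrl: "/lab"
```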
Would it still work with a provided "start this file on startup" as well?
Ah, no. You'd still need to use "url to open" instead of "file to open". We can teach binderhub about JupyterLab's file-open URLs now that they exist. Then we can have an option on binderhub to build `/lab/tree/...` URLs instead of `/notebooks/...` ones.
@betatim @minrk should we go for lab or not, even without the file-open support?
I wouldn't do it today, no.
The conference is in Montreal (GMT-5). And things may need to be usable already in the morning.
- `user-placeholder` pods: they ensure we scale up in advance so the users don't have to wait about 7 minutes. If we think there won't be more than 15 people arriving within 7 minutes, we should reasonably have 7 placeholder pods.
- `activeServerLimit`: the hub will not even attempt to spawn a new pod if too many are spawned already. This could be nice, as the hub won't otherwise know about the upper cluster autoscaler limit and would fail ugly if we leave it unlimited. We should set this to the same value as the cluster autoscaler limit.

Below are the planned operations on the deployment during the conference.
We now have 5 GPUs potentially available, and a single user-placeholder ensures we have at least one quickly accessible.
```sh
# Scale placeholders to 15
kubectl scale -n neurips sts/user-placeholder --replicas 15
```
```yaml
# Update JupyterHub's upper limit of concurrent users to 100
hub:
  activeServerLimit: 100
```
```sh
kubectl get pod -o yaml <podname> | grep "image: gcr"
# or verify things here: https://console.cloud.google.com/gcr/images/binder-prod?project=binder-prod
```
Update `prePuller.extraImages.demo1.name` / `tag` etc. and re-deploy the chart with its new configuration.
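For context, a sketch of that z2jh prePuller section (image name and tag are placeholders):

```yaml
prePuller:
  extraImages:
    demo1:
      name: gcr.io/binder-prod/r2d-example  # placeholder image name
      tag: "latest"                         # placeholder tag
```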
```sh
# Scale placeholders to X
kubectl scale -n neurips sts/user-placeholder --replicas X
```
```sh
# Remove the placeholders
kubectl scale -n neurips sts/user-placeholder --replicas 0
# Remove the actual users
kubectl delete pod -n neurips -l component=singleuser-server
```
This is a great plan/schedule!
For the point on pre-pulling images: I think you mean the right thing, but the way it is written is ambiguous, so my two cents: you need to look at the name of the image that repo2docker creates for that particular repository. This is different from the name of the pod/image used by any pod running. There are two ways to see the image name:
`kubectl describe` the `jupyter-...` pod that gets launched after the build succeeds.

Ah yepp, I was thinking of the `-o yaml` as well; in practice:
```sh
kubectl get pod -o yaml <podname> | grep "image: gcr"
```
And as you suggest I can also look here, this is easiest to be confident about: https://console.cloud.google.com/gcr/images/binder-prod?project=binder-prod
We need to set up a JupyterHub with CUDA and then turn it into a BinderHub.
See https://github.com/jupyterhub/team-compass/issues/52 and https://github.com/jupyterhub/zero-to-jupyterhub-k8s/issues/994#issue-373992464
@choldgraf @betatim @aculich @consideRatio @cmd-ntrf
Based on #81 we could use the binder account?