pangeo-data / pangeo

Pangeo website + discussion of general issues related to the project.
http://pangeo.io

Documentation for xarray / dask.distributed / Google Cloud Platform #16

Closed jhamman closed 6 years ago

jhamman commented 7 years ago

We need some internal documentation for how to get our "platform" running on GCP. This documentation should make it straightforward for a user familiar with HPC platforms to get started.

cc: @jhamman, @mrocklin, @rabernat


Question for @mrocklin. @rabernat and I have been thinking that the dask-kubernetes repo is a good starting point for launching the pangeo platform on GCP, at least for a first cut. Do you have any reservations about that approach?

I was able to launch a small dask-kubernetes cluster this evening and open up a Jupyter notebook. Due to limitations in the free trial, I'll need to revisit this after our project has more GCP credits.

mrocklin commented 7 years ago

I've quite enjoyed using dask-kubernetes. It's my default cluster setup. cc @martindurant, who might be interested in this conversation.

mrocklin commented 7 years ago

We'll probably want to make our own Dockerfile.

rabernat commented 7 years ago

As long as we are making a wishlist, it would be great to have some sort of web-based "user gateway" which allows one to launch both a jupyter notebook and a distributed cluster without having to go to the console at all.

martindurant commented 7 years ago

Yes, agree - most of the functionality you are after is already in dask-kubernetes: create dask cluster, provide scheduler address, start jupyter and expose. So what's really required is an image which includes xarray and the xarray-specific notebooks. It might be reasonable to store the data in GCS.

@rabernat , we do have such a gateway server which we use for dask tutorials (this is separate from dask-kubernetes) to give users pieces of a larger google cluster, and maybe could be adapted. Where would you imagine hosting this? Or did you mean something else?

mrocklin commented 7 years ago

If we have this open long term we would need to have some way to manage users. This can get a bit tricky.

rabernat commented 7 years ago

we do have such a gateway server which we use for dask tutorials (this is separate from dask-kubernetes) to give users pieces of a larger google cluster, and maybe could be adapted. Where would you imagine hosting this?

Exactly what I had in mind. We could host it on a lightweight VM on GCP.

If we have this open long term we would need to have some way to manage users.

Yes, the idea is probably premature at this stage. But eventually I would like to grant access to a wider user base than just our proposal team members (for example, other scientists at NCAR and GFDL) to allow them to test drive the cloud system. Would be ideal to outsource authentication and user management to github...
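One way to do that down the road (assuming the gateway ends up being a JupyterHub) would be the oauthenticator package; a minimal sketch of the config, with placeholder callback URL and credentials:

    # jupyterhub_config.py -- minimal sketch of delegating auth to GitHub via
    # the oauthenticator package. The callback URL, client id, and secret are
    # placeholders for values registered with a GitHub OAuth application.
    from oauthenticator.github import GitHubOAuthenticator

    c.JupyterHub.authenticator_class = GitHubOAuthenticator
    c.GitHubOAuthenticator.oauth_callback_url = 'https://our-gateway.example.org/hub/oauth_callback'
    c.GitHubOAuthenticator.client_id = 'GITHUB_CLIENT_ID'
    c.GitHubOAuthenticator.client_secret = 'GITHUB_CLIENT_SECRET'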

jhamman commented 7 years ago

cc @atlehr, who has recently been setting up autoscaling JupyterHub instances using Kubernetes on GCP.

jhamman commented 7 years ago

@mrocklin - I have an updated docker image for the pangeo project and a fork of the dask-kubernetes repo. I'm able to launch the cluster and connect to the jupyter notebook, but the dask scheduler isn't starting for some reason. Any chance you can take a look sometime this week?

Links to docker image and docs are in a draft wiki page.

martindurant commented 7 years ago

I am glad to see development linked to dask-kubernetes! I have had a brief look through your branch, and things seem to be generally in order at a glance. There are really two main changes: the docker image (environment, password, example notebook), and the configuration. I wonder, are you planning to fork this, or to PR back into dask-kubernetes? I could imagine there being more than one set of provided configurations, one for the established "typical use" case and one for pangeo - the user could choose the base and override any parameters as needed. Multiple docker images can also be hosted in the same repo, since Docker Hub builds from a Dockerfile at a specific location in a repo. This is just a suggestion, because it is probably less work to maintain one slightly more complex repo than two slightly simpler, similar ones.

mrocklin commented 7 years ago

I think that it should be (and is?) fairly easy to just define different Dockerfile and kubernetes.yaml files for different environments. I don't think that dask-kubernetes should be in the business of holding environments for particular user groups, both because there may be a few of these in the future, and because we don't want them to be dependent on us to update their own environment.

Is there a natural place for people to put these files? Presumably Docker Hub for the docker files. Is there an obvious place to register kubernetes specs? Or perhaps we include the kubernetes yaml file in the wiki instructions? In that case the instructions might read:

  1. Install dask-kubernetes

    pip install dask-kubernetes
  2. Download [this kubernetes spec]()
  3. Create a cluster using this spec

    dask-kubernetes create my-cluster path/to/myfile.yaml

mrocklin commented 7 years ago

@jhamman my first guess is that you have allocated too many pods for your workers. I recommend opening up the kubernetes dashboard to see if there are any errors popping up.

dask-kubernetes info cluster-name

mrocklin commented 7 years ago

Another option here is that we could deploy a minimal deployment with dask-kubernetes and then improve Dask's ability to bootstrap itself with a conda environment. We used to have the ability to restart with a new environment; we could revive this.

jhamman commented 7 years ago

Thanks @mrocklin. I think I'll come back to this after we have $$$ in the pangeo project. I suspect I'm asking for more than our quota allows right now.

Thinking (out loud) about a useful interface for this, it would be nice to have something like the following: what would it look like to control the kubernetes cluster configuration from the JupyterHub level? Maybe that is not practical. If it isn't, it seems pretty straightforward to have a set of prebuilt docker images that can be specified when launching dask-kubernetes.
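Purely as an illustration (the names below are imagined, not an existing API), that notebook-level interface might look roughly like:

    # Hypothetical sketch only -- ClusterSpec and launch_cluster are imagined
    # names. The idea: describe the cluster (worker count, prebuilt docker
    # image) from the notebook, launch it, and hand the scheduler to dask.
    from dataclasses import dataclass
    from dask.distributed import Client

    @dataclass
    class ClusterSpec:
        workers: int   # number of dask worker pods
        image: str     # prebuilt docker image, e.g. "pangeo/dask-kubernetes"
        memory: str    # memory request per worker

    def launch_cluster(spec: ClusterSpec) -> str:
        """Imagined helper: create the kubernetes deployment and return the
        scheduler address (would wrap dask-kubernetes or kubespawner)."""
        raise NotImplementedError("illustrative only")

    spec = ClusterSpec(workers=20, image="pangeo/dask-kubernetes", memory="8GB")
    # scheduler_address = launch_cluster(spec)
    # client = Client(scheduler_address)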

mrocklin commented 7 years ago

Dask issue about deploying conda environments: https://github.com/dask/distributed/issues/1457

mrocklin commented 7 years ago

I know that we had to ask Google to expand our quota at some point. They were happy to do so. I think that this action actually triggered some sort of process at their end where we got a nice phone call and they asked some engineers to take a look at what we were doing. We can probably get some help from their end if we ask politely.

@rabernat do we have a GCE accounting project that we can work on? Do we have a contact at Google that can help us with some of these things?

rabernat commented 7 years ago

Yes, we have been in touch with the people at Google. We have a project created, and they are in the process of transferring our NSF-sponsored credits to it. Hopefully this will happen this week.

rabernat commented 7 years ago

p.s. @mrocklin just invited you to the GCP project

mrocklin commented 7 years ago

@minrk do JupyterHub folks have any interest in collaborating here?

This is an NSF-funded collaboration between NCAR, Columbia University, and Anaconda Inc (formerly Continuum) to build infrastructure to help atmospheric and oceanographic scientists analyze large datasets with XArray/Dask. On every launch of a notebook we would also launch a cluster on Google container engine. We might want to use the user's credentials when launching this cluster to handle billing issues.

amanda-tan commented 7 years ago

Out of curiosity, do you anticipate that most of your users will be using dask distributed? Is it actually necessary to have a separate JupyterHub, since it looks like dask-k8 does the provisioning of the notebook already? Are you thinking that JupyterHub can be used as an authentication mechanism?

jhamman commented 7 years ago

do you anticipate that most of your users will be using dask distributed

@atlehr - At this point, yes. Most everyone will be using dask-distributed. Having JupyterHub act as the gate keeper would be a nice way to abstract away the kubernetes piece.

minrk commented 7 years ago

@mrocklin yes! In particular, we want to be better able to allow users to deploy applications on Kubernetes from their notebook servers, especially when deployed from the JupyterHub helm charts. The pieces seem to be:

  1. deploy the appropriate credentials into the notebook containers, so that they have enough access to deploy things, but not enough to mess with the rest of the cluster (this would go in KubeSpawner). This may mean giving each user a namespace and confining them to it.
  2. Probably a Python client API for doing the deployment and returning appropriate connection info. You may already have most of this with dask-k8 for dask in particular, but we've been thinking of more generic kubernetes/helm deployments from the notebook for a bit.

@yuvipanda has more specific ideas on how to accomplish this, I think.
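As a rough sketch of how those two pieces could fit together from inside a notebook pod (assuming the official kubernetes Python client and a serviceaccount scoped to the user's own namespace; the namespace name below is a placeholder):

    # Sketch only: list deployments in the user's own namespace using the
    # credentials injected into the notebook pod. With the RBAC confinement
    # described in point 1, the same call against another namespace would fail.
    from kubernetes import client, config

    config.load_incluster_config()   # use the pod's serviceaccount token
    apps = client.AppsV1Api()

    for d in apps.list_namespaced_deployment(namespace="user-jhamman").items:
        print(d.metadata.name)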

mrocklin commented 7 years ago

@rabernat are we planning to start by giving particular people access to our bucket of hours on GCE or by allowing anyone to come with their own billing project and use their own funds?

@yuvipanda , @minrk , no one on this project has particular experience with JupyterHub or building GCE applications. Can you recommend any projects out in the wild from which we might want to steal-and-adapt?

yuvipanda commented 7 years ago

Thanks for tagging me in, @minrk. I'm working on enabling exactly this sort of use case for spark, tensorflow & dask in kubespawner - https://github.com/jupyterhub/kubespawner/issues/79 is the appropriate issue, where some folks from the spark community and I are discussing this. It'll give each user their own namespace + permissions only for that namespace, and allow tools like this to work great. About half of it is implemented already, and I'll start on the other half once https://github.com/jupyterhub/kubespawner/pull/64 gets merged.

@mrocklin Have you checked out z2jh.jupyter.org? That's our guide + docs on setting up JupyterHub on kubernetes, with instructions for google cloud too. You could steal + adapt from there, but if what you want is a JupyterHub on a kubernetes cluster with access to common data + dask running on that, we'll happily work on making that possible by customizing the z2jh project.

jhamman commented 7 years ago

@yuvipanda - thanks for sharing. It sounds like the features you are describing are exactly what we need.

I think in the short term, dask-kubernetes should suffice to get the Pangeo project up and going. After a few months, if we can utilize and/or contribute to the parallel development that is happening with jupyterhub/kubespawner, that seems like a logical long-term development path. (This could happen sooner if someone else wants to take it on, but I think it's outside of my scope right now.)

mrocklin commented 7 years ago

@jhamman I agree that this is probably outside of your short-term scope. That being said, I suspect that we might be an interesting use case for @yuvipanda 's work. It might make sense to engage in dialog early to make sure that we can benefit from this work in the future.

My guess is that the next useful step here is for us to spec out in more detail the kind of system that we want to achieve long term, so that @yuvipanda has a more precise understanding of what our needs are. I have thoughts on this, but I suspect that there are small differences between my perspective, @rabernat's, and @jhamman's.

I'll try to write up something with more detail later today.

rabernat commented 7 years ago

Let's assume for now that the result of our benchmarking will show that commercial cloud platforms are the optimal place to store and analyze large climate datasets (compared to HPC clusters or local servers). In that case, we will be in a position to lead the migration of this large scientific community to the cloud a few years from now.

What would a futuristic system open to the whole community look like? We would need a gateway for users coming from many different organizations, linked to different billing accounts (possibly via an NSF-Google partnership), to easily launch and manage Jupyter notebooks and dask clusters. I imagine a simple web interface where you could pick the cluster parameters, possibly resize interactively, etc. Would be even better if this could be integrated directly into JupyterLab via extensions. There would be different data stores that the user could connect to (possibly dependent on credentials / permissions).

mrocklin commented 7 years ago

Let's assume for now that the result of our benchmarking will show that commercial cloud platforms are the optimal place to store and analyze large climate datasets (compared to HPC clusters or local servers)

Just to be clear here: I suspect that they will be more convenient, but not optimal from a performance perspective. Communication-bound computations (which I think are not uncommon in your space) are likely to be significantly faster on a fast network.

That said, I don't think this is the relevant point for this discussion. I think that convenience will be more motivating than performance. We should go into this with that firmly in mind, though.

rabernat commented 7 years ago

To clarify, I'm talking specifically about data analysis, not simulation (i.e. climate modeling). Simulation will continue to be compute and communication bound and will require traditional supercomputers and Fortran for the foreseeable future. "Big data" analysis (i.e. making sense of what comes out of the simulation) is frequently I/O bound.

mrocklin commented 7 years ago

  1. User goes to a publicly visible website
  2. They are logged in and authenticated through some external system (Google, OAuth, GitHub)
  3. They provide specifications of a cluster
    • In the common case this will just be a number of nodes, but might also include more information, docker files, etc.
    • This could be a point-and-click interface through some JLab extension
    • This might also just be a command they run in a Jupyter notebook
  4. We launch that cluster for them and deploy dask on it
    • This might be billed to their attached Google billing account
    • We might also want to allow certain users to use our own accounts, though with restrictions. We have a bucket of hours from NSF/Google that we would like to use to enable a set of scientists. We would want to avoid any particular group monopolizing these hours. It may be that we have to handle this through Google compute project permissions.
  5. They connect to a notebook and run an XArray/Dask workload pointed at that Dask scheduler (see the sketch after this list)
    • This notebook might be hosted locally on our JupyterHub launched session
    • Or this notebook might be running on the cluster that we have set up for them
  6. They save their notebooks
    • This might be to a user session on our JupyterHub server
    • Or this might be a download onto their local machine
  7. If they are idle for a long time then we log them out and clean up the attached cluster to avoid excessive billing
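For step 5, the user-facing piece is just ordinary xarray/dask code; a minimal sketch, where the scheduler address, data path, and variable name are placeholders:

    # Sketch of the user's side of step 5: connect to the launched scheduler
    # and run an xarray/dask computation on it. All names are placeholders.
    import xarray as xr
    from dask.distributed import Client

    client = Client('tcp://scheduler-address:8786')   # address from the launched cluster

    ds = xr.open_mfdataset('/data/*.nc', chunks={'time': 100})   # lazy, dask-backed dataset
    result = ds['temperature'].mean(dim='time').compute()        # executes on the cluster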

amanda-tan commented 7 years ago

Is it possible to run the JupyterLab extension with the JupyterHub-Kubernetes implementation? @yuvipanda @minrk

minrk commented 7 years ago

@atlehr yes, you pick the image that users will launch into. Installing JupyterLab and any extensions in that image works the same as it would anywhere else.

Then a little configuration lets you launch users into lab instead of nbclassic:

c.Spawner.default_url = '/lab'

jhamman commented 7 years ago

@rabernat or @mrocklin - Do either of you know how to attach a read-only persistent disk to each node instance in the dask-kubernetes cluster?

We now have two PDs loaded up with some data and our own dask-kubernetes branch (https://github.com/pangeo-data/dask-kubernetes) and docker image (https://hub.docker.com/r/pangeo/dask-kubernetes/).

mrocklin commented 7 years ago

Unfortunately I don't have any experience with attaching actual disks to GCE nodes. I can ask around though if desired.

mrocklin commented 7 years ago

@martindurant would anyone in Anaconda Inc's platform team be good to ask here?

rabernat commented 7 years ago

I’m pretty sure the answer is in here: https://kubernetes.io/docs/concepts/storage/persistent-volumes/
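For what it's worth, the relevant piece seems to be a read-only gcePersistentDisk volume in the pod spec; a sketch using the kubernetes Python client models (disk name and mount path are placeholders):

    # Sketch: a read-only GCE persistent disk volume plus its mount, expressed
    # with the official `kubernetes` python client. A PD mounted read-only can
    # be attached to many nodes at once; the disk name is a placeholder.
    from kubernetes import client

    volume = client.V1Volume(
        name="obs-data",
        gce_persistent_disk=client.V1GCEPersistentDiskVolumeSource(
            pd_name="pangeo-data-disk",
            fs_type="ext4",
            read_only=True,
        ),
    )
    mount = client.V1VolumeMount(name="obs-data", mount_path="/data", read_only=True)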


jhamman commented 7 years ago

Right, I tried something like that (https://github.com/jhamman/dask-kubernetes/commit/02d5c35777549d0671b95a8f78361c113b7477ad), but that didn't seem to work. I've run out of time on it for this week but will have some time to play with it more next week. If anyone else tries it out before then, keep me posted.

rabernat commented 7 years ago

Perhaps we should get help from GCP support.

jhamman commented 7 years ago

@rabernat - I just asked the dask-kubernetes developers if they have any experience in this area. If that comes back as a no, let's pursue the GCP support avenue.

mrocklin commented 7 years ago

It's probably worth engaging Google regardless. They should know what we're up to. My experience is that they're more than happy to dedicate some internal engineering time to help people get off the ground.

rabernat commented 6 years ago

@mmccarty: thanks for your help spinning up the cluster. It would be great to have some basic docs on how we can start / stop this environment. Such documentation would live in the docs/setup_guides folder of this repo. The documentation is automatically built and posted at https://pangeo-data.github.io/pangeo/.

mmccarty commented 6 years ago

Great! Would someone please add me to this organization so I can push the updates?

rabernat commented 6 years ago

I will add you. But please make a pull request rather than a direct push.


mrocklin commented 6 years ago

I enjoyed reading about BinderHub which was released today. I'm curious if we can leverage something like this to remove our need to maintain a long-running service (or at worst maintain a stock version of BinderHub).

So let's imagine that we make a github repository that has a custom environment, probably with a few JLab extensions, and a notebook plugin to read and write notebooks to GCS.

Pangeo folks, is this workflow sufficient? It would be a bit odd in that we would separately launch a Dask cluster on an entirely different deployment (this might take a few minutes). It's a bit nice in that it separates out billing concerns and stops us from having to tend a long-running cloud deployment, which, given our skillset distribution, would be welcome.

Jupyter/BinderHub folks, is this feasible?

choldgraf commented 6 years ago

Yo - a few thoughts:

Does BinderHub allow us to authenticate users by their Google credentials and then use those credentials within their session? Or do we have to customize and go to a more general JupyterHub deployment?

The public mybinder.org service does not, but BinderHub is definitely deployable on other cloud services for whatever purposes you wish. We're treating mybinder.org as a tech demo, but if you want more custom setups (like authentication or better hardware) you can definitely deploy a BinderHub on whatever kubernetes setup you've got. We'd love to see this start happening.

Are there any restrictions on the environment we create? For example could we have custom JLab extensions?

Nope, not really. There are some restrictions on things like egress and memory limits on the public binder service, but you can control that however you wish if you have your own deployment.

Where does the current public instance of BinderHub live? (I'm assuming that we would want to bootstrap on this)

We're running it on GCP, though it is a kubernetes app, so it can be deployed on any service that can run kubernetes.

LMK if you have any other questions/thoughts!

stale[bot] commented 6 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

jhamman commented 6 years ago

closed via: http://pangeo-data.org/setup_guides/cloud.html