rabernat commented 6 years ago

As we discussed at the developer meeting, the time may have come to refactor pangeo.pydata.org.

Let's discuss how we would like our future services to look. I think we are all in agreement that, for non-persistent demos, we definitely want to move to @jhamman's binder.pangeo.io solution.

But what about the "production" services? There are some people (including myself, @chiaral, @chiara7, and who knows who else) who are using pangeo.pydata.org for actual science.

I personally really like @mrocklin's idea of refactoring pangeo.pydata.org into several more specific services. For example:

ocean.pangeo.io: all of my high resolution ocean models and physical oceanography stuff
cmip.pangeo.io: focused on CMIP5 / CMIP6 / large ensemble climate models
forecast.pangeo.io: atmospheric reanalysis and forecast
hydro.pangeo.io: @jhamman's and @rsignell-usgs's hydrologic models and datasets
solar.pangeo.io? astro.pangeo.io?

Each cluster could have a distinct:

environment with domain-specific packages
set of examples
data catalog (although all data would be available to all clusters)
dedicated administrator

The advantage of this approach is that it might catalyze collaboration around specific scientific problems. But in order for that to happen, we would still need a way to make it easier to share examples with the group.

mrocklin commented 6 years ago

But in order for that to happen, we would still need a way to make it easier to share examples with the group.

When we interact with a new community we'll be giving them a cookie-cutter repository template. That repository will have an examples directory that they will be encouraged to fill with notebooks. Given how well this has worked in the original deployment I expect that other communities will continue this tradition. As long as this is in a uniform place (which is likely given that we're providing a cookie cutter) then presumably we can scrape these directories and either place them into documentation or provide a binder-able version of the same repository.

rabernat commented 6 years ago

Good points Matt. But I worry that the complexity of adding new notebooks to the examples repo will exclude some users. Currently that process is something like:

Copy an existing example to a new location (this is how most people will probably start their work)
Modify it into something new that you want to share
Download the notebook by right-clicking in the file browser
Create a PR to the appropriate example-notebooks repo
Wait for PR to be merged
Users have to start their jupyter pod again to see the examples refresh

My understanding is that users should not use git to push directly from the cloud environment because it is not secure. But maybe I am wrong about this. Perhaps we can figure out a way to forward the git login credentials from the hub (@minrk mentioned something about this in his JupyterCon talk). Then at least they could avoid the download step.

I miss the good old days when I had a gist button in my jupyter notebooks. This was such an easy way to share what you were working on. It would be nice to aim for that sort of one-click shareability.

martindurant commented 6 years ago

Pushing to a forked copy of the examples should work directly from the container, but it'll ask for a GitHub username/password, which should be fine.

rabernat commented 6 years ago

which should be fine.

It would work, but it is not an optimal user experience. From the user perspective, they have already authenticated with git. Why should they have to do it again? Plus copying / pasting passwords on the command line is high friction.

I know some of these things are complicated and perhaps impractical. But I think it's worth discussing a vision for where we would like to go.

minrk commented 6 years ago

My understanding is that users should not use git to push directly from the cloud environment because it is not secure. But maybe I am wrong about this.

It depends on what you mean by "not secure." It's certainly "less secure" in that it's adding more people and systems to trust with your credentials. Storing credentials at rest is entrusting those running the service (you folks) to do so securely, and not abuse their access. For instance, if you enable auth state to store a GitHub API token and pass it down to the user environment, this is stored in the jupyterhub database in an encrypted form, but an admin of the JupyterHub deployment has sufficient access to decrypt this data (kubectl exec -it privileges on either the hub pod or user pod is enough). So it's a best effort to prevent malicious access, but users are not protected from trusted administrators by anything other than a given deployment's policy.

One-time login with username and password at push time, as @martindurant mentions, doesn't require trusting the hosting provider as much, so it's the usual tradeoff of trust vs. convenience. It should be possible to ensure that at least the username is set, so only the password needs typing.
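
One way to pre-set the username, sketched in Python under the assumption that the remote is an HTTPS URL: embed the username in the remote URL, so that git only prompts for the password. The helper name `with_username` and the example user and fork are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def with_username(remote_url: str, username: str) -> str:
    """Return an HTTPS remote URL with `username` embedded (user@host)."""
    parts = urlsplit(remote_url)
    if parts.scheme != "https":
        raise ValueError("only https remotes are handled in this sketch")
    # Drop any credentials already present, then prepend the username.
    host = parts.netloc.rsplit("@", 1)[-1]
    return urlunsplit(parts._replace(netloc=f"{username}@{host}"))


# Hypothetical user and fork, for illustration only.
print(with_username("https://github.com/some-user/pangeo-example-notebooks.git", "some-user"))
```

The resulting URL could then be applied with `git remote set-url origin <url>`, which keeps the password itself out of any stored configuration.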

I miss the good old days when I had a gist button in my jupyter notebooks.

The gist button could be reworked to be a server-side action, in which case it could inherit the same credentials in the env, rather than being a purely javascript extension. The user experience could still be the same button, but get rid of the fiddly token management, and use the same setup as the gist gem.

martindurant commented 6 years ago

Is this the same button as before, or something new? I assume that in this case, the user's browser takes care of auth.

martindurant commented 6 years ago

^ actually, that specific repo says it doesn't work with the new auth system of gist, so never mind.

jhamman commented 6 years ago

I think it would be really nice if we could define a "kubernetes-pangeo-jupyterhub" with something like terraform. This would help us provide specification for a kubernetes cluster as well as the helm chart stuff. I know @jonahjoughin / @aaarendt were playing around with this at some point. It would obviously be super nice if new pangeo deployments were as simple as:

cd pangeo
terraform init
terraform apply

This is not something I have a lot of experience with but streamlining the (re)deployment with something like this seems like an obvious win for us non-devop kinds.

guillaumeeb commented 6 years ago

I think it would be really nice if we could define a "kubernetes-pangeo-jupyterhub" with something like terraform

👍 on this! I'd be happy to help, but not sure I can find the time.

jacobtomlinson commented 6 years ago

You may find this useful. We create our Kubernetes cluster on AWS using a combination of Terraform and Kops.

jhamman commented 6 years ago

Two organizational thoughts here:

We need a way to determine if a related project is a good fit for pangeo. This fits in with the general governance discussions we're having. What are some criteria that we think would indicate a successful partnership?
We need to be clear about funding, support, and involvement. Specifically, if using Pangeo GCP credits, the time horizon is ~2 years. Perhaps we should review how this is going periodically. Support would mainly come from the domain team with help as possible from the community.

mrocklin commented 6 years ago

Operationally, my original thought for how this would be restructured was as three pieces:

A cluster on GCP running Kubernetes that has several JupyterHub deployments on it, one for each of the disciplines like ocean, atmo, hydro, solar, etc.
A github repository for each of those disciplines that has a few things:
- A directory that follows the repo2docker specification. This docker image gets used both for the single-user notebook image, and for the dask-worker image. This repository is managed by the people in charge of that discipline.
- A JupyterHub config file to specify other useful bits, like admin access and so forth
A CI/CD system that automatically redeploys the JupyterHub deployment whenever a change to master is pushed
A master repository that includes a list of which repositories we care about
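
A rough sketch of how the master list and the CI/CD trigger could fit together, in Python. Everything here is hypothetical: the `DEPLOYMENTS` mapping, the repository names, and the `hubs_to_redeploy` helper are illustrations of the idea, not an existing Pangeo tool.

```python
from typing import Dict, List

# Hypothetical master list: discipline name -> GitHub repository holding the
# repo2docker directory and the JupyterHub config file for that hub.
DEPLOYMENTS: Dict[str, str] = {
    "ocean": "pangeo-data/ocean",
    "hydro": "pangeo-data/hydro",
    "astro": "pangeo-data/astro",
}


def hubs_to_redeploy(pushed_repo: str, branch: str) -> List[str]:
    """Return the hubs that CI/CD should redeploy for a push event."""
    if branch != "master":
        return []  # only changes pushed to master trigger a redeploy
    return sorted(name for name, repo in DEPLOYMENTS.items() if repo == pushed_repo)


print(hubs_to_redeploy("pangeo-data/ocean", "master"))
```

Keeping the list in one place means adding a new discipline is a one-line change, while each discipline's repository stays under the control of its own maintainers.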

To move forward on this I think we need to push on the CI/CD part. I suspect that that will force a conversation about what each domain-specific repository will have to specify.

@yuvipanda, you mentioned that you had done some work on JupyterHub deployment. Is that hubploy https://hubploy.readthedocs.io/en/latest/ ?

NicWayand commented 6 years ago

Just jumping in to support the domain specific refactor and suggest a seaice.pangeo.io: I would like to use it for my sea ice prediction research and allow our SIPN2 participant modelers to use it for theirs. Would this fit within Pangeo? I can see getting more sea ice researchers onboard with the domain specific approach. Happy to help set up/test a seaice.pangeo.io, but need guidance on where to start (I assume waiting for a cookie-cutter repository template?).

mrocklin commented 6 years ago

@NicWayand yes, you're exactly the sort of person we would be targeting. Someone who would be happy to do some work to curate an environment, examples, and datasets and socialize them within a new community, but probably won't go through the effort of setting up your own Dask/XArray-enabled JupyterHub deployment.

mrocklin commented 6 years ago

From private conversation with Yuvi he mentioned that his recent work on deployment lives here: https://github.com/berkeley-dsep-infra/datahub

yuvipanda commented 6 years ago

https://hubploy.readthedocs.io/en/latest/ is the tool used to do the deployments. You can see how CircleCI is used to automatically:

Build Docker images
Deploy hubs

as an example at https://github.com/berkeley-dsep-infra/datahub.

I think Pangeo is a great use case for pushing hubploy development forward! :D

mrocklin commented 6 years ago

So @dsludwig has some time to take a look at this. I think that he probably needs an example repository that a scientific group leader would create. This will probably need both a repo2docker compatible repository as well as some config information to provide to JupyterHub. I imagine that we'll have to iterate on this a bit before we get it right. Do we have something that would be a good starting point? I'm thinking that one of the Pangeo-Binder examples might be useful. cc @jhamman

mrocklin commented 6 years ago

I imagine that alongside a repo2docker-compatible repository we'll also need a jupyter-config.yaml file, like https://github.com/pangeo-data/pangeo/blob/master/gce/jupyter-config.yaml (though presumably a lot of that, like the privileged containers and fuse mounts, can be ripped out).

jhamman commented 6 years ago

I would be happy using a fork of https://github.com/pangeo-data/pangeo-example-notebooks as a place to prototype this work. How about:

Use the Binder stuff I've already done
Move the notebooks to a subdirectory (say /notebooks)
Make a jupyterhub directory where we can put the configs

mrocklin commented 6 years ago

That sounds sensible to me. @jhamman would you mind playing the role of "model science group leader" and constructing a repository that has the information that you think you should have to provide?

jhamman commented 6 years ago

I think I would just want to share this: https://github.com/pangeo-data/pangeo-example-notebooks/tree/pangeo-refactor

As a group leader, I'm hoping that (nearly) all of the JupyterHub configuration is done upstream. I just want to provide:

my example notebooks
my environment file
my computation configs (dask configs in this case)

mrocklin commented 6 years ago

I think that you'll also want to include a small JupyterHub config file that lists names of github users or orgs that you'd like to see given access

rabernat commented 6 years ago

I am happy to see this moving along and am eager to contribute.

I personally would like to see ocean.pangeo.io be developed. However, I do not have the bandwidth right now (start of semester) to be the point person on this. (I'm hoping @raphaeldussin can get involved when he gets back from vacation.) So I support the idea of using @jhamman as a guinea pig group leader.

A couple of technical points:

When this is done, will we still need a generic pangeo helm-chart? If you look at the actual chart, it is extremely thin. Perhaps this can be completely factored out.
We still need to pursue the idea of a data catalog, which I believe will really enhance the "group" experience. I hope to have an intern working on this this semester.

dsludwig commented 6 years ago

I've started on setting up an example deployment configuration here: https://github.com/dsludwig/example.pangeo.io-deploy based on the notebooks provided by @jhamman using https://github.com/yuvipanda/hubploy.

One change I've made is using repo2docker to build the image instead of a simple Docker build, with the idea that it would be easier for a group leader to create their desired environment.

Unfortunately I seem to have run into some issues with getting authentication working: https://circleci.com/gh/dsludwig/example.pangeo.io-deploy/6

@yuvipanda would you be able to help me figure out how to resolve it?

rabernat commented 6 years ago

Now that pangeo-binder is more-or-less deployed, I am eager to move forward with this.

I would like to be involved in setting up ocean.pangeo.io. Hopefully @raphaeldussin can get involved as well.

raphaeldussin commented 6 years ago

@rabernat sure!

yuvipanda commented 6 years ago

@dsludwig awesome! I've two questions:

what rights does the service account present in gcloud-service-key.json have?
have you tried making the image public: https://cloud.google.com/container-registry/docs/access-control#serving_images_publicly?

I would recommend trying to debug this with the following:

Manually push something to pangeo-181919/example-pangeo-io-notebook:latest. Can be anything!
Make sure the container is available publicly
Try re-running it.

I think this is a change that needs to happen in hubploy, where it can't differentiate between actual 'internal service error' and just 'this container image does not exist at all' vs 'this container image does not exist at this revision'. If following the three steps above fixes it, that'll give us a clear path forward on hubploy.

Thank you for helping test hubploy! <3

dsludwig commented 6 years ago

Thanks @yuvipanda, that was enough to get me unblocked.

What I've got so far:

Cluster deployed using the sample scripts provided in https://github.com/pangeo-data/pangeo/pull/378/files
CI process attached to the repository: https://circleci.com/gh/dsludwig/example.pangeo.io-deploy/16
- when the repository changes, the cluster is redeployed with the config values & image
Image based on the example notebooks provided by @jhamman
- https://github.com/dsludwig/example.pangeo.io-deploy/tree/master/deployments/example.pangeo.io/image
- I'm still working on this: the home directory used by repo2docker conflicts with the one provided by JupyterHub

yuvipanda commented 6 years ago

@dsludwig awesome!

repo2docker tries to put the entire environment (virtualenvs, conda envs, R stuff, etc) in /srv, and only the contents in $HOME. This was an explicit decision, since in most JupyterHub installations $HOME is persistent storage. I'd recommend using repo2docker to construct the image environments, but using something like https://github.com/jupyterhub/nbgitpuller to provide contents of the git repository on the home directory.

dsludwig commented 5 years ago

Thanks @yuvipanda. I ended up getting the home directory populated with an entrypoint, copying content from /srv/home to ${HOME}.
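
A minimal sketch of that entrypoint step in Python. It assumes that files already present in the (persistent) home directory should be left untouched; that skip-existing behavior and the function name `populate_home` are assumptions for illustration, and the actual entrypoint may differ.

```python
import shutil
from pathlib import Path


def populate_home(src: Path, home: Path) -> list:
    """Copy files from `src` into `home`, skipping files that already exist.

    Returns the relative paths that were copied.
    """
    copied = []
    for path in sorted(src.rglob("*")):
        if not path.is_file():
            continue
        rel = path.relative_to(src)
        target = home / rel
        if target.exists():
            continue  # assumption: never overwrite a user's existing file
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(path, target)
        copied.append(str(rel))
    return copied
```

An entrypoint script would call something like `populate_home(Path("/srv/home"), Path.home())` before starting the notebook server, so a user's edited copies survive a restart while new examples still appear.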

I have a deployed instance that seems to be working correctly. Try it out here: http://104.154.59.98

It's deployed with the notebooks here: https://github.com/dsludwig/example.pangeo.io-deploy/tree/staging/deployments/example.pangeo.io/image and the environment here: https://github.com/dsludwig/example.pangeo.io-deploy/blob/staging/deployments/example.pangeo.io/image/binder/environment.yml

How I imagine this working for the different deploys:

group leader will clone/fork/cookiecutter a new repository for each deploy
pangeo representative will create a new cluster & service account with the appropriate permissions and share the account key with the group leader
group leader creates a CircleCI job for the new repository, fills in the appropriate environment variables
group leader customizes the configuration, examples and environment to fit their requirements

mrocklin commented 5 years ago

It would be good to get @jhamman's thoughts here.

As a next step I recommend that we separate out the deployments/example.pangeo.io folder to a separate repository and give access to @jhamman to play around with it, commit changes, see results, etc. He might be able to provide feedback on what feels natural and not.

(I'm volunteering Joe here, but others may also be interested in participating in this)

rabernat commented 5 years ago

@dsludwig this seems great! I think the procedure you have defined sounds reasonable.

It could be useful to brainstorm what are all the ways that the different deployments can be customized. Here are a few:

environment and packages (completely specified by environment.yml)
example notebooks (should ideally be domain specific) (also, what is the procedure for adding / updating examples?)
cluster size and machine type. For example, ocean.pangeo.io will need a very high memory / cpu ratio for the dask workers (ideally > 16GB per worker core), plus beefy notebook pods (ideally 32 GB or more). This could be tricky because it has to be specified at the cluster creation step.
Custom login page information (image / logo, points of contact, description of resource, etc.)
User whitelist (and possibly OAuth mechanism, unless we are set on always using github)
Data catalog? Do we want to use separate data catalogs for each deployment? Or have one master catalog? If one master catalog, how do we point users towards the relevant data. (Related to #394)
Logging options. Is grafana enabled by default in the cookiecutter repo? For this iteration, we definitely want to capture metrics from the beginning and make these accessible to the administrators.
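
To make the memory / cpu ratio from the cluster-size item concrete, here is a small arithmetic sketch in Python. The 16 GB per core threshold comes from the list above; the node shapes in the loop are made-up examples, not actual machine types, and the calculation ignores memory reserved by the system.

```python
def memory_per_core(node_memory_gb: float, node_cores: int) -> float:
    """Memory available per core on a node, in GB."""
    return node_memory_gb / node_cores


def meets_ratio(node_memory_gb: float, node_cores: int, min_gb_per_core: float = 16.0) -> bool:
    """True if the node offers more than `min_gb_per_core` GB per core."""
    return memory_per_core(node_memory_gb, node_cores) > min_gb_per_core


# Made-up node shapes, for illustration only.
for mem, cores in [(30, 8), (104, 16), (208, 8)]:
    ratio = memory_per_core(mem, cores)
    print(f"{mem} GB / {cores} cores: {ratio:.2f} GB per core, ok={meets_ratio(mem, cores)}")
```

A check like this could run before cluster creation, since the machine type cannot easily be changed afterwards.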

I would like to get @raphaeldussin to try to set up ocean.pangeo.io. Raphael is very knowledgeable about a huge range of science and technical issues; however, he doesn't have too much experience with cloud. (In contrast to @jhamman, who is very familiar with the whole stack.) If he doesn't mind being a guinea pig, this will allow us to find the pain points with the deployment procedure.

dsludwig commented 5 years ago

Now that https://github.com/terraform-providers/terraform-provider-google/issues/2022 is solved, we may be able to include the terraform scripts from here in the CI setup. This would allow the cluster admin to configure machine size.

mrocklin commented 5 years ago

@dsludwig yesterday you mentioned that you probably needed to push some changes upstream to a few other repositories (hubploy, repo2docker). Would you mind linking to those PRs here when you submit them?

dsludwig commented 5 years ago

I've submitted https://github.com/jupyter/repo2docker/pull/413 and https://github.com/yuvipanda/hubploy/pull/2

jonahjoughin commented 5 years ago

@jhamman I have a working example of a terraform-deployable cluster on AWS here which @aaarendt has been using for the last week or so. Nothing is running through helm right now, because everything is going directly through terraform's Kubernetes provider, but I've gotten the deployment process down to:

terraform init
terraform apply --target=module.eks
terraform apply --target=module.kubernetes

guillaumeeb commented 5 years ago

@jonahjoughin it seems that you're not using JupyterHub or dask-kubernetes in your deployment, am I correct?

If I understand correctly, you start one notebook server and one dask scheduler. Does your cluster support multitenancy? Does it provide auto scaling features based on the processing load, as KubeCluster does with adaptive mode?

jonahjoughin commented 5 years ago

It isn't using JupyterHub or dask-kubernetes for the time being. The primary aim of this project is to make it simple for anyone to set up and tear down a private cluster quickly. Because there isn't any need for multi tenancy, autoscaling is relatively unimportant. The pool of dask workers simply fills up any available space on the cluster, the maximum size of which is determined by the user.

It might be possible to use something like terraform-provider-helm to adapt this to JupyterHub in the future.

mrocklin commented 5 years ago

Checking in here. It looks like @dsludwig has an implementation up at https://github.com/pangeo-data/example.pangeo.io-deploy

There hasn't been much activity around this, which I'm attributing to one of two causes:

Relevant people don't know that this exists
The current instructions are too intense for people to try things out.

@dsludwig I wonder if it would make sense just to go through this process a few times and set up a few repositories at

github.com/pangeo-data/astro
github.com/pangeo-data/ocean
github.com/pangeo-data/hydro
github.com/pangeo-data/sandbox

and set up the CI/CD system for those repositories. Then we can give rights to those repositories to people and have them go to town without having to first step through the process. Thoughts?

rabernat commented 5 years ago

👍 to your plan Matt. Then we can try to get @raphaeldussin started on ocean.

arokem commented 5 years ago

I am planning to try that out, once I find the bandwidth.

jhamman commented 5 years ago

@bartnijssen and @NicWayand are the ones currently responsible for the hydro and polar pangeo deployments. I think @NicWayand is already using the CI deployment tools; @bartnijssen has yet to give that a try (IIRC).

NicWayand commented 5 years ago

I just made a PR with some changes to @dsludwig's instructions here.

jhamman commented 5 years ago

I'm going to go ahead and say we're ready to tear down pangeo.pydata.org.

What are the steps we need to take to make that happen?
Do we want to archive people's PDs somehow?
Do we want to make a few loud announcements (twitter/github/email) to make sure people are aware of the change?
How do we want to coordinate getting people access to the refactored hubs (e.g. ocean.pangeo.io)?

mrocklin commented 5 years ago

Do we want to archive people's PDs somehow?

I suggest that we don't do this.

guillaumeeb commented 5 years ago

How do we want to coordinate getting people access to the refactored hubs (e.g. ocean.pangeo.io)?

#494 could be one of the answers. We should redirect people to the new domain-specific hubs, or directly to Binder, depending on their use of Pangeo. If we have a page on pangeo.io explaining the new hubs, this should ease the process.

Do we want to make a few loud announcements (twitter/github/email) to make sure people are aware of the change?

Having a short communication on the channels you mention sounds good!

rsignell-usgs commented 5 years ago

We might also provide a short recipe for folks to back up their stuff. I just tarred and downloaded my notebooks using the approach below, but I'm sure there is a more robust one-liner that could replace this:

Create a tarfile with all notebooks except those in .ipynb_checkpoints directories:

find . -type f -name '*.ipynb' -print | grep -v '.ipynb_check' > list_of_notebooks
tar -cvf backup.tar -T list_of_notebooks
gzip backup.tar

Download backup.tar.gz via the Jupyter interface.
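
For reference, roughly the same backup can be sketched in Python with only the standard library. This is an alternative under stated assumptions, not a drop-in replacement: the function name `backup_notebooks` is made up, and it skips any path with a component starting with `.ipynb_check`, which only approximates the grep filter above.

```python
import tarfile
from pathlib import Path


def backup_notebooks(root: Path, archive: Path) -> list:
    """Write every *.ipynb under `root` to a gzipped tar, skipping checkpoint dirs.

    Returns the archived paths, relative to `root`.
    """
    notebooks = [
        p for p in sorted(root.rglob("*.ipynb"))
        if p.is_file()
        and not any(part.startswith(".ipynb_check") for part in p.relative_to(root).parts)
    ]
    with tarfile.open(archive, "w:gz") as tar:
        for nb in notebooks:
            # Store paths relative to root so the archive unpacks cleanly elsewhere.
            tar.add(nb, arcname=str(nb.relative_to(root)))
    return [str(nb.relative_to(root)) for nb in notebooks]
```

Calling `backup_notebooks(Path.home(), Path.home() / "backup.tar.gz")` from a notebook would produce a file that can then be downloaded via the Jupyter interface as described.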

pangeo-data / pangeo

refactoring pangeo.pydata.org #373
