Closed: rabernat closed this issue 5 years ago
> But in order for that to happen, we would still need a way to make it easier to share examples with the group.
When we interact with a new community we'll be giving them a cookie-cutter repository template. That repository will have an examples directory that they will be encouraged to fill with notebooks. Given how well this has worked in the original deployment, I expect that other communities will continue this tradition. As long as this is in a uniform place (which is likely given that we're providing a cookie cutter), then presumably we can scrape these directories and either place them into documentation or provide a binder-able version of the same repository.
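For what it's worth, here is a rough sketch of what that scraping step could look like. It assumes each community repository has already been cloned locally and keeps its notebooks under an examples directory. The function name `collect_examples`, the `docs_dir` layout, and the choice to skip `.ipynb_checkpoints` are my own assumptions for illustration, not an existing Pangeo tool.

```python
from pathlib import Path
import shutil


def collect_examples(repo_dirs, docs_dir, examples_subdir="examples"):
    """Copy notebooks from <repo>/<examples_subdir> into docs_dir/<repo name>/.

    Hypothetical helper: the directory layout is an assumption based on the
    cookie-cutter idea, not a documented convention.
    """
    docs_dir = Path(docs_dir)
    copied = []
    for repo in map(Path, repo_dirs):
        src_root = repo / examples_subdir
        if not src_root.is_dir():
            continue  # this community has not added any examples yet
        for nb in sorted(src_root.rglob("*.ipynb")):
            # Skip Jupyter's autosave copies.
            if ".ipynb_checkpoints" in nb.relative_to(src_root).parts:
                continue
            dest = docs_dir / repo.name / nb.relative_to(src_root)
            dest.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(nb, dest)
            copied.append(dest)
    return copied
```

Something like `collect_examples(["ocean", "hydro"], "docs/examples")` would then gather the notebooks from both clones into one tree that a documentation build or a binder-able repository could pick up.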
Good points Matt. But I worry that the complexity of adding new notebooks to the examples repo will exclude some users. Currently that process is something like:
My understanding is that users should not use git to push directly from the cloud environment because it is not secure. But maybe I am wrong about this. Perhaps we can figure out a way to forward the git login credentials from the hub (@minrk mentioned something about this in his JupyterCon talk). Then at least they could avoid the download step.
I miss the good old days when I had a gist button in my jupyter notebooks. This was such an easy way to share what you were working on. It would be nice to aim for that sort of one-click shareability.
Pushing to a forked copy of the examples should work directly from the container, but it'll ask for a GitHub username/password, which should be fine.
> which should be fine.
It would work, but it is not an optimal user experience. From the user's perspective, they have already authenticated with git. Why should they have to do it again? Plus, copying and pasting passwords on the command line is high friction.
I know some of these things are complicated and perhaps impractical. But I think it's worth discussing a vision for where we would like to go.
> My understanding is that users should not use git to push directly from the cloud environment because it is not secure. But maybe I am wrong about this.
It depends on what you mean by "not secure." It's certainly "less secure" in that it's adding more people and systems to trust with your credentials. Storing credentials at rest is entrusting those running the service (you folks) to do so securely, and not abuse their access. For instance, if you enable auth state to store a GitHub API token and pass it down to the user environment, this is stored in the JupyterHub database in an encrypted form, but an admin of the JupyterHub deployment has sufficient access to decrypt this data (kubectl exec -it privileges on either the hub pod or user pod is enough). So it's a best effort to prevent malicious access, but users are not protected from trusted administrators by anything other than a given deployment's policy.
A one-time login with username and password at push time, as @martindurant mentions, doesn't require trusting the hosting provider as much, so it's the usual tradeoff of trust vs convenience. It should be possible to ensure that at least the username is set, so only the password needs typing.
> I miss the good old days when I had a gist button in my jupyter notebooks.
The gist button could be reworked to be a server-side action, in which case it could inherit the same credentials in the env, rather than being a purely javascript extension. The user experience could still be the same button, but get rid of the fiddly token management, and use the same setup as the gist gem.
Is this the same button as before, or something new? I assume that in this case, the user's browser takes care of auth.
^ actually, that specific repo says it doesn't work with the new auth system of gist, so never mind.
I think it would be really nice if we could define a "kubernetes-pangeo-jupyterhub" with something like terraform. This would help us provide specification for a kubernetes cluster as well as the helm chart stuff. I know @jonahjoughin / @aaarendt were playing around with this at some point. It would obviously be super nice if new pangeo deployments were as simple as:
cd pangeo
terraform init
terraform apply
This is not something I have a lot of experience with but streamlining the (re)deployment with something like this seems like an obvious win for us non-devop kinds.
> I think it would be really nice if we could define a "kubernetes-pangeo-jupyterhub" with something like terraform
👍 on this! I'd be happy to help, but I'm not sure I can find the time.
You may find this useful. We create our Kubernetes cluster on AWS using a combination of Terraform and Kops.
Two organizational thoughts here:
Operationally, my original thought was that this would be restructured into three pieces
To move forward on this I think we need to push on the CI/CD part. I suspect that that will force a conversation about what each domain-specific repository will have to specify.
@yuvipanda, you mentioned that you had done some work on JupyterHub deployment. Is that hubploy (https://hubploy.readthedocs.io/en/latest/)?
Just jumping in to support the domain specific refactor and suggest a seaice.pangeo.io: I would like to use it for my sea ice prediction research and allow our SIPN2 participant modelers to use it for theirs. Would this fit within Pangeo? I can see getting more sea ice researchers onboard with the domain specific approach. Happy to help set up/test a seaice.pangeo.io, but need guidance on where to start (I assume waiting for a cookie-cutter repository template?).
@NicWayand yes, you're exactly the sort of person we would be targeting: someone who would be happy to do some work to curate an environment, examples, and datasets and socialize them within a new community, but probably won't go through the effort of setting up their own Dask/XArray-enabled JupyterHub deployment.
From private conversation with Yuvi he mentioned that his recent work on deployment lives here: https://github.com/berkeley-dsep-infra/datahub
https://hubploy.readthedocs.io/en/latest/ is the tool used to do the deployments. You can see how CircleCI is used to automatically:
as an example at https://github.com/berkeley-dsep-infra/datahub.
I think Pangeo is a great use case for pushing hubploy development forward! :D
So @dsludwig has some time to take a look at this. I think that he probably needs an example repository that a scientific group leader would create. This will probably need both a repo2docker compatible repository as well as some config information to provide to JupyterHub. I imagine that we'll have to iterate on this a bit before we get it right. Do we have something that would be a good starting point? I'm thinking that one of the Pangeo-Binder examples might be useful. cc @jhamman
I imagine that alongside a repo2docker compatible repository we'll also need a jupyter-config.yaml file, like https://github.com/pangeo-data/pangeo/blob/master/gce/jupyter-config.yaml (though presumably a lot of that, like the privileged containers and fuse mounts, can be ripped out).
I would be happy using a fork of https://github.com/pangeo-data/pangeo-example-notebooks as a place to prototype this work. How about:
That sounds sensible to me. @jhamman would you mind playing the role of "model science group leader" and constructing a repository that has the information that you think you should have to provide?
On Fri, Sep 7, 2018 at 12:58 PM, Joe Hamman wrote:
> I would be happy using a fork of https://github.com/pangeo-data/pangeo-example-notebooks as a place to prototype this work. How about:
> - Use the Binder stuff I've already done
> - Move the notebooks to a subdirectory (say /notebooks)
> - Make a jupyterhub directory where we can put the configs
-- You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/pangeo-data/pangeo/issues/373#issuecomment-419502503, or mute the thread https://github.com/notifications/unsubscribe-auth/AASszECuqsq6St9SBjcg0lR3YSNDolVYks5uYqXAgaJpZM4WN-_z .
I think I would just want to share this: https://github.com/pangeo-data/pangeo-example-notebooks/tree/pangeo-refactor
As a group leader, I'm hoping that (nearly) all of the JupyterHub configuration is done upstream. I just want to provide:
I think that you'll also want to include a small JupyterHub config file that lists names of github users or orgs that you'd like to see given access
On Fri, Sep 7, 2018 at 1:08 PM, Joe Hamman wrote:
> I think I would just want to share this: https://github.com/pangeo-data/pangeo-example-notebooks/tree/pangeo-refactor
> As a group leader, I'm hoping that (nearly) all of the JupyterHub configuration is done upstream. I just want to provide:
> - my example notebooks
> - my environment file
> - my computation configs (dask configs in this case)
I am happy to see this moving along and am eager to contribute.
I personally would like to see ocean.pangeo.io be developed. However, I do not have the bandwidth right now (start of semester) to be the point person on this. (I'm hoping @raphaeldussin can get involved when he gets back from vacation.) So I support the idea of using @jhamman as a guinea pig group leader.
A couple of technical points:
I've started on setting up an example deployment configuration here: https://github.com/dsludwig/example.pangeo.io-deploy based on the notebooks provided by @jhamman using https://github.com/yuvipanda/hubploy.
One change I've made is using repo2docker to build the image instead of a simple Docker build, with the idea that it would be easier for a group leader to create their desired environment.
Unfortunately I seem to have run into some issues with getting authentication working: https://circleci.com/gh/dsludwig/example.pangeo.io-deploy/6
@yuvipanda would you be able to help me figure out how to resolve it?
Now that pangeo-binder is more-or-less deployed, I am eager to move forward with this.
I would like to be involved in setting up ocean.pangeo.io. Hopefully @raphaeldussin can get involved as well.
@rabernat sure!
@dsludwig awesome! I've two questions:
I would recommend trying to debug this with the following:
pangeo-181919/example-pangeo-io-notebook:latest. Can be anything!
I think this is a change that needs to happen in hubploy, which can't differentiate between an actual 'internal service error', 'this container image does not exist at all', and 'this container image does not exist at this revision'. If following the three steps above fixes it, that'll give us a clear path forward on hubploy.
Thank you for helping test hubploy! <3
Thanks @yuvipanda, that was enough to get me unblocked.
What I've got so far:
@dsludwig awesome!
repo2docker tries to put the entire environment (virtualenvs, conda envs, R stuff, etc) in /srv, and only the contents in $HOME. This was an explicit decision, since in most JupyterHub installations $HOME is persistent storage. I'd recommend using repo2docker to construct the image environments, but using something like https://github.com/jupyterhub/nbgitpuller to provide contents of the git repository on the home directory.
Thanks @yuvipanda. I ended up getting the home directory populated with an entrypoint, copying content from /srv/home to ${HOME}.
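The entrypoint itself isn't shown in this thread, so the following is only a sketch of the copy step it describes, written in Python. The function name `populate_home` is hypothetical, and skipping files that already exist is my assumption: since $HOME is usually persistent storage, it seems safer not to overwrite a notebook the user has already edited.

```python
import os
import shutil
from pathlib import Path


def populate_home(src="/srv/home", home=None):
    """Copy files from src into home, leaving files that already exist untouched.

    Sketch only: the skip-if-present policy is an assumption, not the
    behaviour of the real entrypoint.
    """
    src = Path(src)
    home = Path(home or os.environ["HOME"])
    copied = []
    for item in sorted(src.rglob("*")):
        if item.is_dir():
            continue  # directories are created on demand below
        dest = home / item.relative_to(src)
        if dest.exists():
            continue  # never clobber a user's copy on persistent storage
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(item, dest)
        copied.append(dest)
    return copied
```

An entrypoint script could call `populate_home()` once and then hand off to the usual notebook server command.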
I have a deployed instance that seems to be working correctly. Try it out here: http://104.154.59.98
It's deployed with the notebooks here: https://github.com/dsludwig/example.pangeo.io-deploy/tree/staging/deployments/example.pangeo.io/image and the environment here: https://github.com/dsludwig/example.pangeo.io-deploy/blob/staging/deployments/example.pangeo.io/image/binder/environment.yml
How I imagine this working for the different deploys:
It would be good to get @jhamman's thoughts here.
As a next step I recommend that we separate out the deployments/example.pangeo.io folder to a separate repository and give access to @jhamman to play around with it, commit changes, see results, etc. He might be able to provide feedback on what feels natural and what does not.
(I'm volunteering Joe here, but others may also be interested in participating in this)
On Fri, Sep 21, 2018 at 2:29 PM, Derek Ludwig wrote:
> Thanks @yuvipanda. I ended up getting the home directory populated with an entrypoint, copying content from /srv/home to ${HOME}.
> I have a deployed instance that seems to be working correctly. Try it out here: http://104.154.59.98
> It's deployed with the notebooks here: https://github.com/dsludwig/example.pangeo.io-deploy/tree/staging/deployments/example.pangeo.io/image and the environment here: https://github.com/dsludwig/example.pangeo.io-deploy/blob/staging/deployments/example.pangeo.io/image/binder/environment.yml
> How I imagine this working for the different deploys:
> - group leader will clone/fork/cookiecutter a new repository for each deploy
> - pangeo representative will create a new cluster & service account with the appropriate permissions and share the account key with the group leader
> - group leader creates a CircleCI job for the new repository, fills in the appropriate environment variables
> - group leader customizes the configuration, examples and environment to fit their requirements
@dsludwig this seems great! I think the procedure you have defined sounds reasonable.
It could be useful to brainstorm all the ways that the different deployments can be customized. Here are a few:
environment.yml)
I would like to get @raphaeldussin to try to set up ocean.pangeo.io. Raphael is very knowledgeable about a huge range of science and technical issues; however, he doesn't have too much experience with cloud. (In contrast to @jhamman, who is very familiar with the whole stack.) If he doesn't mind being a guinea pig, this will allow us to find the pain points with the deployment procedure.
Now that https://github.com/terraform-providers/terraform-provider-google/issues/2022 is solved, we may be able to include the terraform scripts from here in the CI setup. This would allow the cluster admin to configure machine size.
@dsludwig yesterday you mentioned that you probably needed to push some changes upstream to a few other repositories (hubploy, repo2docker). Would you mind linking to those PRs here when you submit them?
@jhamman I have a working example of a terraform-deployable cluster on AWS here which @aaarendt has been using for the last week or so. Nothing is running through helm right now, because everything is going directly through terraform's Kubernetes provider, but I've gotten the deployment process down to:
terraform init
terraform apply --target=module.eks
terraform apply --target=module.kubernetes
@jonahjoughin it seems that you're not using JupyterHub or dask-kubernetes in your deployment, am I correct?
If I understand correctly, you start one notebook server and one dask scheduler. Does your cluster support multitenancy? Does it provide autoscaling based on the processing load, as KubeCluster does with adaptive mode?
It isn't using JupyterHub or dask-kubernetes for the time being. The primary aim of this project is to make it simple for anyone to set up and tear down a private cluster quickly. Because there isn't any need for multi tenancy, autoscaling is relatively unimportant. The pool of dask workers simply fills up any available space on the cluster, the maximum size of which is determined by the user.
It might be possible to use something like terraform-provider-helm to adapt this to JupyterHub in the future.
Checking in here. It looks like @dsludwig has an implementation up at https://github.com/pangeo-data/example.pangeo.io-deploy
There hasn't been much activity around this, which I'm attributing to one of two causes:
@dsludwig I wonder if it would make sense just to go through this process a few times and set up a few repositories at
and set up the CI/CD system for those repositories. Then we can give rights to those repositories to people and have them go to town without having to first step through the process. Thoughts?
👍 to your plan, Matt. Then we can try to get @raphaeldussin started on ocean.
I am planning to try that out, once I find the bandwidth.
On Tue, Oct 9, 2018 at 6:58 AM, Ryan Abernathey wrote:
> 👍 to your plan, Matt. Then we can try to get @raphaeldussin started on ocean.
@bartnijssen and @NicWayand are the ones currently responsible for the hydro and polar pangeo deployments. I think @NicWayand is already using the CI deployment tools, @bartnijssen has yet to give that a try (IIRC).
I just made a PR with some changes to @dsludwig's instructions here.
I'm going to go ahead and say we're ready to tear down pangeo.pydata.org.
Do we want to archive people's PDs somehow?
I suggest that we don't do this.
How do we want to coordinate getting people access to the refactored hubs (e.g. ocean.pangeo.io)?
Do we want to make a few loud announcements (twitter/github/email) to make sure people are aware of the change?
Having a short communication on the channels you mention sounds good!
We might also provide a short recipe for folks to back up their stuff. I just tarred and downloaded my notebooks using the approach below, but I'm sure there is a more robust one-liner that could replace this:
List the notebooks, excluding .ipynb_check directories:
find . -type f -name '*.ipynb' -print | grep -v '.ipynb_check' > list_of_notebooks
tar -cvf backup.tar -T list_of_notebooks
gzip backup.tar
Download backup.tar.gz via the Jupyter interface.
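As one possible alternative to the find/tar/gzip steps, the same backup can be done from a notebook cell with Python's standard tarfile module. This is only a sketch: the function name `backup_notebooks` is made up for illustration, and it skips anything under an `.ipynb_checkpoints` directory, which I'm assuming is what the grep above was meant to filter out.

```python
import tarfile
from pathlib import Path


def backup_notebooks(root=".", archive="backup.tar.gz"):
    """Write every *.ipynb under root into a gzipped tarball.

    Returns the archived paths, relative to root.
    """
    root = Path(root)
    names = []
    with tarfile.open(archive, "w:gz") as tar:
        for nb in sorted(root.rglob("*.ipynb")):
            rel = nb.relative_to(root)
            # Skip Jupyter's autosave copies and anything that is not a file.
            if ".ipynb_checkpoints" in rel.parts or not nb.is_file():
                continue
            tar.add(nb, arcname=rel.as_posix())
            names.append(rel.as_posix())
    return names
```

Running `backup_notebooks()` in the home directory writes backup.tar.gz there, which can then be downloaded through the Jupyter file browser as before.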
As we discussed at the developer meeting, the time may have come to refactor pangeo.pydata.org.
Let's discuss how we would like our future services to look. I think we are all in agreement that, for non-persistent demos, we definitely want to move to @jhamman's binder.pangeo.io solution.
But what about the "production" services? There are some people (including myself, @chiaral, @chiara7, and who knows who else) who are using pangeo.pydata.org for actual science.
I personally really like @mrocklin's idea of refactoring pangeo.pydata.org into several more specific services. For example:
Each cluster could have a distinct:
The advantage of this approach is that it might catalyze collaboration around specific scientific problems. But in order for that to happen, we would still need a way to make it easier to share examples with the group.