pangeo-data / pangeo

Pangeo website + discussion of general issues related to the project.
http://pangeo.io

AWS Deployment #71

Closed · rabernat closed this 4 years ago

rabernat commented 6 years ago

It would be great to deploy our jupyterhub setup on AWS. There is a lot of community investment already in AWS. At ESIP, the HDF guys mentioned they would be interested in collaborating on this.

@jreadey, @rsignell-usgs: is either of you available to work on this?

yuvipanda commented 6 years ago

Current AWS docs are at https://zero-to-jupyterhub.readthedocs.io/en/v0.5-doc/create-k8s-cluster.html#setting-up-kubernetes-on-amazon-web-services-aws, and https://github.com/jupyterhub/zero-to-jupyterhub-k8s/issues/299 is the issue for documenting this on Amazon's new managed Kubernetes service.

jhamman commented 6 years ago

Also cc @robfatland and @atlehr who are currently running a jupyterhub instance at AWS and may be interested in migrating to the pangeo setup. Do either of you want to try this out?

rsignell-usgs commented 6 years ago

@jreadey and I are going to write a proposal to https://aws.amazon.com/earth/research-credits/ to implement this framework on AWS. Currently the AWS call for research credits is focusing on proposals that use "Earth on AWS" datasets (https://aws.amazon.com/earth/).

One of those datasets is UK met office forecast data (https://aws.amazon.com/public-datasets/mogreps/) which are in NetCDF4 files, well suited for analysis with this framework.

Our plan is to get this jupyterhub setup going on AWS and also to put that NetCDF4 data into HSDS and perhaps compare/contrast access with zarr. @jflasher from AWS is willing to find us help if we run into problems with the deployment.

Can someone (perhaps offline at rsignell@usgs.gov) give me an idea of how many credits you guys have used in the last month so that I have some idea how much to ask AWS for?

rabernat commented 6 years ago

@rsignell-usgs: that's great news! I support your plans 100%.

We published our original NSF proposal under a CC-BY license here: https://figshare.com/articles/Pangeo_NSF_Earthcube_Proposal/5361094. I encourage you to reuse any parts of this you wish for the AWS proposal. And it would be great to have your proposal shared with the community under a similar license.

I see no reason why the cost details have to be communicated in private. For Jan 1-15, we did about $400 in compute and $20 in storage on GCP. We are storing about 700 GB right now, but we are about to start uploading some much bigger datasets. So I expect storage costs to increase somewhat. The compute charges reflect the usage of pangeo.pydata.org and the associated dask clusters. We have not really been doing any heavy, long-running calculations, so I also expect that to increase.

rabernat commented 6 years ago

This was the biggest single line item on our 15-day billing statement:

Compute Engine Standard Intel N1 2 VCPU running in Americas: 3332.485 Hours

mrocklin commented 6 years ago

Note that much of this cost is just idle nodes. My guess is that if users can become comfortable with the startup time of a node on GCE then we can use adaptive deployments and become much more efficient.
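The idea, roughly, is that the cluster size should track the actual backlog of work instead of staying fixed. A toy sketch of that decision (made-up function and numbers, not the actual dask adaptive code):

```python
# Illustrative sketch of adaptive scaling: pick a worker count
# proportional to the outstanding work, scaling to zero when idle.

def desired_workers(pending_tasks, tasks_per_worker=10,
                    min_workers=0, max_workers=20):
    """Return a worker count sized to the backlog, clamped to limits."""
    if pending_tasks == 0:
        return min_workers
    needed = -(-pending_tasks // tasks_per_worker)  # ceiling division
    return max(min_workers, min(max_workers, needed))

print(desired_workers(0))     # idle cluster scales down to the minimum: 0
print(desired_workers(95))    # 95 tasks at 10 per worker -> 10
print(desired_workers(1000))  # capped at the maximum: 20
```

The cost trade-off is exactly the node startup latency mentioned above: the more aggressively you scale to zero, the more often a user waits for a fresh node.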


jreadey commented 6 years ago

Amazon Kubernetes (EKS) distributes containers across a set of instances provided by the account owner. So if the number of containers is highly variable (as is likely the case as users come and go) it's easy to end up with either an under-utilized or an over-committed cluster.

This project may be worth looking into: https://github.com/kubernetes/autoscaler. It supports GCE too!

mrocklin commented 6 years ago

Yeah, our use case is somewhat more complex than Kubernetes autoscalers due to the need to manage stateful pods. We're actually managing pods dynamically ourselves.

This isn't actually the kind of autoscaling we need though. We're more interested in autoscaling the nodes themselves. Unfortunately provisioning nodes takes significantly longer than deploying new pods.

rsignell-usgs commented 6 years ago

Does this mean that the pangeo framework would not benefit from EKS when deployed on AWS?

mrocklin commented 6 years ago

It would be fine. What I'm saying is that there is typically a minute or two to provision new nodes in an elastic cluster. These couple minutes can be annoying to users. That's the only issue I'm bringing up.

yuvipanda commented 6 years ago

Indeed, that is a problem both for dask and jupyterhub. I've filed https://github.com/kubernetes/autoscaler/issues/148 which should vastly improve the situation for us if it gets implemented, and am playing with workarounds in https://github.com/berkeley-dsep-infra/data8xhub/issues/7 until that gets implemented upstream. We also added the ability to pack nodes (rather than spread them) in JupyterHub chart (https://github.com/jupyterhub/zero-to-jupyterhub-k8s/pull/384) to make the situation easier...

robfatland commented 6 years ago

could someone define 'stateful pods' for me? it is an interesting term...


mrocklin commented 6 years ago

Our dask-worker pods have state that we care about (like intermediate data for an ongoing computation) so we manage them ourselves.


jreadey commented 6 years ago

Has anyone looked at this project: https://github.com/kubernetes/autoscaler?

It seems like the optimal thing for JupyterLab scaling would be to always keep some reserve capacity so that new containers can be launched quickly within an existing instance. When the reserve runs low, fire up a new instance. If there is excess capacity, consolidate containers and shut down an instance or two.
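That policy could be sketched roughly like this (the function name, thresholds, and slot counts below are all made up for illustration, not taken from any actual autoscaler):

```python
# Hypothetical reserve-capacity policy: keep headroom so new containers
# start instantly, add a node when the reserve runs low, and release a
# node once a whole node's worth of slack has accumulated.

def scale_decision(free_slots, reserve=4, slots_per_node=8):
    """Return +1 (add a node), -1 (remove a node), or 0 (hold steady)."""
    if free_slots < reserve:
        return +1   # reserve running low: fire up a new instance
    if free_slots >= reserve + slots_per_node:
        return -1   # excess capacity: consolidate and shut one down
    return 0

print(scale_decision(2))   # below reserve -> scale up: 1
print(scale_decision(6))   # within the comfortable band -> hold: 0
print(scale_decision(12))  # a full node of slack -> scale down: -1
```

The gap between the scale-up and scale-down thresholds is what prevents the cluster from thrashing as users come and go.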

yuvipanda commented 6 years ago

@jreadey indeed, that is the default upstream Node Autoscaler. It unfortunately only spins up a new node when your current cluster is 100% full, and new node creation can take minutes. If you see my previous reply, the issue I filed is in the same repo I linked to! The feature request is to add the concept of 'reserve capacity', which does not exist in that autoscaler yet. I also linked to one of our ongoing hack attempts to provide the concept of 'reserve capacity' until it gets added to upstream. That's really the only missing feature for it to be very useful for us, I think.

Hope that makes it a little clearer! Sorry for not providing more context in the previous comment!

jreadey commented 6 years ago

@yuvipanda - Sorry, I should have read through this issue first!
Anyway, it looks like the stars are aligning; I'll keep an eye on the issue you opened.

jreadey commented 6 years ago

Is the Pangeo team interested in utilizing the newly launched AWS EKS service: https://aws.amazon.com/eks/? Compared with a roll-your-own Kubernetes cluster, I imagine the EKS approach would involve less setup effort and provide a more stable environment.

Currently EKS is in preview though. I've applied to participate in the preview, but haven't been selected yet.

mrocklin commented 6 years ago

We're currently using the equivalent service at Google. I'm generally in favor of managed kubernetes systems.


amanda-tan commented 6 years ago

We managed to bring up the JupyterHub deployment on AWS by following the steps outlined in Z2JH for spinning up a K8s cluster on AWS using the Heptio template, and using the config here: https://github.com/pangeo-data/pangeo/tree/master/gce. The only modification was to config.yaml, taking out the GCS stuff. Pretty straightforward.

rabernat commented 6 years ago

Can someone give me an update on the status of Pangeo AWS deployment(s)?

jhamman commented 6 years ago

@rabernat - The UW eScience team (@atlehr and @robfatland) have deployed something very similar to the GCP deployment (see #95 and https://pangeo-aws.cloudmaven.org). I think @atlehr was planning to revisit a few pieces of the initial deployment but I'm not sure of her timetable. IIRC, their next steps were to do some FUSE stuff (e.g. mount s3://nasanex/) and experiment with Kubernetes Operations (kops) for autoscaling. They may also be using their deployment for an upcoming OceanHackweek.

rabernat commented 6 years ago

FYI, KubeCluster is not working for me on pangeo-aws...the workers never start. I think @tjcrone is having a similar problem as he tries to prepare some OceanHackweek OOI tutorials.

mrocklin commented 6 years ago

First, see http://daskernetes.readthedocs.io/en/latest/ for how to use it. It has changed since the original deployment of pangeo.pydata.org .

Second, it would be useful to get the logs of worker pods that are failing, either through the kubernetes dashboard, kubectl logs, or cluster.logs(cluster.pods()[0]) .


robfatland commented 6 years ago

This is a recorded announcement as I’m afraid we’re all out at the moment preparing for Cabled Array Hack Workshop. The commercial council of Magrathea thanks you for your esteemed visit, but regrets that the entire planet is closed for business. Thank you. If you would like to leave your name, and a planet where you can be contacted, kindly speak when you hear the tone [Beep]

robfatland commented 6 years ago

By which I mean -- as the reference is perhaps too obscure (my colleague points out) -- the pangeo-aws experiment is Amanda trying to get ahead of the JupyterHub curve. However, this is a 'stretch' effort on her part, currently unfunded. As noted, we're heads-down trying to get another JupyterHub up for the ocean hack workshop. So the update is that we'd still like to get kops going, as Joe pointed out, but this is on hold for the moment.

amanda-tan commented 6 years ago

The error was insufficient cpu/memory. The current setup we have using Heptio does not allow for autoscaling -- I think that's the problem; we have to set the number of nodes by hand, with a maximum of 20. Clarification question: are Kubernetes nodes the same as dask worker nodes?

mrocklin commented 6 years ago

Your cloud deployment has VMs of a certain size (maybe 4 cores and 16GB of RAM each). Kubernetes is running on each of these VMs. Both JupyterHub and Daskernetes will deploy Pods to run either a Jupyter server or a dask worker. These pods will hopefully have resource constraints, ideally somewhere below your VM resources. These pods will typically run a Docker container that runs the actual Jupyter or Dask process.

So no, Kubernetes nodes are not the same as dask worker nodes. The term node here is a bit ambiguous. You will want to ensure that your cloud-provisioned VMs have more resources than are required by your Kubernetes pods, for either Jupyter or Dask.
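To make that concrete, a worker pod spec might look something like the following -- the image name and numbers here are hypothetical, just sized to leave headroom on a 4-core / 16 GB VM:

```yaml
# Hypothetical pod spec: requests/limits deliberately below the VM size
# so the pod fits alongside system daemons on a 4-core / 16 GB node.
apiVersion: v1
kind: Pod
metadata:
  name: dask-worker-example
spec:
  containers:
    - name: dask-worker
      image: example/worker:latest   # placeholder image
      resources:
        requests:
          cpu: "1.75"
          memory: 7Gi
        limits:
          cpu: "1.75"
          memory: 7Gi
```

If the requests exceed what any single VM can offer, the pod will sit in Pending forever, which is exactly the "insufficient cpu/memory" symptom described above.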

rabernat commented 6 years ago

@robfatland: I hear you on your focus on the Ocean Hack Workshop. That was in fact what motivated me to look into this. You might want to reach out to Tim to clarify that in fact the hack week will be using a different jupyterhub deployment.

amanda-tan commented 6 years ago

@mrocklin is dask-config.yaml still being used?

mrocklin commented 6 years ago

Dask uses a file in ~/.dask/config.yaml but that probably isn't of relevance to you. An old branch of daskernetes that was used in the original deployment used a .daskernetes.yaml file that is no longer being respected in current versions. The current version optionally uses a user-specifiable yaml file for the worker template. See docs at http://daskernetes.readthedocs.io/en/latest/#quickstart
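For reference, that user-specifiable worker template is a pod-spec fragment along these lines (the image, args, and resource numbers are illustrative, not taken from the pangeo deployment):

```yaml
# Illustrative worker-template.yaml for daskernetes / dask-kubernetes;
# passed to KubeCluster to describe each worker pod.
kind: Pod
spec:
  restartPolicy: Never
  containers:
    - name: dask-worker
      image: daskdev/dask:latest
      args: [dask-worker, --nthreads, '2', --memory-limit, 6GB, --death-timeout, '60']
      resources:
        requests:
          cpu: "2"
          memory: 6G
        limits:
          cpu: "2"
          memory: 6G
```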

jreadey commented 6 years ago

FYI - we got some AWS credits last week, so we'll try out deploying jupyterhub to AWS as well.

amanda-tan commented 6 years ago

We have been getting an Error: Job failed: BackoffLimitExceeded error, and checking the pod status quickly, this is what I get: pull-all-nodes-1519248280-jupyter-1-6mdwl 0/1 Error 0 25m

Describing the pod status:

Name:           pull-all-nodes-1519250691-jupyter-1-pq2v5
Namespace:      pangeo
Node:           ip-10-0-17-250.us-west-2.compute.internal/10.0.17.250
Start Time:     Wed, 21 Feb 2018 14:06:14 -0800
Labels:         controller-uid=3dd21c9a-1753-11e8-a147-0626f0f13da4
                job-name=pull-all-nodes-1519250691-jupyter-1
Annotations:
Status:         Failed
IP:             192.168.72.89
Controlled By:  Job/pull-all-nodes-1519250691-jupyter-1

Does anyone have any pointers?

mrocklin commented 6 years ago

Nothing from me. If you're having trouble launching Jupyter pods then I recommend starting from something simple, like a vanilla zero-to-jupyterhub deployment, and then slowly adding things into the configuration until things break.


yuvipanda commented 6 years ago

@atlehr I think you're using an old version of the zero to jupyterhub guide. Can you try the latest version (0.6)?

amanda-tan commented 6 years ago

@yuvipanda We were using 0.6 already -- jupyter/base-notebook had some big changes a couple of days ago; reverting to an older version solved the problem. Re-testing to make sure that is really the case.

richiverse commented 6 years ago

@yuvipanda Will there be an EKS expansion of the z2jh guide? That can still be done while EKS is in preview mode.

Great job btw on the docs!

yuvipanda commented 6 years ago

@richiverse yeah, once Amazon lets us in on the preview, we plan on adding EKS to the list there! It might fully supplant the current quickstart based setup!

jreadey commented 6 years ago

@amanda-tan - I've been following the Heptio guide for setting up Kubernetes on AWS. I was unclear on how to set up an HTTPS public endpoint though. Did you get that working? Theoretically certs should be cheap and easy to manage on AWS using the AWS Certificate Manager.

Actually I see pangeo.pydata.org is not using https. Is this a security hole?

mrocklin commented 6 years ago

pangeo.pydata.org is in no way secure


jreadey commented 6 years ago

There are different aspects of security. E.g. we might be OK with other users' notebooks being visible, but likely not with users' passwords being intercepted. Is the latter a problem, or does OAuth2 mitigate that?

amanda-tan commented 6 years ago

@jreadey I used the automatic https setup outlined here: http://zero-to-jupyterhub-with-kubernetes.readthedocs.io/en/latest/security.html
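For anyone following along, that automatic HTTPS setup boils down to a few lines in the helm config.yaml -- the hostname and email below are placeholders, not our actual values:

```yaml
# z2jh automatic HTTPS via Let's Encrypt (placeholder hostname/email)
proxy:
  https:
    enabled: true
    hosts:
      - hub.example.org
    letsencrypt:
      contactEmail: admin@example.org
```

The chart then provisions and renews the certificate itself, so nothing needs to be installed on the load balancer.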

jreadey commented 6 years ago

@amanda-tan - did you use AWS ECR for the image store?
I'm using it for the user environment and the container seems to get the image ok, but there's a startup error: container_linux.go:247: starting container process caused "exec: \"jupyterhub-singleuser\": executable file not found in $PATH". Any idea what that could be about?

amanda-tan commented 6 years ago

@jreadey Are you building your own image instead of using the default/official Pangeo images hosted on DockerHub? @mrocklin might have a better idea, but if you're building your own docker image, have you tried using a tagged version of jupyter/base-notebook that corresponds to what Pangeo needs? The latest jupyter/base-notebook updates to start-singleuser.sh have been causing all kinds of weird errors for me.

jacobtomlinson commented 6 years ago

Just to chip in, our Pangeo deployment is running on AWS. We use kops to manage our cluster because EKS didn't exist when we set it up (and we still haven't managed to get access to the preview).

For SSL certificates we are using cert manager and external DNS.

For a base image we have our own which just adds some extra bits to the jupyter/scipy-notebook image.

jreadey commented 6 years ago

I was using repo2docker, but I'll try the Dockerfile approach you used.

jreadey commented 6 years ago

I built a docker image based on @jacobtomlinson's image. Got the same error with helm upgrade, but (for some reason) doing a helm delete & re-install worked!

jreadey commented 6 years ago

@jacobtomlinson - for my setup, I used a sub-domain off of hdfgroup.org and had our DNS configured to point to the AWS ELB dns name. Then I used the AWS Certificate Manager to generate a cert and installed it on the ELB. At first glance it looked fine - I could go to the hdfgroup dns name and sign in to jupyterhub. But it seems that any interaction with the kernel was broken - notebook commands would never complete (and I also couldn't run the JupyterLab terminal).

Am I missing something obvious in thinking my setup would work?

jacobtomlinson commented 6 years ago

My initial thought is that should work. Kernel interactions are done via websockets, so perhaps there is some ELB websockets issue going on?

jreadey commented 6 years ago

@yuvipanda - I have an AWS account that is signed up for the EKS preview. Let me know if you'd like to try it out - I can set you up with some temporary IAM credentials.

rsignell-usgs commented 6 years ago

@jacobtomlinson , we would love to get a true pangeo instance going with EKS (we just got approval) using our AWS allocation.

As a simple oceanographer (e.g. not @rabernat) I'm unsure of the effort involved with "Zero to Pangeo", but if it's something that could be done in an hour or so, would you be willing to have a short session with us via screen share, which we could record for the benefit of others?

Last year we made a recording of @yuvipanda demonstrating how to deploy JupyterHub with Kubernetes on Google Cloud as a special ESIP Tech Dive talk, and it was really great. It's been watched over 300 times!