pangeo-data / pangeo

Pangeo website + discussion of general issues related to the project.
http://pangeo.io

Educational events on Pangeo resources #440

Closed: mrocklin closed this issue 8 months ago

mrocklin commented 5 years ago

Hello educator!

We're pleased to hear that you're interested in using Pangeo's cloud deployments for an educational event. To make sure that things run smoothly we ask that you post the following information here before your event:

This helps us both by ensuring that the cluster is sufficiently large during your event (otherwise your students may not get as many cores as they expect) and by providing us information to give back to our funding agencies about how their resources benefit the public.

Edit:

For educators wishing to use this cluster, you may want to pre-allocate a bunch of VMs before your students arrive. This will make sure that VMs are around when they try to log on. Otherwise they might have to wait a few minutes while Google gives us machines.

Typically I do this by logging into a Jupyter notebook, and then allocating a fairly large cluster. To do this I need to overwrite the default maximum number of allowed workers.

import dask
from dask_kubernetes import KubeCluster

dask.config.set({'kubernetes.count.max': 1000})  # raise the default cap on allowed workers
cluster = KubeCluster()
cluster.scale(1000)
# wait a bit until the workers arrive
cluster.scale(0)  # release the pods back to the wild, the VMs should stick around for a bit

This forces the worker node pool to grow, and those workers stick around for a bit. It may take a while for the cloud to give us enough machines, so I would do this at least 30 minutes before the tutorial starts, and possibly an hour before. You can track progress by watching the IPython widget output of KubeCluster, which should update live.
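If you prefer a number to the widget, here is a minimal sketch that polls the scheduler for the current worker count (it assumes a distributed Client connected to the KubeCluster above; the target and the 10-second polling interval are arbitrary):

from dask.distributed import Client
import time

client = Client(cluster)          # connect to the KubeCluster created above
target = 1000                     # however many workers you asked for
while len(client.scheduler_info()['workers']) < target:
    time.sleep(10)                # re-check every 10 seconds
print('workers ready:', len(client.scheduler_info()['workers']))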

You definitely want to release the pods back to the wild before the tutorial starts, but not too soon, otherwise the cloud provider will clean up the VMs. Maybe run scale(0) a minute before things start (but in practice you should have a 10-20 minute grace period here).

mrocklin commented 5 years ago

There may already be a tool that helps to visualize what GCP or Kubernetes are doing? It might be worth raising that as a separate issue.

On Wed, May 8, 2019 at 9:52 AM Ryan Abernathey notifications@github.com wrote:

I think having some visualization of what the cluster is doing would go a long way towards alleviating user / instructor anxiety during these tutorials. I'm talking about a basic visual representation of the node / pod information that kubectl can provide. Ideally this could be a tab in Jupyterlab, much like the dask extension.

This is motivated by our experience with the dask dashboard. Users are happy to wait for things if they have feedback about what the computers are doing. But just waiting with no info / progress causes anxiety and confusion.


rabernat commented 5 years ago

See https://discourse.jupyter.org/t/lab-extension-for-monitoring-kubernetes-cluster/1026

robfatland commented 5 years ago

@mrocklin @rabernat @guillaumeeb and other pros: I have softball noob questions on the pre-session worker pool spin-up procedure:

Let's suppose I have 25 participants, each of whom will allocate 20 workers by default from Ryan's excellent sea surface notebook, so I need 500 VMs. Suppose my talk is at 2pm but I won't get to the pangeo demo slide until 2:45...

Reprinted without permission:

dask.config.set({'kubernetes.count.max': 1000})
cluster.scale(1000)
# wait a bit until they arrive
cluster.scale(0) 
# release the pods back to the wild, the VMs should stick around for a bit

mrocklin commented 5 years ago

First, you mean 500 pods, not VMs. The VMs given to us by the cloud service provider will be quite large.

When I log in to 'a Jupyter notebook': do you mean specifically the pangeo binder JupyterLab?

No preference between notebook and lab

dask.config and cluster.scale methods can be run from a notebook cell?

Yes, from any python environment

I don't need any particular authentication for this? i.e. anybody could walk in off the street?

Correct. Acknowledged that this is concerning :) Currently Pangeo's binder deployment is entirely open access.

Roughly when do I run these two lines?

Answered up top in a recent edit

How do I know I have waited long enough before running cluster.scale(0)?

Same as above

How long should those 1000 VMs persist for, ballpark?

Same as above

robfatland commented 5 years ago

While practicing, I wound up running...

import dask
from dask.distributed import Client, progress
from dask_kubernetes import KubeCluster

dask.config.set({'kubernetes.count.max': 1000})  # raise the default cap on allowed workers

cluster = KubeCluster(n_workers=20)
cluster  # display the cluster widget in the notebook

cluster.scale(1000)  # pre-warm: request 1000 workers so the node pool grows

This generated a big JSON dump; so far so good, but my execution was a bit too early so the demo was rife with pauses. No worries; I'll do some tests and edit this further with results. The upside is that the demo came across as intended.

NickMortimer commented 5 years ago

Hi

I'm doing a training session on Pangeo in Hobart on Monday the 13th. I'll have about 35 people and it would be good to use binder.pangeo.io; I just wanted to check that would be ok.

Data Science Community of Practice CSIRO Oceans and Atmosphere (Hobart UTC + 10) May 13th to 15th 8am to 5pm

https://github.com/jmunroe/pangeo-tutorial-c3dis-2019

30 people, 5 workers each, approx. 300 cores, 4 GB per core?
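A quick back-of-the-envelope check of those numbers, assuming 2 cores per worker (the per-worker core count isn't stated above, so that figure is an assumption):

people = 30
workers_per_person = 5
cores_per_worker = 2            # assumed; implied by the ~300-core figure
memory_per_core_gb = 4

workers = people * workers_per_person     # 150 workers
cores = workers * cores_per_worker        # 300 cores
memory_gb = cores * memory_per_core_gb    # 1200 GB across the cluster
print(workers, cores, memory_gb)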

robfatland commented 5 years ago

@NickMortimer as noted above, I failed to pre-heat the system for a talk. I have a self-action-item to retry this and understand the process better, but the main point: please see Matt Rocklin's Edit in the very top entry, and I strongly advocate practicing the pre-heat-and-run procedure in advance of the session.

robfatland commented 5 years ago

I'm possibly hosting 24 early-career oceanographers next week Tues - Friday. With @scottyhq 's assistance I set up a separate GitHub org called escience-pangeo where we can whitelist participants on an as-needed basis to the AWS-hosted pangeo JupyterHub. (Pls advise if we are missing any issues; the usage should not be too onerous / expensive.)

I would like to revisit Ryan's sea level notebook, which operates on a 74 GB dataset. (The source data may be larger, but the sea level anomaly piece is 74 GB.) It currently runs on binder on Google Cloud and the data is in the GCP pangeo-data bucket, so there's no egress gotcha; but using the pangeo AWS JupyterHub as Scott suggests will incur about $7 of egress per run, give or take.

We would save that fee by creating an S3 copy of the same data. Then the API call has to be modified in the notebook. @rabernat please advise when you have a sec.
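For reference, a minimal sketch of the kind of change involved, assuming the dataset is a Zarr store and that the S3 copy lands in a bucket like pangeo-data-useast1 (both the exact object path and the bucket name here are assumptions, not the notebook's actual call):

import fsspec
import xarray as xr

# current: open the Zarr store from the GCP bucket
ds_gcs = xr.open_zarr(fsspec.get_mapper(
    'gs://pangeo-data/dataset-duacs-rep-global-merged-allsat-phy-l4-v3-alt'))

# after copying: point the same call at the S3 bucket instead
ds_s3 = xr.open_zarr(fsspec.get_mapper(
    's3://pangeo-data-useast1/dataset-duacs-rep-global-merged-allsat-phy-l4-v3-alt',
    anon=True))  # anonymous access, assuming the bucket is public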

rsignell-usgs commented 5 years ago

@robfatland are you using the AWS JupyterHub running in region us-west-2 or us-east-1?

I'd also like to have a copy of this data in AWS, if someone hasn't copied it already. I'm ready to sync a copy of the sea level data (258 GB) from GCS to AWS S3 with this command, if somebody from pangeo-access gives the okay!

rclone sync gcloud:pangeo-data/dataset-duacs-rep-global-merged-allsat-phy-l4-v3-alt  aws-west:gcloud:pangeo-data/dataset-duacs-rep-global-merged-allsat-phy-l4-v3-alt --checksum --fast-list --transfers 16

robfatland commented 5 years ago

Hi @rsignell-usgs Scott and I both say "ok". Further: Scott notes it is us-east-1 (nasa.pangeo.io) and there is a corresponding regional bucket pangeo-data-useast1 to use as the destination. Egress is actually $0.12/GB, so for 258 GB we're looking at about $31 (258 × $0.12 ≈ $31) on the GCP account.

pbranson commented 5 years ago

To avoid incurring the egress cost I have downloaded the altimetry data from CMEMS and was going to convert to zarr, but it occurred to me that possibly @rabernat may have done some additional preparation steps beyond a simple conversion?

Given that future events are likely to be run in various locations, I wonder if it's possible to develop some data curation scripts for the freely available datasets? (i.e. the sea level data is from marine.copernicus.eu I think)

These would likely be useful examples in their own right for preparing/chunking zarr stores

Then people could prepare the data on available resources and distribute to the relevant buckets/regions.
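As a starting point, a minimal sketch of such a conversion, assuming a set of locally downloaded CMEMS NetCDF files (the file pattern, dimension names, and chunk sizes below are placeholders, not the actual preparation @rabernat used):

import xarray as xr

# open the downloaded NetCDF files as one dataset
ds = xr.open_mfdataset('altimetry/dt_global_*.nc', combine='by_coords')

# rechunk for cloud access: reasonably large contiguous blocks along time
ds = ds.chunk({'time': 30, 'latitude': 720, 'longitude': 1440})

# write a consolidated Zarr store, ready to upload to a bucket
ds.to_zarr('sea-level-anomaly.zarr', consolidated=True)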


guillaumeeb commented 5 years ago

Hey, I'm planning to do a Pangeo tutorial at CNES (with CNES people and collaborators like Mercator).

rsignell-usgs commented 5 years ago

I'm planning to do a Pangeo live demo today between 1:30-2:30pm EDT at the AWS Public Sector Summit here in DC.

I'm going to use the pangeo binder on AWS us-west-2: https://aws-uswest2-binder.pangeo.io/v2/gh/reproducible-notebooks/hurricane-ike-water-levels/master?filepath=hurricane_ike_water_levels.ipynb, with 30 workers (60 cores).

I'm not going to encourage the attendees to try it in real time, but tell them they can try it later.

scottyhq commented 5 years ago

We're planning a Pangeo interactive tutorial Tuesday July 16 11:45 - 2:15 PST at the ESIP Summer meeting: https://sched.co/PtOj

We'll be using the ESIP landsat demo on AWS us-west-2: https://github.com/scottyhq/esip-tech-dive/tree/aws-binder

And possibly the pangeo tutorial materials on the Google Cloud Binder: https://github.com/pangeo-data/pangeo-tutorial

Anticipating about 20 participants, adaptive scaling with 1-5 workers per person.
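For reference, adaptive scaling along those lines can be requested from the notebook with something like the following (a minimal sketch; it assumes each participant creates their own KubeCluster, and the 1-5 bounds are simply the range mentioned above):

from dask_kubernetes import KubeCluster
from dask.distributed import Client

cluster = KubeCluster()
cluster.adapt(minimum=1, maximum=5)  # let Dask scale between 1 and 5 workers as load changes
client = Client(cluster)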

djhoese commented 5 years ago

I'm teaching a Satpy tutorial at SciPy 2019 on Monday July 8th from 8am to 12pm with a focus on everyone running everything locally. However, some users weren't so sure if their machines met my preferred/minimum requirements (8GB RAM, 4 cores) so I'm considering suggesting pangeo binder for the 0-3 people who may need it. The repository that will be loaded is here.

NOTE: I haven't actually tested if all my examples will run on the binder without hitting memory limits.

djhoese commented 5 years ago

I'm teaching a small 4-hour session to <15 high schoolers as an introduction to Python and Satpy on Thursday, August 8th during a day camp. This is probably the lightest workload I'll ever put on pangeo's binder, but since it runs Satpy and I may want to demonstrate other Pangeo examples, I'd like to run it there.

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

jakirkham commented 4 years ago

Guessing this shouldn’t be marked as stale.

jacobtomlinson commented 4 years ago

I've added the pinned label which the stale bot should ignore.

cgentemann commented 4 years ago

willirath commented 4 years ago

dshean commented 4 years ago

March 13, 2020, 2:30-5:30 PM PDT (approximately 1 hour from now!)
UW Geospatial Data Analysis course https://github.com/UW-GDA (course material to be released in the coming month)
14 undergrad/grad students

This is our last class of the quarter. We've mostly focused on smaller datasets/problems thus far, as I'm emphasizing basic approaches and concepts. But I'd like to at least expose them to scaling with dask, as some may need it for their final projects and future research.

I'm planning to have all students launch pangeo binder on aws us-west-2 and work through some of the sample notebooks. I'll probably have the students stick with local clusters for now, but I may try to do a demo with kubecluster for the LS8 notebook.
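For anyone following along, a minimal sketch of the two modes from the notebook side (the worker counts are arbitrary placeholders, not the course settings):

from dask.distributed import Client, LocalCluster
from dask_kubernetes import KubeCluster

# option 1: local cluster, using only the cores of the notebook server
local = LocalCluster(n_workers=4)
client = Client(local)

# option 2: KubeCluster, asking Kubernetes for dedicated worker pods
cluster = KubeCluster(n_workers=10)
client = Client(cluster)

Either way, the Client interface the students use afterwards is the same.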

I wish I had seen these AGU 2019 materials sooner, they are excellent! Thanks to @scottyhq for last-minute advice. Will plan to connect and integrate earlier next year.

mktippett commented 3 years ago

Just now it took 21 minutes to start. Maybe I'm doing it wrong.

Thanks!

betolink commented 3 years ago

We are presenting a data tutorial at AGU and will be using Pangeo's BinderHub.

No Dask clusters; the users will be working on small data access and subsetting capabilities (CMR, Harmony, EGI, AWS S3).

jhamman commented 8 months ago

This was a great issue, but it is no longer how AGU resources are being coordinated. Thanks to all who participated here.