pangeo-data / pangeo

Pangeo website + discussion of general issues related to the project.
http://pangeo.io

Educational events on Pangeo resources #440

Closed mrocklin closed 7 months ago

mrocklin commented 5 years ago

Hello educator!

We're pleased to hear that you're interested in using Pangeo's cloud deployments for an educational event. To make sure that things run smoothly we ask that you post the following information here before your event:
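
  • Date and time
  • Information about the event
  • Link to materials
  • Number of attendees and resources per attendee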

This helps us both by ensuring that the cluster is sufficiently large during your event (otherwise your students may not get as many cores as they expect) and by providing us information to give back to our funding agencies about how their resources benefit the public.

Edit

For educators wishing to use this cluster, you may want to pre-allocate a bunch of VMs before your students arrive. This will make sure that VMs are around when they try to log on. Otherwise they might have to wait a few minutes while Google gives us machines.

Typically I do this by logging into a Jupyter notebook and then allocating a fairly large cluster. To do this I need to override the default maximum number of allowed workers.

import dask

dask.config.set({'kubernetes.count.max': 1000})  # override the default worker cap
cluster.scale(1000)  # `cluster` is the KubeCluster created earlier in the notebook
# wait a bit until they arrive
cluster.scale(0)  # release the pods back to the wild, the VMs should stick around for a bit

This forces the worker node pool to grow, and then those workers stick around for a bit. It may take a while for the cloud to give us enough machines. I would do this at least 30 minutes before the tutorial starts, and possibly an hour before. You can track progress by watching the IPython widget output of KubeCluster, which should update live.

You definitely want to release the pods back to the wild before the tutorial starts, but not too soon before, otherwise the cloud provider will clean up the VMs. Maybe run scale(0) a minute before things start off (in practice you should have a 10-20 minute grace period here).

mrocklin commented 5 years ago

I used binder.pangeo.io to present a tutorial at PyData NYC on 2018-10-22. My apologies for including this late. Requested information:

mrocklin commented 5 years ago

I would like to use binder.pangeo.io to present a Dask tutorial at PyData DC on 2018-11-16. Requested information:

xhochy commented 5 years ago

I would like to use binder.pangeo.io to present a Dask tutorial at PyCon.DE on 2018-10-24. Requested information:

rabernat commented 5 years ago

Thanks for doing this Matt!

TomAugspurger commented 5 years ago

I would like to use binder.pangeo.io to present a Dask tutorial at ODSC West on 2018-11-01. Requested information:

I'm also happy to set up the infrastructure separately from the Pangeo binder. I'd mostly use the Pangeo binder if you want additional testers of the binder infrastructure to work out issues.

mrocklin commented 5 years ago

During @xhochy 's tutorial we again ran into resourcing problems. This time the worker node pool did not expand to its full capacity. I'm not sure why. To resolve this I allocated all of the non-preemptible nodes I could (around 500 cores) for the duration of the tutorial.

There was also some concern when first starting up notebooks and clusters. It's awkward to have the notebook sit idle waiting for workers for a few minutes while VMs start up. Educators might want to ask for many VMs just before class by creating a very large cluster (for example a cluster with 1000 workers). We intentionally make this difficult to do by adding a limit to dask-cluster size. You can override this limit with the following code:

import dask  # needed for dask.config

with dask.config.set({'kubernetes.count.max': 1000}):
    cluster.scale(1000)
TomAugspurger commented 5 years ago

@rabernat, @mrocklin would you appreciate additional beta testers of this setup? If so I'll start to ensure my materials for the tutorial next week work on the binder infrastructure.

The tutorial is actually titled "Cloud Native Data Science with Dask", so I plan to spend about 20 minutes walking through how the clusters were actually deployed for the attendees.

jhamman commented 5 years ago

@TomAugspurger - Please go ahead as planned and use binder.pangeo.io. We're still learning how our binderhub handles these events, so if you don't mind being a beta tester, we'd appreciate the pressure. I think @dsludwig and I will be working a bit on the deployment over the next week, but we'll keep things in a stable state for the day of your tutorial.

rabernat commented 5 years ago

I have an event coming up

mrocklin commented 5 years ago

@TomAugspurger , how did things go?

TomAugspurger commented 5 years ago

Pretty much perfectly. There were about 40 people, and the Dask workers came up instantly for them.

I didn't watch to see if / how scaling down went.

mrocklin commented 5 years ago

For background, just before this workshop happened I logged in and scaled up to 1000 workers, and then back down again. To do this I had to break the built-in worker limit by setting a config parameter

import dask

dask.config.set({'kubernetes.count.max': 1000})  # break the built-in worker limit
cluster.scale(1000)
# wait a bit
cluster.scale(0)

This forces the worker node pool to grow, and then those workers stick around for a bit.

mrocklin commented 5 years ago

I also had to force things to scale down manually. We still have the problem of fluentd, prometheus, and some other small pods keeping the worker-pool nodes awake.

mmccarty commented 5 years ago

I would like to use binder.pangeo.io to present a Dask tutorial at Capital One: C4ML on 12/13/18. Requested information:

lesteve commented 5 years ago

I would like to use binder.pangeo.io to present a Dask tutorial at PyParis 2018 on 2018-11-15. Requested information:

mrocklin commented 5 years ago

Thanks for the information @lesteve . I recommend that you follow the procedure in https://github.com/pangeo-data/pangeo/issues/440#issuecomment-435213971 to ensure that there are some VMs provisioned when your students arrive. I'll make sure that things spin down afterwards.

jcrist commented 5 years ago

We (@martindurant and myself) would like to use binder.pangeo.io to present a Dask tutorial at PyData DC on Friday (Nov 16th) from 11:00 am - 12:30 pm EST. We plan to use these materials: https://github.com/mrocklin/pydata-nyc-2018-tutorial. I am unsure of the number of attendees; I have not been provided this information.

mmccarty commented 5 years ago

@jcrist @martindurant Just to give you a ballpark estimate: the PyData DC tutorials are sold out at 150, and there is only one other tutorial going on at that time. I would plan for around 75.

lesteve commented 5 years ago

Just a quick bit of feedback after the Dask tutorial yesterday at PyParis. Running the Dask tutorial through the Pangeo binder setup went flawlessly; that was really impressive!

Following https://github.com/pangeo-data/pangeo/issues/440#issuecomment-435213971:

There were around 40 people and they were able to get their 20 workers instantly during the tutorial.

guillaumeeb commented 5 years ago

I plan to use binder.pangeo.io to present a Dask tutorial at http://www.irt-saintexupery.com/. Requested information:

jhamman commented 5 years ago

@rabernat, @scottyhq and I will be giving a Pangeo tutorial on 12/12 at the 2018 AGU Fall Meeting. (xref: #468).

darothen commented 5 years ago

I'll be giving a short (30-minute) tutorial during the AMS meeting. I don't expect very many (read: really, anyone) people to follow along, but it's still possible, so I want to record things here:

mrocklin commented 5 years ago

Users should feel free to use as many cores as they like (I think we cap them at something like 50 or 100). With that small an audience I wouldn't worry too much.

You may want to go through the procedure mentioned above where you allocate a large cluster just before class, just to make sure that there are some workers around. If you don't do this then users may have to wait a few minutes before things spin up, but that should be ok too if you inform them that we're waiting for a few VMs to show up from Google.

On Sat, Jan 5, 2019 at 6:47 PM Daniel Rothenberg notifications@github.com wrote:

I'll be giving a short (30-minute) tutorial during the AMS meeting. I don't expect very many (read: really, anyone) people to follow along, but it's still possible, so I want to record things here:

  • Date and time: January 7, 2019 at 3-3:30 PM MDT
  • Information about the event: I will give a short live demo of scaling an analysis from a test dataset on my laptop to running on a larger one (~25 GB) on the cloud
  • Link to materials: TBD, plan on posting tomorrow
  • Number of attendees and resources per attendee: I expect there to be ~25-50 people in the audience, and possibly 10 people who actively try to follow along. I will ask people not to use more than 10 cores.


rabernat commented 5 years ago

FYI I will be doing an impromptu demo at Oxford in a few hours using Pangeo binder. Expect some traffic on the cluster, not sure how much.

jakirkham commented 5 years ago

I would like to use binder.pangeo.io to present a Dask tutorial for the Advanced Scientific Programming in Python, Asia-Pacific Summer School.

Requested information:

mrocklin commented 5 years ago

Cool. @jakirkham are you comfortable with the process in https://github.com/pangeo-data/pangeo/issues/440#issuecomment-435213971 ?

Also, feel free to have them use more than just a few cores. We're happy to spend our free compute credits for education and evangelism.

jakirkham commented 5 years ago

Was just looking at that. SGTM. Thanks. Should we add it to the OP?

Great, thanks @mrocklin. I think the students will find this really cool. :)

mrocklin commented 5 years ago

> Was just looking at that. SGTM. Thanks. Should we add it to the OP?

Good idea. Done!

guillaumeeb commented 5 years ago

I'd like to use binder.pangeo.io to present a Dask tutorial for the Observatoire Midi-Pyrénées lab.

jcrist commented 5 years ago

I'd like to use binder.pangeo.io to present a Dask tutorial on April 3rd from 1:30 to 5:00 PM Central time at AnacondaCon. There will be ~100 attendees. I'm not quite done with materials yet, but these will be some combination of our PyCon tutorial (https://github.com/TomAugspurger/dask-tutorial-pycon-2018) and the PyData NYC tutorial (https://github.com/mrocklin/pydata-nyc-2018-tutorial). I'll update this comment with a link when the materials are finished.

Edit: materials are here: https://github.com/jcrist/anacondacon-2019-tutorial

rabernat commented 5 years ago

@jcrist - 100 attendees each with their own KubeCluster could get pretty big! We have recently become more conscious of our cloud burn rate, which was unsustainably high for a while.

Please go ahead with your tutorial. We want this resource to be used, especially for educational purposes. Just try to be conscientious about scale when you have 100 simultaneous users.

jcrist commented 5 years ago

Thanks @rabernat. I've scaled down the cluster size to a max of 10 workers each (the default for previous tutorials was 20), with fewer for simpler notebooks, to try to combat this. I'm willing to go smaller to help conserve resources; I don't want to strain the cloud resources of such a useful community project.
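
For reference, one way a tutorial notebook could cap its cluster along these lines with dask-kubernetes (a minimal sketch; the worker-spec.yml file name is illustrative):

from dask_kubernetes import KubeCluster

# Hypothetical tutorial setup: adapt between zero and ten workers
# instead of scaling straight to the previous default of twenty.
cluster = KubeCluster.from_yaml('worker-spec.yml')  # worker-spec.yml is illustrative
cluster.adapt(minimum=0, maximum=10)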

mrocklin commented 5 years ago

FWIW I think that these tutorials are valuable for drumming up interest, and my guess (though not very well informed at the moment) is that they're a small cost relative to general use, particularly because they're one-off rather than continuous. I don't know though.

rabernat commented 5 years ago

To be clear, I am 100% 👍 on the tutorial. I agree they have very high value.

It might be useful to use this as an opportunity to figure out how much these tutorials cost. People frequently ask me that, and I don't have an answer beyond "not very much."

Dask workers go into a nodepool with n1-highmem-32 preemptible instances. These have 32 vCPUs and 208 GB of memory. They cost $0.40 per hour. We could make a back-of-the-envelope estimate and then verify from the logs after the tutorial.
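
A rough sketch of that back-of-the-envelope estimate (the attendee count, workers per user, and tutorial length below are assumptions, not measured numbers):

# All inputs here are assumptions for illustration.
attendees = 100              # simultaneous users
workers_per_user = 10        # per-user cluster cap
cores_per_worker = 1         # assume one vCPU per Dask worker
cores_per_node = 32          # n1-highmem-32
node_price_per_hour = 0.40   # preemptible price quoted above, in USD
tutorial_hours = 3.5

total_cores = attendees * workers_per_user * cores_per_worker
nodes = -(-total_cores // cores_per_node)  # ceiling division
cost = nodes * node_price_per_hour * tutorial_hours
print(f"~{nodes} nodes -> roughly ${cost:.0f} for the tutorial")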

rabernat commented 5 years ago

btw, @jcrist - you might want to pop in on https://github.com/pangeo-data/pangeo-binder/issues/37 - @guillaumeeb is reporting that users at his tutorial are losing their notebooks. Could be because we are using preemptible node pools.

This could affect your tutorial tomorrow.

guillaumeeb commented 5 years ago

Hi everyone, my tutorial today went pretty well, with an audience ranging from governmental agencies (CNES, Ifremer), to the space industry (CLS, which works on altimetry products), to labs (CESBIO for spatial imagery, Legos for ocean science, GET for earth science, ...).

I've encountered some small issues:

I think these tutorial repositories are great, and we should find a way to maintain them. This is probably linked to #575. I'm talking about https://github.com/pangeo-data/pangeo-tutorial-agu-2018 and https://github.com/mrocklin/pydata-nyc-2018-tutorial.

rsignell-usgs commented 5 years ago

@guillaumeeb would it be worth adding the mrocklin pydata tutorial to the list of tutorials at https://github.com/pangeo-data/awesome-open-climate-science#tutorials or was that mostly eclipsed by the AGU tutorial?

We don't want to bombard people with too many similar tutorials (or worse, just old versions of basically the same tutorial), but if it takes a different or complementary approach, that should be summarized and included in the awesome list, right?

guillaumeeb commented 5 years ago

@rsignell-usgs just added it yesterday to https://github.com/pangeo-data/education-material. However, I don't think it is appropriate for a climate science use cases list.

But maybe these two lists are overlapping...

And I fully support the fact that we don't want to have too many similar tutorials, but I think we currently do.

mrocklin commented 5 years ago

It looks like binderhub has a config option that limits the number of users per repository.

binderhub:
  config:
    per_repo_quota: 100

Larger tutorials might run into this. We might want to change this number (or not) at some point (though preferably not within the next few hours).

jhamman commented 5 years ago

@mrocklin - I'd be fine increasing this limit. If you or @jcrist can open an issue on pangeo-binder, we can discuss more there.

jacobtomlinson commented 5 years ago

I'd like to use the binder deployment for two tutorials on the 1st and 2nd of May (a weekend). Each will likely run around 10am GMT for a couple of hours, to an audience of 10-20 people. I intend to use the material @jcrist prepared for AnacondaCon.

rsignell-usgs commented 5 years ago

Thanks for the tip to: https://github.com/jcrist/anacondacon-2019-tutorial

robfatland commented 5 years ago

I'd like to use binder.pangeo.io to present a tutorial at Northwest Data Science Summit.

jhamman commented 5 years ago

@robfatland - sounds good. See @mrocklin's comments earlier in this thread for some tips on helping scale up/down your cluster efficiently.

jmunroe commented 5 years ago

I'm doing a day-long workshop / training on Pangeo as part of C3DIS (Canberra). I've got a Pangeo / BinderHub deployment set up on AWS but I would like to reserve binder.pangeo.io as a fallback in case my AWS deployment turns out to not scale.

  • Date and time: 2019-05-09 between 9 am and 5 pm AEST (after @robfatland training!)
  • Information about the event: Pangeo Tutorial at C3DIS, Canberra
  • Link to materials: We plan to use these materials: https://github.com/jmunroe/pangeo-tutorial-c3dis-2019 (rebranded from AGU 2018 tutorials)
  • Number of attendees and resources per attendee: 25 participants x 20 workers each.

robfatland commented 5 years ago

As noted also on Slack: the practice run is hitting a snag; here are the details:

Ryan's sea level notebook running in binder is hanging fire on the cell below "Visually Examine Some Of The Data". The prior "Initialize Dataset" cell ran fine. The KubeCluster cell runs but gives this error:

/srv/conda/lib/python3.6/site-packages/distributed/bokeh/core.py:57: UserWarning:
Port 8787 is already in use.
Perhaps you already have a cluster running?
Hosting the diagnostics dashboard on a random port instead.
  warnings.warn('\n' + msg)

The dask task stream remains empty. Shutting down and trying again produces this error from the same (cluster) cell:

/srv/conda/lib/python3.6/site-packages/dask_kubernetes/config.py:13: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
  defaults = yaml.load(f)

As before, that "sanity check" cell hangs.

Should I be setting something in the KubeCluster widget? I tried manually scaling to 7 workers, to no avail.

guillaumeeb commented 5 years ago

@robfatland, your output traces are just warnings, not errors; they should not prevent the notebooks from working.

The first one says the diagnostics dashboard has started on another port because the default one was already in use (which is to be expected during a tutorial). Have you tried opening the dashboard in another window?

The second is just a deprecation warning and should have no impact.

Does the cluster widget show you allocated cores?
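
For instance, a minimal way to print the relocated dashboard URL from the notebook (a sketch; it assumes the KubeCluster object is named cluster):

# Print the dashboard URL, whichever port it ended up on.
print(cluster.dashboard_link)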

robfatland commented 5 years ago

Tried again in an anonymous browser. This is working. Initially I got workers: 0, cores: 0, memory: 0. Thinking this was incorrect, I set manual scaling to 10, but later realized this is an unnecessary step if one is happy with the default 20 workers. Anyway, now the first user gets a pause while things fire up and then everything goes. My second user seems to fire up faster.

And by the way this is just friggin' awesome.

mrocklin commented 5 years ago

@robfatland I recommend priming the cluster with some VMs. You may want to read the edit in the top post of this issue.

rabernat commented 5 years ago

I think having some visualization of what the cluster is doing would go a long way towards alleviating user / instructor anxiety during these tutorials. I'm talking about a basic visual representation of the node / pod information that kubectl can provide. Ideally this could be a tab in Jupyterlab, much like the dask extension.

This is motivated by our experience with the dask dashboard. Users are happy to wait for things if they have feedback about what the computers are doing. But just waiting with no info / progress causes anxiety and confusion.
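
As a starting point, here is a minimal sketch of the kind of per-node summary such an extension could surface, using the official kubernetes Python client (it assumes kubeconfig or in-cluster credentials and is not part of any existing extension):

from collections import Counter

from kubernetes import client, config

# Count running pods per node, roughly what `kubectl get pods -o wide` shows.
config.load_kube_config()  # use config.load_incluster_config() when running inside a pod
v1 = client.CoreV1Api()

pods_per_node = Counter(
    pod.spec.node_name
    for pod in v1.list_pod_for_all_namespaces().items
    if pod.spec.node_name
)

for node in v1.list_node().items:
    name = node.metadata.name
    print(f"{name}: {pods_per_node.get(name, 0)} pods")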