pangeo-data / pangeo

Pangeo website + discussion of general issues related to the project.
http://pangeo.io

My experience with Pangeo thus far #152

Closed rsignell-usgs closed 6 years ago

rsignell-usgs commented 6 years ago

@mrocklin asked me to comment on my experience with Pangeo thus far, so here goes...

Here's why I like using pangeo:

It's a great environment for exploring scalable, data-proximate analysis and cloud-based storage issues.

Most of our modelers run simulations remotely on HPC and then drag the data back to their desktops for analysis. The researcher two doors down from me is simulating tides in an estuary, and so far has produced 90 runs. Each run is 150GB and takes 8 hours to download over the USGS secure network. We clearly need remote analysis capabilities!
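To put rough numbers on that, here is a quick back-of-the-envelope sketch. It assumes decimal units (1 GB = 1e9 bytes) and that the runs are downloaded one after another; neither assumption is stated above, so treat the results as approximate.

```python
# Figures taken from the paragraph above.
N_RUNS = 90
GB_PER_RUN = 150
HOURS_PER_RUN = 8

# Assumption: decimal units and strictly sequential downloads.
total_tb = N_RUNS * GB_PER_RUN / 1000                          # total data volume
total_days = N_RUNS * HOURS_PER_RUN / 24                       # total download time
mbit_per_s = GB_PER_RUN * 8 * 1000 / (HOURS_PER_RUN * 3600)    # implied throughput

print(f"{total_tb} TB, {total_days} days, {mbit_per_s:.1f} Mbit/s")
```

Under those assumptions, pulling every run back to a desktop means about 13.5 TB and roughly a month of continuous downloading, at an effective rate of around 42 Mbit/s.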

Pangeo.pydata.org provides a great environment in which to explore new workflows using Dask clusters to perform scalable analysis next to the data. I love that I can jump on there with my GitHub account, and everything is set up and ready to go, while we continue to work on getting Dask and JupyterHub going with Kubernetes on XSEDE Jetstream. It's also been great for exploring different data storage approaches like Zarr and HSDS.

Another useful feature is that pangeo.pydata.org is on a fat pipe, so pulling data from remote data services is fast. My colleague @ocefpaf in Brazil was trying to run a notebook that downloads a fair amount of data, and it wasn't working well with the bandwidth he had. He moved the notebook to pangeo.pydata.org and got it going in short order. Here's a snapshot: 2018-03-06

In addition to the cloud instance and cloud issues, the pangeo guide for data-proximate, scalable analysis on HPC was fantastic. The instructions for setting up the Python environment, submitting a Dask cluster job, connecting with Jupyter, and then accessing the client in a local browser via SSH tunneling worked perfectly on the USGS HPC. The only change was the job control script, because we use Slurm instead of PBS. Here's a snapshot from a modified version of this notebook by @jhamman, which we used to compute the maximum wave height at each grid cell over the duration of a Hurricane Sandy simulation: 2018-03-08_7-43-09
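For readers who haven't seen this kind of reduction, a minimal sketch of "maximum over time at each grid cell" with xarray looks roughly like the following. It is not the notebook itself: the data are synthetic, and the variable name `hs` and the dimension names `time`, `y`, `x` are assumptions made for illustration.

```python
import numpy as np
import xarray as xr

# Synthetic stand-in for model output: wave height on a (time, y, x) grid.
# The name "hs" and the dimension names are hypothetical, not taken from
# the original notebook.
rng = np.random.default_rng(0)
hs = xr.DataArray(
    rng.random((48, 4, 5)) * 10.0,
    dims=("time", "y", "x"),
    name="hs",
)

# Reduce over the time dimension only, leaving one maximum per grid cell.
hs_max = hs.max(dim="time")

print(hs_max.dims, hs_max.shape)
```

On real model output, one would typically open the files with `xr.open_mfdataset` and a `chunks` argument, so that the same `.max(dim="time")` call is evaluated lazily by Dask across the cluster rather than in local memory.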

I don't need to do system administration. I like that someone else is taking care of keeping the components of the framework up-to-date, fixing bugs, etc. I can just be a science tester/user.

Pangeo is a great community. Through pangeo I've found out about things I didn't know about, like Zarr and daskernetes, and I've met researchers with similar big data issues.

mrocklin commented 6 years ago

This is a great report. Thanks @rsignell-usgs. Now can I ask, what don't you like about this system?

If say, a well funded government agency were to come by, where would you ask them to devote their resources?

rsignell-usgs commented 6 years ago

Now can I ask, what don't you like about this system?

These are not really things I don't like, but things I haven't figured out yet:

I'm also wondering what it would take for folks to use this for their day-to-day compute environment. Certainly some way to track use and have folks pay, right? Perhaps with some subscription plan?

It would also be great to see additional language interfaces for storage solutions like Zarr and HSDS, so that they might have a chance as de facto standards for Cloud-optimized multidimensional data.

Perhaps https://github.com/constantinpape/z5 is a start?

If say, a well funded government agency were to come by, where would you ask them to devote their resources?

I'm not sure what the priorities of DOD would be. :smile_cat:

jacobtomlinson commented 6 years ago

I'm particularly interested in some of these questions about how to allocate cost to users. Things like:

I'm currently looking into much of the customization stuff. We are finding JupyterLab rather inflexible: if we want to let users add custom extensions, it basically requires them to reinstall JupyterLab.

rabernat commented 6 years ago

@rsignell-usgs -- I'm happy to give you full access to our GCS allocation. Just give me your google username.

stale[bot] commented 6 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] commented 6 years ago

This issue has been automatically closed because it had not seen recent activity. The issue can always be reopened at a later date.