pangeo-data / pangeo

Pangeo website + discussion of general issues related to the project.
http://pangeo.io

Eye candy request for "Who Uses Dask?" presentation #778

Closed mrocklin closed 4 years ago

mrocklin commented 4 years ago

Hi Everyone,

I'm putting together a "Who uses Dask?" presentation for AnacondaCon this year and would like to include a Pangeo section.

If people here are willing, I'd love to get the following information from a few folks:

  • Names of institutions which participate or use the project. If you happen to have logo images around that would also be welcome, but I can hunt those down as well.
  • A range of data sizes from your domain
  • What did people use before Pangeo
  • Favorite feature of Dask
  • Biggest pain point of Dask
  • A high-resolution image or two that can serve as backdrop for a slide or webpage

No pressure, but if people have these things lying around and can send them in the next couple of days, I think it would be great to include them in the talk. I have to compile, record, and submit a recording by May 25th.

Best, -matt

willirath commented 4 years ago
  • Names of institutions which participate or use the project. If you happen to have logo images around that would also be welcome, but I can hunt those down as well.

GEOMAR, Kiel, Germany (Logos are here)

  • A range of data sizes from your domain

Gigabytes to hundreds of terabytes.

  • What did people use before Pangeo
  • Favorite feature of Dask

Easy deployment across all platforms.

  • Biggest pain point of Dask

Memory management with Dask array.

dcherian commented 4 years ago

A high-resolution image or two that can serve as backdrop for a slide or webpage. I'm looking for maximal eye candy here :) Preferably something science-y rather than dask-y. No dashboard/notebook images, but maybe something showing the world's oceans or atmosphere in interesting ways, particularly if it feels like something you'd see in a magazine. It doesn't need to have been made with Dask or Python.

Snapshot of simulated sea surface temperature in the eastern tropical Pacific. This highlights the "cold tongue" (blue is cold; white is warm). The aspect ratio is not the best for a slide, though.

http://cherian.net/static/eq-pac-sst.png

Names of institutions which participate or use the project. If you happen to have logo images around that would also be welcome, but I can hunt those down as well.

NCAR; logos here: https://drive.google.com/drive/folders/1d7lBCqFSF7ZglkScMbMD7P8_WiIypvtj

A range of data sizes from your domain

This one is 6 TB. We have another that is 14 TB.

What did people use before Pangeo

MATLAB, NCL

Favorite feature of Dask

  1. Integration with xarray (is this cheating? :P)
  2. Just set up a cluster and go. Code stays the same whether I work on a laptop, the cloud, or HPC.
  3. I usually don't need to think too much about the details of the parallelism.
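The "code stays the same" point can be sketched minimally (toy data, and a plain sum standing in for a real analysis): the same Dask graph runs unchanged under different executors. On HPC or cloud one would instead connect a `dask.distributed` Client (e.g. via dask-jobqueue) and call the identical `.compute()`.

```python
import dask.array as da

# A toy computation; a real pipeline would be an xarray/dask workflow
x = da.random.random((1000, 1000), chunks=(250, 250))
total = x.sum()

# Swap the executor, not the analysis code: the same graph runs
# threaded on a laptop or serially for debugging.
threaded = total.compute(scheduler="threads")
serial = total.compute(scheduler="synchronous")

# On a cluster, connecting a distributed Client (e.g. via
# dask-jobqueue's SLURMCluster) makes this same .compute() scale out.
```

The random seeds are baked into the graph at construction, so both computes reduce the same values and agree.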

Biggest pain point of Dask

It's too easy to generate graphs that overwhelm the scheduler or use too much memory (usually when slicing and/or rechunking).
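A minimal illustration of that failure mode (array shapes invented): rechunking across the chunked axis forces every input chunk to contribute to every output chunk, so the task graph grows sharply, which is what overwhelms the scheduler at scale.

```python
import dask.array as da

# One chunk per "time step": 1000 chunks, roughly one task each
x = da.zeros((1000, 100, 100), chunks=(1, 100, 100))

# Rechunking to full-time / small-space chunks mixes all 1000 input
# chunks into each output chunk, inflating the graph considerably
y = x.rechunk((1000, 10, 10))

print(x.npartitions, len(dict(x.__dask_graph__())))
print(y.npartitions, len(dict(y.__dask_graph__())))
```

The absolute counts depend on the Dask version, but the rechunked graph is always substantially larger than the input graph.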

apatlpo commented 4 years ago

I'm not sure whether you are addressing Pangeo "core" members or not with this post. In any case, we have a group at Ifremer that works exclusively with xarray/pandas/dask tools and benefits strongly from Pangeo:

Ifremer (French national marine institute); logo

  • A range of data sizes from your domain

From 10 GB to 10 TB: dask DataFrames and chunked xarray multidimensional arrays, run with dask-distributed and dask-jobqueue on an HPC cluster.

  • What did people use before Pangeo

numpy / mpi4py

  • Favorite feature of Dask
  • Biggest pain point of Dask
auraoupa commented 4 years ago

Hi Matt,

Another French contribution:

A pretty picture of modelled surface currents can be downloaded here: https://cloud.ocean-next.fr/s/AKSW9a2wwj3Xeme; the corresponding video is at https://vimeo.com/300943265 (see the video details for more information on the simulation).

Logos are already on the picture, the main institutions are IGE lab (http://www.ige-grenoble.fr/) and Ocean Next company (https://www.ocean-next.fr/).

Our entire dataset is almost 2 PB. We deal daily with terabytes of data to produce scientific analyses and to develop subgrid closures with machine learning applied to coarse-grained model data (Dask is used for preprocessing).

Fortran tools and CDO for preprocessing; MATLAB, Python, and NCL for plots.

Easy deployment, which makes it simple to scale an analysis pipeline from a laptop up to a supercomputer; also the dashboard.

Memory issues

jbusecke commented 4 years ago

Hi Matt,

[image: Visualisation_ScienceArt-JBusecke] This picture shows the gradient of surface temperature and how mesoscale eddies enhance the gradient (blue is a strong gradient, white is weak). This was actually done with Dask, hehe.

Names of institutions which participate or use the project. If you happen to have logo images around that would also be welcome, but I can hunt those down as well.

This was part of my PhD at Columbia University in collaboration with @rabernat

A range of data sizes from your domain

I am using a large variety of datasets, from Megabytes (Ship sections), up to ~PB scale datasets like CMIP6 (which I am working on currently). I am using dask for all but the smallest datasets, usually in combination with xarray.

What did people use before Pangeo

I mostly used MATLAB. The moment I decided to switch completely to Python was when @rabernat showed me how to use Dask's map_blocks, which made my analysis run in 17 minutes instead of hours!
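The map_blocks pattern mentioned above, in minimal form (the anomaly function and array shapes are invented stand-ins for the actual analysis): a plain NumPy function is applied independently to each chunk, so the work parallelizes with no further changes.

```python
import dask.array as da

def remove_time_mean(block):
    # Runs on one in-memory NumPy chunk at a time. Chunks span the
    # full time axis here, so the block-local mean is the true time mean.
    return block - block.mean(axis=0)

# time x lat x lon, chunked in space only (sizes invented)
sst = da.random.random((365, 90, 180), chunks=(365, 30, 60))
anom = sst.map_blocks(remove_time_mean).compute()
```

Because each chunk is independent, the same call scales from a laptop to a distributed cluster.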

Favorite feature of Dask

Hands down the integration with xarray. The fact that I can develop analyses on small datasets and usually just scale them up to medium sized datasets (GB-TB) without much hassle is amazing.

Biggest pain point of Dask

In my current work with CMIP6 I am struggling a lot with problems related to memory overload and killed workers. For now these break scalability with very large datasets (like many models with several members each), and I have to work around them by, e.g., precomputing single variables/members or other subsets of the data.

mrocklin commented 4 years ago

That image is beautiful


rabernat commented 4 years ago

I'm not sure whether you are addressing Pangeo "core" members or not with this post.

@apatlpo - No such thing really exists. 😄

dgergel commented 4 years ago

Rhodium Group (logo) and the Climate Impact Lab (logo)


This figure is from Hsiang et al. (2017) and shows county-level median values for average 2080-2099 RCP 8.5 impacts, relative to "no additional climate change" trajectories. It was not made with Dask, but it is an example of the type of work that we do.

Our data ranges from gigabytes to hundreds of terabytes.

Before Pangeo (in this context, really Kubernetes and Google Cloud), we used supercomputing resources available through the University of California, Berkeley (one of the Climate Impact Lab partners): lots of Slurm batch jobs. For post-processing and analysis, NCO, CDO, and other similar tools.

We use dask (and kubernetes) very extensively for our computing, and it allows for flexible, scalable computing environments that are compatible across our various workflows, which vary substantially between projects and client needs. Much of our analysis on very large datasets would not be possible without dask and other tools that are part of the pangeo stack.

One of our biggest pain points is figuring out how to make jobs small enough to avoid memory issues while also not overwhelming the scheduler.
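One common rule of thumb for that trade-off (the ~100 MB target is a widely used Dask convention, not something stated in this thread): pick chunks large enough to amortize per-task scheduler overhead, but small enough that several fit comfortably in each worker's memory.

```python
import numpy as np

# Target chunk size: ~100 MB is a common Dask rule of thumb
target_bytes = 100e6
itemsize = np.dtype("float64").itemsize  # 8 bytes per element

elements_per_chunk = int(target_bytes // itemsize)  # 12.5 million

# For a (time, lat, lon) array chunked along time only, on a
# hypothetical 720 x 1440 spatial grid:
spatial_points = 720 * 1440
time_steps_per_chunk = elements_per_chunk // spatial_points
print(time_steps_per_chunk)  # ~12 time steps per ~100 MB chunk
```

Too few elements per chunk means millions of tasks and a swamped scheduler; too many means workers exceed memory and get killed, which matches the pain points reported throughout this thread.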

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] commented 4 years ago

This issue has been automatically closed because it had not seen recent activity. The issue can always be reopened at a later date.