Eye candy request for "Who Uses Dask?" presentation

pangeo-data / pangeo

Pangeo website + discussion of general issues related to the project.

http://pangeo.io

Eye candy request for "Who Uses Dask?" presentation #778

Closed mrocklin closed 4 years ago

mrocklin commented 4 years ago

Hi Everyone,

I'm putting together a "Who uses Dask?" presentation for AnacondaCon this year and would like to include a Pangeo section.

If people here are willing, I'd love to get the following information from a few folks:

A high resolution image or two that can serve as backdrop for a slide or webpage. I'm looking for maximal eye candy here :) Preferably something scienc-y rather than dask-y. No dashboard/notebook images, but maybe something showing the world's oceans or atmosphere in interesting ways, particularly if it feels like something you'd see in a magazine. Doesn't need to have been made with Dask or Python.
Names of institutions which participate or use the project. If you happen to have logo images around that would also be welcome, but I can hunt those down as well.
A range of data sizes from your domain
What did people use before Pangeo
Favorite feature of Dask
Biggest pain point of Dask

No pressure, but if people have these things lying around and can send them in the next couple of days, I think it would be great to include them in the talk. I have to compile, record, and submit a recording by May 25th.

Best, -matt

willirath commented 4 years ago

Names of institutions which participate or use the project. If you happen to have logo images around that would also be welcome, but I can hunt those down as well.

GEOMAR, Kiel, Germany (Logos are here)

heavily uses Dask Array (mostly via Xarray) and the whole Pangeo stack for analysis of Ocean model data mostly on HPC or other dedicated compute servers
and has adopted Dask (using delayed, bag, etc.) as an easy to use tool for all sorts of parallelisation.

A range of data sizes from your domain

Gigabytes to 100s of Terabytes.

What did people use before Pangeo

For the Terabyte range, we used home-rolled standalone tools like NCO, CDO and more domain-specific things with lots of getting lost in layers of processed data on disk.
For bigger-than-memory stuff, many used Ferret.

Favorite feature of Dask

Easy deployment across all platforms.

Biggest pain point of Dask

Memory management with Dask array.

dcherian commented 4 years ago

A high resolution image or two that can serve as backdrop for a slide or webpage. I'm looking for maximal eye candy here :) Preferably something scienc-y rather than dask-y. No dashboard/notebook images, but maybe something showing the world's oceans or atmosphere in interesting ways, particularly if it feels like something you'd see in a magazine. Doesn't need to have been made with Dask or Python.

Snapshot of simulated sea surface temperature in the eastern tropical Pacific. This highlights the "cold tongue" (blue is cold; white is warm). The aspect ratio is not the best for a slide though.

http://cherian.net/static/eq-pac-sst.png

Names of institutions which participate or use the project. If you happen to have logo images around that would also be welcome, but I can hunt those down as well.

NCAR ; logos here: https://drive.google.com/drive/folders/1d7lBCqFSF7ZglkScMbMD7P8_WiIypvtj

A range of data sizes from your domain

This one is 6TB. We have another one that is 14TB

What did people use before Pangeo

MATLAB, NCL

Favorite feature of Dask

integration with xarray (is this cheating? :P)
just set up a cluster and go. Code stays the same whether I work on laptop/cloud/HPC.
usually don't need to think too much about the details of the parallelism

Biggest pain point of Dask

too easy to generate graphs that overwhelm the scheduler or use too much memory (usually when slicing and/or rechunking)

apatlpo commented 4 years ago

I'm not sure whether you are addressing Pangeo "core" members or not with this post. In any case, we have a group at Ifremer that works exclusively with xarray/pandas/dask tools and benefits strongly from Pangeo:

A high resolution image

Satellite ocean color images can be quite spectacular:

[bloom in the baltic sea](https://earthobservatory.nasa.gov/images/86449/blooming-baltic)

eddies in the Tasman Sea: search for "Spring Blooms around Southeastern Australia" on the webpage

Names of institutions which participate or use the project. If you happen to have logo images around that would also be welcome, but I can hunt those down as well.

Ifremer - French national marine institute - logo

A range of data sizes from your domain

from 10GB to 10TB
dask DataFrames - chunked xarray multidimensional arrays
dask-distributed - dask-jobqueue - HPC cluster

What did people use before Pangeo

numpy / mpi4py

Favorite feature of Dask

dask DataFrames are the bomb for working on large tabular data!
"automatic" distribution and the associated (dramatic) reduction in the number of lines of code (and hence time spent coding)
proactive team that reaches out to earth sciences (not really a feature but needs to be said)

Biggest pain point of Dask

scheduling a large number of tasks may be a bottleneck for large datasets
difficult to interpret dask graphs and act on them (in order to change the task processing order, for example). This is critical for applications where one cannot afford to load all data in memory.

auraoupa commented 4 years ago

Hi Matt,

Another French contribution

A high resolution image or two that can serve as backdrop for a slide or webpage. I'm looking for maximal eye candy here :) Preferably something scienc-y rather than dask-y. No dashboard/notebook images, but maybe something showing the world's oceans or atmosphere in interesting ways, particularly if it feels like something you'd see in a magazine. Doesn't need to have been made with Dask or Python.

A pretty picture of modelled surface currents can be downloaded here : https://cloud.ocean-next.fr/s/AKSW9a2wwj3Xeme, the corresponding video is https://vimeo.com/300943265 (see video details for more information on the simulation)

Names of institutions which participate or use the project. If you happen to have logo images around that would also be welcome, but I can hunt those down as well.

Logos are already on the picture, the main institutions are IGE lab (http://www.ige-grenoble.fr/) and Ocean Next company (https://www.ocean-next.fr/).

A range of data sizes from your domain

Our entire dataset is almost 2 PB; we deal daily with terabytes of data to produce scientific analyses and to develop subgrid closures with machine learning applied to coarse-grained model data (dask is used for preprocessing)

What did people use before Pangeo

Fortran tools, cdo for preprocessing, matlab, python, NCL for plots

Favorite feature of Dask

Easily deployed, makes it easy to scale up an analysis pipeline from laptop to supercomputer, the dashboard

Biggest pain point of Dask

Memory issues

jbusecke commented 4 years ago

Hi Matt,

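[image: Visualisation_ScienceArt-JBusecke] https://user-images.githubusercontent.com/14314623/82692038-f5874b00-9c2c-11ea-963e-513daa64eb7d.png This picture shows the gradient of surface temperature and how mesoscale eddies enhance the gradient (blue is strong gradient, white is weak). This was actually done with dask hehe.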

Names of institutions which participate or use the project. If you happen to have logo images around that would also be welcome, but I can hunt those down as well.

This was part of my PhD at Columbia University in collaboration with @rabernat: https://github.com/jbusecke/busecke_abernathey_2019_sciadv

A range of data sizes from your domain

I am using a large variety of datasets, from Megabytes (Ship sections), up to ~PB scale datasets like CMIP6 (which I am working on currently). I am using dask for all but the smallest datasets, usually in combination with xarray.

What did people use before Pangeo

I used mostly MATLAB. The moment I decided to completely switch to python was when @rabernat showed me how to use dask's map_blocks, which made my analysis run in 17 min instead of hours!

Favorite feature of Dask

Hands down the integration with xarray. The fact that I can develop analyses on small datasets and usually just scale them up to medium sized datasets (GB-TB) without much hassle is amazing.

Biggest pain point of Dask

In my current work with CMIP6 I am struggling a lot with problems related to memory overload/killed workers (https://github.com/dask/distributed/issues/2602). For now these break the 'scalability' with very large datasets (like many models with several members), and I have to work around these by e.g. precomputing single variables/members or other subsets of the data.

mrocklin commented 4 years ago

That image is beautiful


rabernat commented 4 years ago

I'm not sure whether you are addressing Pangeo "core" members or not with this post.

@apatlpo - No such thing really exists. 😄

dgergel commented 4 years ago

Rhodium Group (logo) and the Climate Impact Lab (logo)

This figure is from Hsiang et al (2017) and shows the county level median values for average 2080-2099 RCP 8.5 impacts, relative to “no additional climate change” trajectories. This was not done with dask but is an example of the type of work that we do.

Our data ranges from gigabytes to hundreds of terabytes.

Before Pangeo (in this context, really Kubernetes and Google Cloud), we used supercomputing resources available through the University of California, Berkeley (one of the Climate Impact Lab partners). Lots of Slurm batch jobs. For post-processing and analysis, NCO, CDO, and other similar tools.

We use dask (and kubernetes) very extensively for our computing, and it allows for flexible, scalable computing environments that are compatible across our various workflows, which vary substantially between projects and client needs. Much of our analysis on very large datasets would not be possible without dask and other tools that are part of the pangeo stack.

One of our biggest pain points is figuring out the best way to make jobs sufficiently small to not encounter memory issues while also not overwhelming the scheduler.
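
As a rough illustration of that trade-off (not taken from the thread itself), the sketch below counts the chunks and the memory per chunk for a regularly chunked array. The function name `chunk_tradeoff` and the dataset shape are hypothetical, and it assumes that each chunk maps to at least one task per operation, so the chunk count is only a lower bound on what the scheduler has to track.

```python
from math import ceil, prod


def chunk_tradeoff(shape, chunks, itemsize=8):
    """Return (number of chunks, bytes per full chunk) for a regular chunking.

    shape    -- array shape, e.g. (time, lat, lon)
    chunks   -- chunk shape along the same dimensions
    itemsize -- bytes per element (8 for float64)
    """
    if len(shape) != len(chunks):
        raise ValueError("shape and chunks must have the same number of dimensions")
    # Partial chunks at the array edges still count as chunks, hence ceil().
    n_chunks = prod(ceil(s / c) for s, c in zip(shape, chunks))
    chunk_bytes = prod(chunks) * itemsize
    return n_chunks, chunk_bytes


# Hypothetical dataset: 3650 daily float64 fields on an 1800 x 3600 grid.
shape = (3650, 1800, 3600)
for chunks in [(1, 1800, 3600), (10, 1800, 3600), (365, 1800, 3600)]:
    n, nbytes = chunk_tradeoff(shape, chunks)
    print(f"chunks={chunks}: {n} chunks of {nbytes / 1e6:.0f} MB each")
```

Smaller chunks lower the memory needed per task but multiply the number of tasks; larger chunks do the opposite. The right balance depends on worker memory and on the operations applied, which this sketch does not model.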

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] commented 4 years ago

This issue has been automatically closed because it had not seen recent activity. The issue can always be reopened at a later date.