- Names of institutions which participate or use the project. If you happen to have logo images around that would also be welcome, but I can hunt those down as well.
GEOMAR, Kiel, Germany (Logos are here)
heavily uses Dask Array (mostly via Xarray) and the whole Pangeo stack for analysis of ocean model data, mostly on HPC or other dedicated compute servers,
and has adopted Dask (using delayed, bag, etc.) as an easy-to-use tool for all sorts of parallelisation.
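As a rough illustration of that ad-hoc parallelisation pattern, here is a minimal dask.delayed sketch; the `process` and `combine` functions are hypothetical stand-ins, not GEOMAR code:

```python
import dask

@dask.delayed
def process(path):
    # Stand-in for any expensive per-file step (reading, statistics, ...).
    return len(path)

@dask.delayed
def combine(results):
    # Reduce all per-file results into a single value.
    return sum(results)

paths = [f"run_{i:02d}.nc" for i in range(10)]
total = combine([process(p) for p in paths])
print(total.compute())  # builds the task graph lazily, then runs it in parallel
```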
- A range of data sizes from your domain
Gigabytes to hundreds of terabytes.
- What did people use before Pangeo
For the terabyte range, we used standalone tools like NCO and CDO, plus home-rolled, more domain-specific things, with lots of getting lost in layers of processed data on disk.
For bigger-than-memory stuff, many used Ferret.
- Favorite feature of Dask
Easy deployment across all platforms.
- Biggest pain point of Dask
Memory management with Dask array.
A high resolution image or two that can serve as backdrop for a slide or webpage. I'm looking for maximal eye candy here :) Preferably something science-y rather than dask-y. No dashboard/notebook images, but maybe something showing the world's oceans or atmosphere in interesting ways, particularly if it feels like something you'd see in a magazine. It doesn't need to have been made with Dask or Python.
Snapshot of simulated sea surface temperature in the eastern tropical Pacific. This highlights the "cold tongue" (blue is cold; white is warm). The aspect ratio is not the best for a slide, though.
http://cherian.net/static/eq-pac-sst.png
Names of institutions which participate or use the project. If you happen to have logo images around that would also be welcome, but I can hunt those down as well.
NCAR ; logos here: https://drive.google.com/drive/folders/1d7lBCqFSF7ZglkScMbMD7P8_WiIypvtj
A range of data sizes from your domain
The dataset shown above is 6 TB. We have another one that is 14 TB.
What did people use before Pangeo
MATLAB, NCL
Favorite feature of Dask
Biggest pain point of Dask
It is too easy to generate graphs that overwhelm the scheduler or use too much memory (usually when slicing and/or rechunking); the sketch below shows how quickly task counts grow.
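To make that concrete, a hedged sketch of how an innocuous-looking rechunk inflates the task graph; the shapes and chunk sizes are illustrative, not the actual NCAR datasets:

```python
import dask.array as da

# A 100,000 x 100,000 array in 1,000 x 1,000 chunks: 10,000 chunks already.
x = da.zeros((100_000, 100_000), chunks=(1_000, 1_000))

# Rechunking to tall, thin chunks couples many input chunks to every
# output chunk, so the graph grows far beyond the original task count.
y = x.rechunk((100_000, 100))

print(len(x.__dask_graph__()))  # tasks before rechunking
print(len(y.__dask_graph__()))  # tasks after: a much larger graph
```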
I'm not sure whether you are addressing Pangeo "core" members or not with this post. In any case, we have a group at Ifremer that works exclusively with xarray/pandas/dask tools and strongly benefits from Pangeo:
A high resolution image
Satellite ocean color images can be quite spectacular:
[bloom in the Baltic Sea](https://earthobservatory.nasa.gov/images/86449/blooming-baltic)
eddies in the Tasman Sea: search for "Spring Blooms around Southeastern Australia" on the webpage
Names of institutions which participate or use the project. If you happen to have logo images around that would also be welcome, but I can hunt those down as well.
Ifremer - the French national marine institute - logo
- A range of data sizes from your domain
from 10 GB to 10 TB: Dask DataFrames and chunked xarray multidimensional arrays, run with dask-distributed and dask-jobqueue on an HPC cluster (a minimal sketch of that setup follows below)
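A minimal sketch of the dask-jobqueue pattern mentioned above, assuming a PBS batch scheduler; the queue name, resources, and walltime are placeholders rather than Ifremer's actual configuration:

```python
from dask.distributed import Client
from dask_jobqueue import PBSCluster

# Each PBS job hosts one Dask worker; the cluster scales by submitting jobs.
cluster = PBSCluster(
    cores=28,              # cores per job (placeholder)
    memory="120GB",        # memory per job (placeholder)
    queue="mpi",           # placeholder queue name
    walltime="02:00:00",
)
cluster.scale(jobs=10)     # ask the batch system for ten worker jobs
client = Client(cluster)   # connect; computations now run on the cluster
```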
- What did people use before Pangeo
numpy / mpi4py
- Favorite feature of Dask
- Biggest pain point of Dask
Hi Matt,
Another French contribution:
A pretty picture of modelled surface currents can be downloaded here: https://cloud.ocean-next.fr/s/AKSW9a2wwj3Xeme, and the corresponding video is https://vimeo.com/300943265 (see the video details for more information on the simulation).
Logos are already on the picture, the main institutions are IGE lab (http://www.ige-grenoble.fr/) and Ocean Next company (https://www.ocean-next.fr/).
Our entire dataset is almost 2 PB; we deal with terabytes of data daily to produce scientific analyses and to develop subgrid closures with machine learning applied to coarse-grained model data (Dask is used for preprocessing).
Fortran tools and CDO for preprocessing; MATLAB, Python, and NCL for plots.
Easy deployment, which makes it simple to scale an analysis pipeline from a laptop to a supercomputer; also the dashboard.
Memory issues
Hi Matt,
This picture shows the gradient of surface temperature and how mesoscale eddies enhance the gradient (blue is a strong gradient, white is weak). This was actually done with Dask, hehe.
Names of institutions which participate or use the project. If you happen to have logo images around that would also be welcome, but I can hunt those down as well.
This was part of my PhD at Columbia University in collaboration with @rabernat
A range of data sizes from your domain
I am using a large variety of datasets, from megabytes (ship sections) up to ~PB-scale datasets like CMIP6 (which I am working on currently). I am using Dask for all but the smallest datasets, usually in combination with xarray.
What did people use before Pangeo
I used mostly MATLAB. The moment I decided to completely switch to Python was when @rabernat showed me how to use Dask's map_blocks, which made my analysis run in 17 minutes instead of hours!
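For readers who haven't met it, a minimal map_blocks sketch; the per-chunk function here is a generic stand-in, not the analysis from this comment:

```python
import dask.array as da

def demean(block):
    # Any plain-NumPy function; it is applied to each chunk independently.
    return block - block.mean()

x = da.random.random((40_000, 40_000), chunks=(4_000, 4_000))
y = x.map_blocks(demean)          # one task per chunk, run in parallel
result = y[:100, :100].compute()  # only the needed chunks are computed
```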
Favorite feature of Dask
Hands down the integration with xarray. The fact that I can develop analyses on small datasets and usually just scale them up to medium sized datasets (GB-TB) without much hassle is amazing.
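A hedged sketch of that xarray integration: identical analysis code runs eagerly on a small file and lazily, on Dask chunks, for a large one. The filename, variable name, and chunk sizes below are placeholders:

```python
import xarray as xr

# chunks=... turns every variable into a dask array; omit it for small
# files and exactly the same analysis code runs eagerly on NumPy.
ds = xr.open_dataset("model_output.nc", chunks={"time": 100})

sst_clim = ds["sst"].groupby("time.month").mean("time")  # still lazy
sst_clim.load()  # trigger the parallel computation
```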
Biggest pain point of Dask
In my current work with CMIP6 I am struggling a lot with problems related to memory overload/killed workers. For now these break the 'scalability' with very large datasets (like many models with several members), and I have to work around these by e.g. precomputing single variables/members or other subsets of the data.
That image is beautiful
I'm not sure whether you are addressing Pangeo "core" members or not with this post.
@apatlpo - No such thing really exists. 😄
Rhodium Group (logo) and the Climate Impact Lab (logo)
This figure is from Hsiang et al. (2017) and shows the county-level median values for average 2080-2099 RCP 8.5 impacts, relative to "no additional climate change" trajectories. This was not done with Dask, but it is an example of the type of work that we do.
Our data ranges from gigabytes to hundreds of terabytes.
Before Pangeo (in this context, really Kubernetes and Google Cloud), we used supercomputing resources available through the University of California, Berkeley (one of the Climate Impact Lab partners): lots of Slurm batch jobs. For post-processing and analysis, NCO, CDO, and other similar tools.
We use Dask (and Kubernetes) very extensively for our computing, and it allows for flexible, scalable computing environments that are compatible across our various workflows, which vary substantially between projects and client needs. Much of our analysis on very large datasets would not be possible without Dask and other tools that are part of the Pangeo stack.
One of our biggest pain points is figuring out the best way to make jobs sufficiently small to not encounter memory issues while also not overwhelming the scheduler.
Hi Everyone,
I'm putting together a "Who uses Dask?" presentation for AnacondaCon this year and would like to include a Pangeo section.
If people here are willing, I'd love to get the following information from a few folks:
No pressure, but if people have these things lying around and can send them in the next couple of days, I think it would be great to include them in the talk. I have to compile, record, and submit a recording by May 25th.
Best, -matt