pangeo-data / pangeo

Pangeo website + discussion of general issues related to the project.
http://pangeo.io

Lessons learned / action items following the European Pangeo Meeting? #646

Closed willirath closed 5 years ago

willirath commented 5 years ago

Sadly, we did not (or did we?) keep minutes as we went. So this is an attempt to collect possible points to follow up on that have arisen during the meeting in Paris.

OT: Many thanks to @guillaumeeb, @lesteve et al. for the organisation.

arokem commented 5 years ago

I'm curious about this one:

Explicitly creating spectra via FFT not necessary when all you're interested in is an integral diagnostic like kinetic energy broken down to only a few frequency bands.

Is this also true for bivariate time-series measures (e.g., coherence in particular spectral bands)? Do you have some good pointers/examples?
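One way to read the claim above: when the diagnostic is a reduction over a band, you only need a sum over a few FFT bins, never the spectrum as a stored product. A minimal NumPy sketch (the signal and band edges are invented for the example, and the tone frequencies sit exactly on FFT bins so there is no leakage):

```python
import numpy as np

def band_energy(x, f_lo, f_hi, fs=1.0):
    """Mean-square energy of `x` in the frequency band [f_lo, f_hi)."""
    n = len(x)
    coeffs = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    # One-sided mean-square contribution per bin (Parseval); the factor 2
    # accounts for the discarded negative frequencies (exact away from
    # the DC and Nyquist bins).
    power = 2.0 * np.abs(coeffs) ** 2 / n**2
    band = (freqs >= f_lo) & (freqs < f_hi)
    return power[band].sum()

# Two pure tones: amplitude 2.0 at 0.05 cycles/sample, 0.5 at 0.20.
t = np.arange(1000)
x = 2.0 * np.sin(2 * np.pi * 0.05 * t) + 0.5 * np.sin(2 * np.pi * 0.20 * t)

low = band_energy(x, 0.04, 0.06)   # recovers amplitude**2 / 2 = 2.0
high = band_energy(x, 0.19, 0.21)  # recovers 0.5**2 / 2 = 0.125
```

The full spectrum is computed transiently per chunk but reduced immediately, so nothing spectrum-sized has to be materialised across a large dataset. Bivariate measures like band-averaged coherence should admit the same pattern (accumulate cross- and auto-spectral sums over the band), though the thread does not spell that out.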

guillaumeeb commented 5 years ago

I want to thank @apatlpo also for the organization!

I think he and @lesommer tried to take some notes of the exchanges. We need to work together on a blog post summarizing the key points of our discussion. Here is what I can come up with as a first compilation:

European users (especially in France, and probably excepting the UK) are more dependent on HPC systems

This is probably due to the fact that the major cloud providers are US-based and we don't want them to take our data, but also that we don't have organizations big enough at the country level (like NASA, for instance) to get an interesting partnership.

This leads to some actions we need to take:

Ifremer, with the help of CNES and CLS, has run some nice experiments on HPC platforms

Despite our easier access to HPC systems, we need to take advantage of the European DIAS to show the power of Pangeo

We need to share and define some optimal ways of doing distributed stuff

Contribute, contribute, contribute !

Man, the Met Office's Informatics Lab talk was awesome!

They're just doing extremely cool things with the cloud and Pangeo, we've got a lot to learn from them! 👏 @jacobtomlinson and @DPeterK!

mrocklin commented 5 years ago

Thanks for the writeup @guillaumeeb !

[Zarr] makes it also really performant on an HPC system, an order of magnitude faster than NetCDF.

I am surprised by this. I would be curious to learn more about the challenges of HDF5/NetCDF on HPC. I would have expected it to be well tuned. I wonder if there are things we can do to make HDF5/NetCDF more efficient when used with Dask/Xarray/Iris.

Improving dask-jobqueue's ease of use, as @willirath mentioned. So both the library and how to use it.

I am curious about the kinds of improvements that this group would find most valuable.

guillaumeeb commented 5 years ago

For the NetCDF vs Zarr part, here is how I see it (granted, I'm really not an expert in the NetCDF file format):

In the end, I consider NetCDF as a good archive format, or a good format if you are only interested in a few gigabytes of data that are stored in one file and you want to work on it from a personal computer.

On the contrary, Zarr has been designed with object stores and independent chunks in mind, so it is perfect for intensive parallel reads and writes on any storage infrastructure. Just beware of the chunk size!

For dask-jobqueue, I'm biased, as I am one of the maintainers. It would be better if a more scientific person using it tried to answer. But here is what I heard:

guillaumeeb commented 5 years ago

I've put my first comment on a Medium page, just so that we can work on it, but I'm not sure this is the best way.

https://medium.com/@g.eynard.bontemps/the-2019-european-pangeo-meeting-437170263123

Maybe we should work on hackmd.io?

fbriol commented 5 years ago

During my research on the Zarr format, I found two interesting sources. The first, https://github.com/HumanCellAtlas/table-testing, is a repository containing sources for measuring the performance of different file formats: HDF5, Parquet, Zarr, and others. These tests were used by HumanCellAtlas to select the best storage candidate, and it chose Zarr. Raw results are available in the subdirectories of the tasks directory.

An article on the blog http://alimanfoo.github.io/2016/09/21/genotype-compression-benchmark.html lists different tests of Zarr's compression filters: compression ratio, speed, effect of chunk size on compression, etc. A very useful article for choosing the right compression filter.

tinaok commented 5 years ago

Hi, here is the feedback we ( @apatlpo and I ) have from the IFREMER side, as a basis for creating a kind of guideline for good use of Pangeo on HPC systems. Three potentially important aspects to consider for Pangeo on HPC:

  1. Filesystem constraints:

    • What filesystem am I on (e.g. GPFS, Lustre, NFS, ...)?
    • Chunk size (default = dask.config.get('array.chunk-size') = 128MiB): what is the optimal file size for this filesystem (you need to ask the admins about this)? Could users run a benchmark to find out?
    • Use of user-defined filesystem options to improve performance? For example, Lustre striping should be 1 (lfs setstripe -c 1 zarrfile) for Zarr directories.
  2. CPU time and communication overhead:

    • optimal chunk size (easy to modify)
    • optimal chunk order when rechunking (maybe less easy to modify; depends on the calculation and potentially on internal Dask mechanisms)
    • optimal number of threads per Dask process, to maximise the usage of each HPC node
  3. Memory load:

    • depends on chunk size
    • existence and size of local temporary storage (spill settings in ~/.config/dask/distributed.yaml)
    • will I be able to do everything in memory, or should I manually create temporary Zarr files?
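The chunk-size and spill points above can be sketched with Dask's configuration system (the values here are invented placeholders for the example, not tuning recommendations):

```python
import dask
import dask.array as da

# "auto" chunking consults the array.chunk-size setting (default 128MiB),
# which can be overridden globally, per context, or in ~/.config/dask/.
with dask.config.set({"array.chunk-size": "64MiB"}):
    x = da.ones((8192, 8192), dtype="f8")  # chunk shape derived from the config
    n_chunks = x.npartitions

# Spill-to-disk behaviour lives under the distributed.worker.memory keys,
# normally set in ~/.config/dask/distributed.yaml.
spill = dask.config.get("distributed.worker.memory.spill", None)
```

Running a small benchmark like this with a few chunk sizes against the site filesystem is one cheap way to answer the "what is optimal here?" questions before asking the admins.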
guillaumeeb commented 5 years ago

Hi all, I've tried to update my first medium draft with all the things said here.

Please give me some feedback, I feel this is close to publication.

cc @fbriol @apatlpo @lesommer @jacobtomlinson @DPeterK @tinaok and others that were not there @rabernat @mrocklin.

guillaumeeb commented 5 years ago

Thanks @willirath and @DPeterK for rereading. I've published the post, but I'm not sure how to make it appear in the Pangeo Medium namespace. Can you do it, @rabernat?

fbriol commented 5 years ago

News about the convergence between netCDF and Zarr here: https://www.unidata.ucar.edu/blogs/news/entry/netcdf-and-native-cloud-storage

pl-marasco commented 5 years ago

@guillaumeeb Have you any news on this?

We've already started to talk (thanks to ECMWF) with Eumetsat "institutional" DIAS. They seem to be eager to work with us, so we will need to find some bandwidth. I'm personally also talking with Onda DIAS.

If it's possible I would like to know the "state of the art" of this discussion.

guillaumeeb commented 5 years ago

About ECMWF Wekeo DIAS, see this comment.

About the Onda DIAS, I'm currently in touch with the OVH cloud provider; I'm hoping to start something in September or October. It seems the OVH offering has everything Pangeo needs, and the Jupyter folks recently started using it for MyBinder. I'll put more information here as soon as I have some!