pangeo-data / pangeo

Pangeo website + discussion of general issues related to the project.
http://pangeo.io

Lessons learned / action items following the European Pangeo Meeting? #646

Closed willirath closed 5 years ago

willirath commented 5 years ago

Sadly, we did not (or did we?) keep minutes as we went. So this is an attempt to collect possible points to follow up on that have arisen during the meeting in Paris.

OT: Many thanks to @guillaumeeb, @lesteve et al. for the organisation.

arokem commented 5 years ago

I'm curious about this one:

Explicitly creating spectra via FFT not necessary when all you're interested in is an integral diagnostic like kinetic energy broken down to only a few frequency bands.

Is this also true for bivariate time-series measures (e.g., coherence in particular spectral bands)? Do you have some good pointers/examples?
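One way to read the claim above: when the diagnostic is a reduction over a band, you only need a sum over a few FFT bins, never the spectrum as a stored product. A minimal NumPy sketch (the signal and band edges are invented for the example, and the tone frequencies sit exactly on FFT bins so there is no leakage):

```python
import numpy as np

def band_energy(x, f_lo, f_hi, fs=1.0):
    """Mean-square energy of `x` in the frequency band [f_lo, f_hi)."""
    n = len(x)
    coeffs = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    # One-sided mean-square contribution per bin (Parseval); the factor 2
    # accounts for the discarded negative frequencies (exact away from
    # the DC and Nyquist bins).
    power = 2.0 * np.abs(coeffs) ** 2 / n**2
    band = (freqs >= f_lo) & (freqs < f_hi)
    return power[band].sum()

# Two pure tones: amplitude 2.0 at 0.05 cycles/sample, 0.5 at 0.20.
t = np.arange(1000)
x = 2.0 * np.sin(2 * np.pi * 0.05 * t) + 0.5 * np.sin(2 * np.pi * 0.20 * t)

low = band_energy(x, 0.04, 0.06)   # recovers amplitude**2 / 2 = 2.0
high = band_energy(x, 0.19, 0.21)  # recovers 0.5**2 / 2 = 0.125
```

The full spectrum is computed transiently per chunk but reduced immediately, so nothing spectrum-sized has to be materialised across a large dataset. Bivariate measures like band-averaged coherence should admit the same pattern (accumulate cross- and auto-spectral sums over the band), though the thread does not spell that out.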

guillaumeeb commented 5 years ago

I want to thank @apatlpo also for the organization!

I think he and @lesommer tried to take some notes of the exchanges. We need to work together on a blog post summarizing the key points of our discussion. Here is what I can come up with as a first compilation:

European users (especially in France, and probably excepting the UK) are more dependent on HPC systems

This is probably due to the fact that the major cloud providers are US-based and we don't want them to take our data, but also that we don't have organizations big enough at the country level (like NASA, for instance) to get an interesting partnership.

This leads to some actions we need to take:

Ifremer, with the help of CNES and CLS, has run some nice experiments on HPC platforms

Despite our easier access to HPC systems, we need to take advantage of the European DIAS to show the power of Pangeo

We need to share and define some optimal ways of doing distributed stuff

Contribute, contribute, contribute !

Man, the Met Office's Informatics Lab talk was awesome!

They're just doing extremely cool things with the cloud and Pangeo, we've got a lot to learn from them! 👏 @jacobtomlinson and @DPeterK!

mrocklin commented 5 years ago

Thanks for the writeup @guillaumeeb !

[Zarr] makes it also really performant on an HPC system, an order of magnitude faster than NetCDF.

I am surprised by this. I would be curious to learn more about the challenges of HDF5/NetCDF on HPC. I would have expected it to be well tuned. I wonder if there are things we can do to make HDF5/NetCDF more efficient when used with Dask/Xarray/Iris.

Improving dask-jobqueue's ease of use, as @willirath mentioned. So both the library and how to use it.

I am curious about the kinds of improvements that this group would find most valuable.

guillaumeeb commented 5 years ago

For the NetCDF vs Zarr part, here is how I see it (granted, I'm really not an expert in the NetCDF file format):

In the end, I consider NetCDF as a good archive format, or a good format if you are only interested in a few gigabytes of data that are stored in one file and you want to work on it from a personal computer.

On the contrary, Zarr has been designed with object stores and independent chunks in mind, so it is perfect for intensive parallel reads and writes on any storage infrastructure. Just beware of the chunk size!

For dask-jobqueue, I'm biased, as I am one of the maintainers. It would be better if a more scientific person using it tried to answer. But here is what I heard:

guillaumeeb commented 5 years ago

I've put my first comment on a Medium page, just so that we can work on it, but I'm not sure this is the best way.

https://medium.com/@g.eynard.bontemps/the-2019-european-pangeo-meeting-437170263123

Maybe we should work on hackmd.io?

fbriol commented 5 years ago

During my research on the Zarr format, I found two interesting sources. The first, https://github.com/HumanCellAtlas/table-testing, is a repository containing sources for measuring the performance of different file formats: HDF5, Parquet, Zarr, and others. These tests were used by HumanCellAtlas to select the best storage candidate, and it chose Zarr. Raw results are available in the subdirectories of the tasks directory.

An article on the blog http://alimanfoo.github.io/2016/09/21/genotype-compression-benchmark.html lists different tests of Zarr's compression filters: compression ratio, speed, effect of chunk size on compression, etc. A very useful article for choosing the right compression filter.

tinaok commented 5 years ago

Hi, here is the feedback we ( @apatlpo and I ) have from the IFREMER side, as a basis for creating a kind of guideline for good use of Pangeo on HPC systems. Three potentially important aspects to consider for Pangeo on HPC:

  1. Filesystem constraints:

    • What filesystem am I on (e.g. GPFS, Lustre, NFS, ...)?
    • Chunk size (default = dask.config.get('array.chunk-size') = 128MiB): what is the optimal file size for this filesystem (you need to ask the admins about this)? Could users run a benchmark to find out?
    • Use of user-defined filesystem options to improve performance? For example, Lustre striping should be 1 (lfs setstripe -c 1 zarrfile) for Zarr directories.
  2. CPU time and communication overhead:

    • optimal chunk size (easy to modify)
    • optimal chunk order when rechunking (maybe less easy to modify; depends on the calculation and potentially on internal Dask mechanisms)
    • optimal number of threads per Dask process, to maximise the usage of each HPC node
  3. Memory load:

    • depends on chunk size
    • existence and size of local temporary storage (spill settings in ~/.config/dask/distributed.yaml)
    • will I be able to do everything in memory, or should I manually create temporary Zarr files?
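The chunk-size and spill points above can be sketched with Dask's configuration system (the values here are invented placeholders for the example, not tuning recommendations):

```python
import dask
import dask.array as da

# "auto" chunking consults the array.chunk-size setting (default 128MiB),
# which can be overridden globally, per context, or in ~/.config/dask/.
with dask.config.set({"array.chunk-size": "64MiB"}):
    x = da.ones((8192, 8192), dtype="f8")  # chunk shape derived from the config
    n_chunks = x.npartitions

# Spill-to-disk behaviour lives under the distributed.worker.memory keys,
# normally set in ~/.config/dask/distributed.yaml.
spill = dask.config.get("distributed.worker.memory.spill", None)
```

Running a small benchmark like this with a few chunk sizes against the site filesystem is one cheap way to answer the "what is optimal here?" questions before asking the admins.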
guillaumeeb commented 5 years ago

Hi all, I've tried to update my first medium draft with all the things said here.

Please give me some feedback, I feel this is close to publication.

cc @fbriol @apatlpo @lesommer @jacobtomlinson @DPeterK @tinaok and others that were not there @rabernat @mrocklin.

guillaumeeb commented 5 years ago

Thanks @willirath and @DPeterK for rereading. I've published the post, but I'm not sure how to make it appear in the Pangeo Medium namespace. Can you do it, @rabernat?

fbriol commented 5 years ago

News about the convergence between netCDF and Zarr here: https://www.unidata.ucar.edu/blogs/news/entry/netcdf-and-native-cloud-storage

pl-marasco commented 5 years ago

@guillaumeeb Have you any news on this?

We've already started to talk (thanks to ECMWF) with Eumetsat "institutional" DIAS. They seem to be eager to work with us, so we will need to find some bandwidth. I'm personally also talking with Onda DIAS.

If it's possible I would like to know the "state of the art" of this discussion.

guillaumeeb commented 5 years ago

About ECMWF Wekeo DIAS, see this comment.

About the Onda DIAS, I'm currently in touch with the OVH cloud provider; I'm hoping to start something in September or October. It seems the OVH offering has everything Pangeo needs, and the Jupyter folks recently started using it for MyBinder. I'll put more information here as soon as I have some!