willirath closed this issue 5 years ago
I'm curious about this one:
Explicitly creating spectra via FFT is not necessary when all you're interested in is an integral diagnostic like kinetic energy broken down into only a few frequency bands.
Is this also true for bivariate time-series measures (e.g., coherence in particular spectral bands)? Do you have some good pointers/examples?
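Not from the discussion itself, but a minimal numpy sketch of the univariate case above: by Parseval's theorem, the kinetic energy in a frequency band can be estimated by time-domain filtering and averaging the squared signal, without ever forming a full spectrum. The synthetic signal, the first-order recursive filter, and the two-band split are all illustrative assumptions.

```python
import numpy as np

# Illustrative sketch: band-limited kinetic energy without an explicit FFT.
# The variance of a filtered velocity equals the spectral energy passed by
# the filter, so a streaming time-domain filter suffices for a few bands.

t = np.arange(4096)
# Synthetic "velocity": a slow and a fast oscillation (both hypothetical).
u = np.sin(2 * np.pi * 0.01 * t) + 0.5 * np.sin(2 * np.pi * 0.2 * t)

def ema_lowpass(x, alpha):
    """First-order recursive low-pass (exponential moving average)."""
    y = np.empty_like(x)
    y[0] = x[0]
    for i in range(1, len(x)):
        y[i] = alpha * x[i] + (1 - alpha) * y[i - 1]
    return y

u_low = ema_lowpass(u, alpha=0.3)          # low-frequency part
ke_total = 0.5 * np.mean(u ** 2)           # total kinetic energy
ke_low = 0.5 * np.mean(u_low ** 2)         # low-band kinetic energy
ke_high = 0.5 * np.mean((u - u_low) ** 2)  # high-band via the residual
```

A sharper band split would use a proper band-pass filter (e.g. from scipy.signal), but the point stands: only running filters and accumulated squares are needed, which also streams well through Dask.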
I also want to thank @apatlpo for the organization!
I think he and @lesommer tried to take some notes of the exchanges. We need to work together on a blog post summarizing the key points of our discussion. Here is a first compilation of what I can come up with:
This is probably because the major cloud providers are US-based and we don't want them to take our data, but also because we don't have organizations big enough at the national level (like NASA, for instance) to negotiate an interesting partnership.
This leads to some actions we need to take:
They're just doing extremely cool things with the cloud and Pangeo; we've got a lot to learn from them! 👏 @jacobtomlinson and @DPeterK!
Thanks for the writeup @guillaumeeb !
[Zarr] makes it also really performant on an HPC system, an order of magnitude faster than NetCDF.
I am surprised by this. I would be curious to learn more about the challenges of HDF5/NetCDF on HPC. I would have expected it to be well tuned. I wonder if there are things we can do to make HDF5/NetCDF more efficient when used with Dask/Xarray/Iris.
Improving dask-jobqueue's ease of use, as @willirath mentioned. This concerns both the library itself and how to use it.
I am curious about the kinds of improvements that this group would find most valuable.
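For readers new to the library, a minimal dask-jobqueue sketch, assuming a SLURM-based HPC system; the queue name and resource numbers are placeholders, not recommendations from this thread:

```python
from dask.distributed import Client
from dask_jobqueue import SLURMCluster

# Hypothetical cluster spec for an HPC site; adapt cores/memory/queue to
# your system. Each "job" is one batch submission running Dask workers.
cluster = SLURMCluster(
    cores=8,
    memory="32GB",
    walltime="01:00:00",
    queue="normal",  # placeholder partition name
)
cluster.scale(jobs=4)      # submit 4 batch jobs' worth of workers
client = Client(cluster)   # computations now run on the jobqueue workers
```

Part of the ease-of-use question is how much of this site-specific boilerplate can live in dask-jobqueue's YAML configuration files instead of in every user's script.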
For the NetCDF vs. Zarr part, here is how I see it, given that I'm really not an expert in the NetCDF file format:
In the end, I consider NetCDF a good archive format, or a good format if you are only interested in a few gigabytes of data stored in a single file that you want to work on from a personal computer.
Zarr, on the contrary, has been designed with object stores and independent chunks in mind, so it is perfect for intensive parallel reads and writes on any storage infrastructure. Just beware of the chunk size!
For dask-jobqueue, I'm biased, as I am one of the maintainers. It would be better if a scientist actually using it tried to answer. But here is what I heard:
I've put my first comment on a Medium page, just so that we can work on it, but I'm not sure this is the best way.
https://medium.com/@g.eynard.bontemps/the-2019-european-pangeo-meeting-437170263123
Maybe we should work on hackmd.io?
During my research on the Zarr format, I found two interesting sources. The first, https://github.com/HumanCellAtlas/table-testing, is a repository containing sources for measuring the performance of different file formats: HDF5, Parquet, Zarr and others. These tests were used by HumanCellAtlas to select the best storage candidate, and they chose Zarr. Raw results are available in the subdirectories of the tasks directory.
An article on the blog http://alimanfoo.github.io/2016/09/21/genotype-compression-benchmark.html lists different tests of Zarr's compression filters: compression ratio, speed, effects of chunk size on compression, etc. A very useful article for choosing the right compression filter.
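To make the "right compression filter" point concrete, here is a small stdlib-plus-numpy sketch (not the benchmark from the article) of why the byte-shuffle filter that appears in such benchmarks helps a generic compressor on smooth numeric data:

```python
import zlib

import numpy as np

# Smoothly varying float64 data: adjacent values share their high-order
# (exponent / top-mantissa) bytes, but in memory those bytes are interleaved
# with the fast-changing low-order bytes.
data = np.linspace(0.0, 1.0, 100_000, dtype="f8")
raw = data.tobytes()

# Byte shuffle: regroup so that byte k of every element is contiguous,
# creating long near-constant runs that deflate compresses well.
shuffled = data.view("u1").reshape(-1, data.itemsize).T.copy().tobytes()

plain_ratio = len(raw) / len(zlib.compress(raw, 6))
shuffled_ratio = len(raw) / len(zlib.compress(shuffled, 6))
```

Zarr exposes the same idea through its Blosc compressor's shuffle option, so the trade-off can be tested per dataset rather than guessed.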
Hi, here is what we (@apatlpo) have as feedback from the IFREMER side, as a basis for creating a kind of guideline for good use of Pangeo on HPC systems. Three potentially important aspects to consider for Pangeo on HPC:
- Filesystem constraints: use Lustre striping (e.g. `lfs setstripe -c 1 zarrfile`) for Zarr directories.
- CPU time and communication time loss.
- Memory load.
Hi all, I've tried to update my first medium draft with all the things said here.
Please give me some feedback, I feel this is close to publication.
cc @fbriol @apatlpo @lesommer @jacobtomlinson @DPeterK @tinaok and others that were not there @rabernat @mrocklin.
Thanks @willirath and @DPeterK for rereading. I've published the post, but I'm not sure how to make it appear under the Pangeo Medium namespace, can you do it @rabernat?
News about the convergence between netCDF and Zarr here: https://www.unidata.ucar.edu/blogs/news/entry/netcdf-and-native-cloud-storage
@guillaumeeb Do you have any news on this?
We've already started to talk (thanks to ECMWF) with Eumetsat "institutional" DIAS. They seem to be eager to work with us, so we will need to find some bandwidth. I'm personally also talking with Onda DIAS.
If it's possible I would like to know the "state of the art" of this discussion.
About ECMWF Wekeo DIAS, see this comment.
About Onda DIAS, I'm currently in touch with the OVH cloud provider, and I'm hoping to start something in September or October. It seems OVH's offering has everything Pangeo needs, and the Jupyter folks recently started to use it for MyBinder. I'll put more information here as soon as I have some!
Sadly, we did not (or did we?) keep minutes as we went. So this is an attempt to collect possible points to follow up on that have arisen during the meeting in Paris.
Dask-Jobqueue related stuff from the sprint today.
We've talked a lot about the fact that historically grown ways of doing things are often not ideal for parallel computing. One example: coordinate axes stored in decreasing order, like `[27.2, 27.1, 27.0, ...]`.
We've touched on the question of how best to configure netCDF4 (chunking, deflation, record dims, classic vs. full nc4, shuffling, ...) for performance. My takeaway: there's a lot you can do wrong, but we might not know the optimum yet.
Zarr vs. netCDF on HPC: There might be a lot of distributed knowledge in Pangeo. Somebody should write this down™.
Please add more items?