nasa / zarr-eosdis-store

Zarr data store for efficiently accessing NetCDF4 data from NASA’s Earth observing system in the cloud using the Zarr Python library

collaboration? #9

Open martindurant opened 3 years ago

martindurant commented 3 years ago

Hi there! I recently heard about this project via a colleague watching your presentation at ESIP.

I am the lead developer of fsspec and the fsspec-reference-maker. You might find the following articles interesting:

After only a brief perusal of this repo, the following comparisons come to mind.

Things we can learn from you (there are probably more!)

Things we do that you might find interesting in our project

Let's work together and not invent more wheels!

bilts commented 3 years ago

Hi, Martin! Sorry I'm late getting back to you and thanks for chiming in.

I've been keeping an eye on fsspec-reference-maker, which I also saw at ESIP, and trying to figure out what to do with it. My strong preference is to use something standard rather than something EOSDIS-specific, so it has me really excited.

Right now we already generate DMR++ for OPeNDAP support, so one of the purposes of this work was to take what we're already doing and build on it, rather than getting new metadata generated. The latter is possible, but trickier.

To that end, how stable is the fsspec format for storing chunk offsets, and how is it governed? (i.e. how often would we need to re-generate the metadata once we've done it once)

Thanks!

martindurant commented 3 years ago

To answer your questions

how stable is the fsspec format

Our intent is to make it fully backward compatible, adding only new features. In the readme, you will see that we already had a Version 0 (before the spec was written down) and Version 1.

how often would we need to re-generate the metadata once we've done it once

If the data does not change, you do not need to change the metadata. A common pattern, though, might be to generate metadata for individual files and save it, and then later create various aggregated views of those files as requirements change. This would be relatively cheap. Also, since the metadata is fairly simple JSON, it could be readily edited if, for example, the file path naming of the originals were to change.
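To make the "readily edited JSON" point concrete, here is a minimal sketch of what a Version 1 reference set and a path rename might look like. The bucket names, variable name, and offsets are made up for illustration, and `rewrite_paths` is a hypothetical helper, not part of any library; consult the fsspec-reference-maker spec for the authoritative format.

```python
# Hypothetical Version 1 reference set for a single NetCDF4 file.
# Each chunk key maps to [url, offset, length]; all values here are
# illustrative, not taken from a real granule.
refs = {
    "version": 1,
    "refs": {
        ".zgroup": "{\"zarr_format\": 2}",
        "temp/0.0": ["s3://old-bucket/granule1.nc", 8192, 65536],
        "temp/0.1": ["s3://old-bucket/granule1.nc", 73728, 65536],
    },
}


def rewrite_paths(refs, old_prefix, new_prefix):
    """Return a copy of the reference set with file URLs re-prefixed.

    Because the metadata is plain JSON, a bulk rename of the original
    files is just a string rewrite over the chunk entries.
    """
    out = {"version": refs["version"], "refs": {}}
    for key, val in refs["refs"].items():
        if isinstance(val, list) and val and isinstance(val[0], str):
            val = [val[0].replace(old_prefix, new_prefix, 1), *val[1:]]
        out["refs"][key] = val
    return out


updated = rewrite_paths(refs, "s3://old-bucket/", "s3://new-bucket/")
```

Inline metadata entries (like `.zgroup`) pass through untouched; only the chunk entries, which are `[url, offset, length]` lists, get their URL rewritten.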

An additional note on a point from my earlier comment:

combining adjacent reads (like corot project); this is tractable, but less important given that our reads are concurrent

In a separate context, fetching exactly those byte ranges of a parquet dataset that will actually be required to serve a request, we are facing this problem independently. Any algorithm you have that considers a set of byte ranges, combines them using some heuristic (overlaps, gaps, expected latency), and then re-extracts the original ranges after the fetch would be massively appreciated!

bilts commented 2 years ago

We have three methods that detect byte ranges, merge adjacent ranges within n bytes of each other, and split the bytes back apart after retrieval: https://github.com/nasa/zarr-eosdis-store/blob/main/eosdis_store/stores.py#L170-L247

I'm sure they could be optimized more, but they've worked for the purpose of this library.
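The merge-then-re-slice pattern both projects describe can be sketched in a few lines. This is not the code from either library, just a minimal, self-contained illustration; the `max_gap` threshold and the `(start, end)` tuple representation are arbitrary choices for the example.

```python
def merge_ranges(ranges, max_gap=4096):
    """Coalesce (start, end) byte ranges whose gap is <= max_gap.

    Returns the merged ranges plus, for each original range, the index
    of the merged range that contains it, so the fetched blocks can be
    re-sliced afterwards.
    """
    order = sorted(range(len(ranges)), key=lambda i: ranges[i][0])
    merged, owner = [], [0] * len(ranges)
    for i in order:
        start, end = ranges[i]
        if merged and start - merged[-1][1] <= max_gap:
            # Close enough to the previous merged range: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
        owner[i] = len(merged) - 1
    return [tuple(m) for m in merged], owner


def extract(ranges, merged, owner, blocks):
    """Re-slice fetched merged blocks back into the original ranges."""
    out = []
    for i, (start, end) in enumerate(ranges):
        mstart, _ = merged[owner[i]]
        out.append(blocks[owner[i]][start - mstart:end - mstart])
    return out
```

With a gap threshold of 4, requesting `(0, 4)`, `(6, 10)`, and `(100, 104)` would coalesce the first two into a single fetch of `(0, 10)` while leaving the distant third range as its own request; `extract` then recovers the three original slices from the two fetched blocks.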

martindurant commented 2 years ago

An update on our end, we have the following bytes-range merge code in fsspec: https://github.com/fsspec/filesystem_spec/blob/642e94aac03b4fec9d438e32f5988bbf4d292184/fsspec/utils.py#L488 (being used only by the parquet route at the moment).

I'll have a look at the code you link to, to see if it can be adapted to our use case (cc @rjzamora )