nasa / zarr-eosdis-store

Zarr data store for efficiently accessing NetCDF4 data from NASA’s Earth observing system in the cloud using the Zarr Python library

collaboration? #9

Open martindurant opened 3 years ago

martindurant commented 3 years ago

Hi there! I recently heard about this project via a colleague watching your presentation at ESIP.

I am the lead developer of fsspec and the fsspec-reference-maker. You might find the following articles interesting:

After only a brief perusal of this repo, the following comparisons come to mind.

Things we can learn from you (there are probably more!)

Things we do that you might find interesting in our project

Let's work together and not invent more wheels!

bilts commented 3 years ago

Hi, Martin! Sorry I'm late getting back to you and thanks for chiming in.

I've been keeping an eye on fsspec-reference-maker, which I also saw at ESIP, and trying to figure out what to do with it. My strong preference is to use something standard rather than something EOSDIS-specific, so it has me really excited.

Right now we already generate DMR++ for OPeNDAP support, so one of the purposes of this work was to take what we're already doing and build on it, rather than getting new metadata generated. The latter is possible, but trickier.

To that end, how stable is the fsspec format for storing chunk offsets, and how is it governed? (i.e. how often would we need to re-generate the metadata once we've done it once)

Thanks!

martindurant commented 3 years ago

To answer your questions

how stable is the fsspec format

Our intent is to make it fully backward compatible, adding only new features. In the readme, you will see that we already had a Version 0 (before the spec was written down) and Version 1.

how often would we need to re-generate the metadata once we've done it once

If the data does not change, you do not need to change the metadata. A common pattern, though, might be to generate metadata for individual files and save it, and then later create various aggregated views of those files as requirements change. This would be relatively cheap. Also, since the metadata is fairly simple JSON, it could be readily edited if, for example, the file path naming of the originals were to change.
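To make the "readily edited JSON" point concrete, here is a minimal sketch of what a Version 1 reference set and a path rename might look like. The bucket names, variable name, and offsets are made up for illustration, and `rewrite_paths` is a hypothetical helper, not part of any library; consult the fsspec-reference-maker spec for the authoritative format.

```python
# Hypothetical Version 1 reference set for a single NetCDF4 file.
# Each chunk key maps to [url, offset, length]; all values here are
# illustrative, not taken from a real granule.
refs = {
    "version": 1,
    "refs": {
        ".zgroup": "{\"zarr_format\": 2}",
        "temp/0.0": ["s3://old-bucket/granule1.nc", 8192, 65536],
        "temp/0.1": ["s3://old-bucket/granule1.nc", 73728, 65536],
    },
}


def rewrite_paths(refs, old_prefix, new_prefix):
    """Return a copy of the reference set with file URLs re-prefixed.

    Because the metadata is plain JSON, a bulk rename of the original
    files is just a string rewrite over the chunk entries.
    """
    out = {"version": refs["version"], "refs": {}}
    for key, val in refs["refs"].items():
        if isinstance(val, list) and val and isinstance(val[0], str):
            val = [val[0].replace(old_prefix, new_prefix, 1), *val[1:]]
        out["refs"][key] = val
    return out


updated = rewrite_paths(refs, "s3://old-bucket/", "s3://new-bucket/")
```

Inline metadata entries (like `.zgroup`) pass through untouched; only the chunk entries, which are `[url, offset, length]` lists, get their URL rewritten.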

An additional note on a point from my earlier comment:

combining adjacent reads (like corot project); this is tractable, but less important given that our reads are concurrent

In a separate context, fetching exactly those byte ranges of a parquet dataset that will actually be required to serve a request, we are facing this problem independently. Any algorithm you have that considers a set of byte ranges, combines them using some heuristic (overlaps, gaps, expected latency), and then re-extracts the original ranges after the fetch would be massively appreciated!

bilts commented 2 years ago

We have three methods that detect byte ranges, merge adjacent ranges within n bytes of each other, and split the bytes back apart after retrieval: https://github.com/nasa/zarr-eosdis-store/blob/main/eosdis_store/stores.py#L170-L247

I'm sure they could be optimized more, but they've worked for the purpose of this library.
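The merge-then-re-slice pattern both projects describe can be sketched in a few lines. This is not the code from either library, just a minimal, self-contained illustration; the `max_gap` threshold and the `(start, end)` tuple representation are arbitrary choices for the example.

```python
def merge_ranges(ranges, max_gap=4096):
    """Coalesce (start, end) byte ranges whose gap is <= max_gap.

    Returns the merged ranges plus, for each original range, the index
    of the merged range that contains it, so the fetched blocks can be
    re-sliced afterwards.
    """
    order = sorted(range(len(ranges)), key=lambda i: ranges[i][0])
    merged, owner = [], [0] * len(ranges)
    for i in order:
        start, end = ranges[i]
        if merged and start - merged[-1][1] <= max_gap:
            # Close enough to the previous merged range: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
        owner[i] = len(merged) - 1
    return [tuple(m) for m in merged], owner


def extract(ranges, merged, owner, blocks):
    """Re-slice fetched merged blocks back into the original ranges."""
    out = []
    for i, (start, end) in enumerate(ranges):
        mstart, _ = merged[owner[i]]
        out.append(blocks[owner[i]][start - mstart:end - mstart])
    return out
```

With a gap threshold of 4, requesting `(0, 4)`, `(6, 10)`, and `(100, 104)` would coalesce the first two into a single fetch of `(0, 10)` while leaving the distant third range as its own request; `extract` then recovers the three original slices from the two fetched blocks.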

martindurant commented 2 years ago

An update on our end, we have the following bytes-range merge code in fsspec: https://github.com/fsspec/filesystem_spec/blob/642e94aac03b4fec9d438e32f5988bbf4d292184/fsspec/utils.py#L488 (being used only by the parquet route at the moment).

I'll have a look at the code you link to, to see if it can be adapted to our use case (cc @rjzamora )