metadata-only outputs

martindurant commented 3 years ago

The typical workflow discussed in most other issues takes some dataset and transforms it into zarr format for storage, with a number of options for how to go about doing that.

Here I want to make note of an alternative, where the final data product is still loaded from the original source, and the output of the pangeo-forge process is a prescription for how to go about it. The principal use cases for this are:

- data sets that are constantly changing, and so would need to be run repeatedly through the forge with "append" if they were to be made into a single dataset
- original data that is too big to be replicated, where most analysis would require only small sections of it
- making different cuts or views of some very large data set
- cases where information is encoded in the file naming convention of the original

There are two broad categories of data access considered for now:

- Loading binary chunks from the target. This is the idea behind fsspec-reference-maker: we find the binary blocks within some dataset and assign a zarr chunk key to each. The included example is for a single HDF5 file, but the idea could be extended to many files and many formats, so long as the codec is implemented in (or easily added to) numcodecs. This works well when the original binary chunks are large enough. The downside is that the byte offsets have to be extracted from the original, which requires a complete read. Once scanned, the original dataset becomes available to zarr directly, without need for the libraries that did the scanning.
- Loading whole files. This is the idea behind intake_informaticslab. In this case, the files are loaded using the library appropriate for the original data format, which might require temporary local storage (e.g., for grib2, where the code is in C and needs a file handle). However, the set of data files is still expressed as zarr, one file per chunk, with a custom zarr storage driver. In the examples, the mapping of zarr keys to file locations is specific to each dataset and depends on the naming conventions. This access pattern requires an understanding of the target file layout, but no scanning of the files. It does need the zarr storage layer to be distributed, to do the key mapping and temporary file download, but the filename mapping could be expressed declaratively rather than in code. This pattern would be required when binary blocks in the original cannot be decoded directly, or when there are many small blocks in each file, so that direct access would be very inefficient.
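
To make the two access patterns concrete, here is a minimal sketch using only the Python standard library. It is not the actual code of fsspec-reference-maker or intake_informaticslab. The names `ReferenceStore`, `chunk_key_to_path`, `TEMPLATE` and `TIMES`, the `[path, offset, length]` reference layout, and the bucket URL are all assumptions made up for illustration. The first part serves zarr-style keys from byte ranges inside an existing file; the second maps a chunk key to a whole file through a declarative filename template.

```python
import json
import os
import tempfile
from collections.abc import Mapping


class ReferenceStore(Mapping):
    """Pattern 1 (hypothetical sketch): read-only key -> bytes mapping.

    Each value in `refs` is either a string (small metadata stored inline)
    or a [path, offset, length] triple pointing into an original file.
    """

    def __init__(self, refs):
        self.refs = refs

    def __getitem__(self, key):
        ref = self.refs[key]
        if isinstance(ref, str):
            return ref.encode()
        path, offset, length = ref
        with open(path, "rb") as f:
            f.seek(offset)          # jump straight to the binary block
            return f.read(length)   # read only that block

    def __iter__(self):
        return iter(self.refs)

    def __len__(self):
        return len(self.refs)


def chunk_key_to_path(key, template, times):
    """Pattern 2 (hypothetical sketch): one whole file per chunk.

    Maps a chunk key such as "temperature/1.0.0" to a file location using
    the leading chunk index as a position along the time axis. Metadata
    keys (".zarray" and so on) would need separate handling.
    """
    var, index = key.split("/")
    t = int(index.split(".")[0])
    return template.format(var=var, time=times[t])


# Pattern 1 demo: a fake "original" file with a header and two raw chunks.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "original.bin")
    with open(path, "wb") as f:
        f.write(b"HEADER" + b"chunk-0" + b"chunk-1")
    store = ReferenceStore({
        ".zgroup": '{"zarr_format": 2}',
        "data/0": [path, 6, 7],    # offsets found by scanning the file once
        "data/1": [path, 13, 7],
    })
    first = store["data/0"]
    second = store["data/1"]
    meta = json.loads(store[".zgroup"])

# Pattern 2 demo: the naming convention is data, not code.
TEMPLATE = "s3://example-bucket/{var}/{time}.grib2"  # made-up location
TIMES = ["20200101T00", "20200101T06", "20200101T12"]
location = chunk_key_to_path("temperature/1.0.0", TEMPLATE, TIMES)
```

In the first pattern the reference dictionary is the whole output of the recipe: it can be serialised as JSON and shipped without the library that did the scan. In the second, only the template and the list of coordinate values need to be stored, but a reader still needs the format-specific library to decode each file.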

rabernat commented 3 years ago

I think this is a great pattern we should definitely work to support! 👍 These recipes will generally be a bit cheaper to run because they don't have to copy much data.

Please feel free to take a stab at implementing such a recipe class. It would be good to have an issue in staged-recipes to point to a specific dataset we can use as a user story.

martindurant commented 3 years ago

cc @tam203 - you might be interested in eventually encoding your datasets into pangeo-forge recipes or, more simply, including your existing catalogue prescriptions. I have not yet had the chance to look through the code of Hypothetic in enough detail to have a good model for myself of the components (filename convention versus zarr chunk key; zarr storage; intake driver; download/cache layer).

@rabernat The simplest case to encode would be the existing example in https://github.com/intake/fsspec-reference-maker/blob/main/examples/intake_catalog.yml , and the reference file specified therein. The recipe would essentially repeat that scan for the latest capabilities of fsspec-reference-maker as it evolves.

pangeo-forge / pangeo-forge-recipes

metadata-only outputs #70