The concept of general N-D labeled datasets is gaining more and more traction within the community. As a user of the datasets from the HALO-DB, I want to access the data as an N-D labeled dataset. The obtained datasets should follow a widely accepted convention (e.g. the CF-Conventions).
Note on N-D labeled datasets. The basic idea is to collect a couple of multidimensional arrays into one dataset. The dimensions of the arrays are labelled and can be shared between arrays within a dataset. Datasets and arrays additionally carry attributes which provide further information. Conventions like the CF-Conventions specify how a dataset should be interpreted.
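To make this concrete, here is a minimal sketch using xarray (one of the libraries listed below); all names and values are illustrative:

```python
import numpy as np
import xarray as xr

# A minimal dataset: two arrays sharing the labelled "time" dimension,
# plus attributes on the arrays and on the dataset (names illustrative).
time = np.array(["2020-01-01", "2020-01-02", "2020-01-03"], dtype="datetime64[ns]")
ds = xr.Dataset(
    data_vars={
        "temperature": ("time", [271.3, 272.1, 270.8], {"units": "K"}),
        "pressure": ("time", [1008.0, 1012.5, 1009.9], {"units": "hPa"}),
    },
    coords={"time": time},
    attrs={"title": "Example N-D labeled dataset"},
)
print(ds)  # shows dimensions, coordinates, variables and attributes
```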
N-D labeled datasets show up in various forms:
A long-established form is netCDF, which started out primarily as a storage format but has since evolved into a frontend for other storage formats (currently, netCDF supports HDF5 and opendap, and is working on support for zarr)
HDF5 does not fit this concept of N-D labeled datasets exactly (e.g. dimensions are not shared between arrays), but as netCDF4 uses it as a backend (in a reduced form), it is worth mentioning
opendap is a way of accessing N-D labeled datasets over HTTP. It does so by offering an index of the dataset in a short textual representation and specifying how a user who knows the index can request arbitrary subsets of the actual data from the server
zarr is a storage format which stores the index of the contained dataset in JSON files separate from the data files. Data can be stored in chunks, so a user who knows the index may decide to access only some of the chunk files
xarray is a Python library which handles N-D labeled datasets in memory and provides unified access to all the mentioned backends (see the sketch after this list)
iris is another Python library, similar to xarray, which puts a stronger focus on handling the CF-Conventions.
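As a minimal illustration of that unified access, the same xarray calls can read all three backends; the file paths and URL here are hypothetical:

```python
import xarray as xr

# One client API, three storage/transport variants (all locations hypothetical).
ds_nc   = xr.open_dataset("flight_data.nc")                       # netCDF file
ds_dap  = xr.open_dataset("https://example.com/dap/flight_data")  # opendap endpoint
ds_zarr = xr.open_zarr("flight_data.zarr")                        # zarr store
```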
Note that the CF-Conventions may be applied equally well to all of these formats. There are small and sometimes subtle differences between the various forms, but most datasets can be converted without loss between them. And there's value in deliberately crafting a dataset in a way which allows transformations between these formats, as they serve quite distinct purposes.
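A sketch of what such deliberately crafted CF metadata looks like: it is plain attributes, so it survives conversion between the formats unchanged (the variable names follow the CF standard name table, the file names are hypothetical):

```python
import numpy as np
import xarray as xr

# CF metadata is plain attributes, so it is independent of the storage format.
time = np.array(["2020-01-01", "2020-01-02"], dtype="datetime64[ns]")
ds = xr.Dataset(
    {"ta": ("time", [271.3, 272.1], {"units": "K", "standard_name": "air_temperature"})},
    coords={"time": ("time", time, {"standard_name": "time"})},
    attrs={"Conventions": "CF-1.8"},
)
ds.to_netcdf("example.nc")   # the very same dataset...
ds.to_zarr("example.zarr")   # ...can be written to either format without loss
```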
netCDF (with either the classic or the HDF5 backend) is good for storing datasets (at least for short- and mid-term storage) and for moving them around in their entirety: a dataset is just a single file. It can also be modified relatively easily, as sketched below.
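A sketch of that workflow, assuming xarray with a netCDF backend is available; file and variable names are hypothetical:

```python
import xarray as xr

# Store a dataset as a single netCDF file (hypothetical names throughout).
ds = xr.Dataset({"temperature": ("time", [271.3, 272.1, 270.8], {"units": "K"})})
ds.to_netcdf("flight_data.nc")

# Modifying is easy too, e.g. appending another variable to the same file.
extra = xr.Dataset({"pressure": ("time", [1008.0, 1012.5, 1009.9], {"units": "hPa"})})
extra.to_netcdf("flight_data.nc", mode="a")
```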
opendap only covers the transport of data, so every opendap implementation needs to resort to another form for data at rest. The main advantage of opendap is that a user of the dataset can communicate exactly the required subset of the data back to the server, so that only the required parts of the original dataset have to be transferred. As the format of data at rest is not specified, the server can be backed by almost anything, and it is even possible to create a server which computes data (subsets) on demand.
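A sketch of that interaction from the client side, assuming an xarray client and a hypothetical opendap endpoint:

```python
import xarray as xr

# Opening an opendap URL only transfers the index, not the data
# (the URL is hypothetical).
ds = xr.open_dataset("https://example.com/dap/halo/flight_0042")

# The selection is communicated back to the server, so only this
# subset of the original dataset crosses the network.
subset = ds["temperature"].sel(time=slice("2020-01-01", "2020-01-02"))
values = subset.values  # the actual transfer happens here
```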
zarr is naturally stored as a folder structure with many files (but can also be packed into e.g. a zip file). Thus it is good for data at rest, but not so much for sending around in its entirety. The main advantage of zarr is that it suits the concept of cloud storage very well: all the individual files can be served separately over HTTP, and a client can choose which chunks are relevant for the subset to be analyzed. And since it's just a bunch of files, a server can be built on top of commonly used web servers, which dramatically improves access times. The downside of zarr is that the data must be stored in this format in advance.
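A sketch of the chunked write and partial read, assuming xarray together with the dask and zarr packages; names are illustrative:

```python
import xarray as xr

# Write a dataset as a chunked zarr folder structure (illustrative names).
ds = xr.Dataset(
    {"temperature": (("time", "altitude"), [[271.3, 250.1], [272.1, 251.0]], {"units": "K"})}
)
ds.chunk({"time": 1}).to_zarr("flight_data.zarr")

# A reader consults the JSON index and fetches only the chunk files
# needed for the requested subset; other chunks are never read.
sub = xr.open_zarr("flight_data.zarr")["temperature"].isel(time=0).load()
```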
Client libraries are particularly important, as a user usually doesn't care which of the storage variants is served. The fact that netCDF plans to support all variants will help in this regard.