stac-extensions / datacube

Datacube-related metadata to describe dimensions and variables.
Apache License 2.0

Enriching datacube STAC items with more array metadata #18

Open ghidalgo3 opened 3 months ago

ghidalgo3 commented 3 months ago

Hello, I am interested in expanding the datacube STAC extension to support more multidimensional array metadata for assets, particularly array metadata found in NetCDF, HDF5, and GRIB2 files. I think I'm caught up on the great discussions of the past:

And the STAC catalog items that I've been working with are all hosted on Microsoft's Planetary Computer platform, specifically:

For context, my goal is to one day be able to do something like this with Xarray:

```python
>>> import xarray as xr
>>> items = stac_catalog.search(...)
>>> vds = xr.open_mfdataset(items, engine="stac")  # This call should do no I/O!
>>> vds
<xarray.Dataset> Size: 1GB
Dimensions:  (lat: 600, lon: 1440, time: 360)
Coordinates:
  * lat      (lat) float64 5kB -59.88 -59.62 -59.38 -59.12 ... 89.38 89.62 89.88
  * lon      (lon) float64 12kB 0.125 0.375 0.625 0.875 ... 359.4 359.6 359.9
  * time     (time) float64 3kB 3.6e+04 3.6e+04 3.6e+04 ... 3.636e+04 3.636e+04
Data variables:
    pr       (time, lat, lon) float32 1GB ManifestArray<shape=(360, 600, 1440...
>>> vds.pr.sum()  # Here the I/O runs
42.0
```

In that example, assume that the STAC items returned by the search contain assets that are the files themselves. I don't want to actually read the asset; I want the STAC item to contain enough information to create a manipulable dataset that Xarray understands. Reading comes after searching, merging, filtering, and projecting away the variables I'm not interested in.


This proposal is heavily based on Zarr v3, though I believe any multidimensional array handling system will want to know the same information.

I propose the following additional properties, applying only to cube:variables:

| Field Name | Type | Description |
| ---------- | ---- | ----------- |
| data_type | string | numpy-parseable data type |
| chunk_shape | [number] | The size of a chunk, by element count |
| fill_value | number\|string\|null | Needed to handle sparse arrays |
| dimensions | [string] | (optional) The subset of cube:dimensions that index this variable. If not set, all dimensions index this variable. This may happen with single GRIB2 files that contain multiple datacubes. |
| codecs | [object] | An ordered list of codec configurations |

A new property that applies to either cube:variables or cube:dimensions:

| Field Name | Type | Description |
| ---------- | ---- | ----------- |
| attrs | object | Key-value attributes lifted from the original source file. |
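
For concreteness, here is a sketch of how a cube:variables entry for the pr variable from the Xarray example above could look with these fields alongside the existing type and dimensions fields. The chunk shape, fill value, codec names, and attrs values are purely illustrative, not drawn from any real dataset:

```json
{
  "cube:variables": {
    "pr": {
      "type": "data",
      "dimensions": ["time", "lat", "lon"],
      "data_type": "float32",
      "chunk_shape": [120, 600, 1440],
      "fill_value": "NaN",
      "codecs": [
        {"name": "bytes", "configuration": {"endian": "little"}},
        {"name": "zstd", "configuration": {"level": 5}}
      ],
      "attrs": {
        "units": "kg m-2 s-1",
        "long_name": "precipitation_flux"
      }
    }
  }
}
```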

In the previous discussion on this topic (https://github.com/stac-extensions/datacube/issues/8), a suggestion was made to use the files extension to store chunk metadata, but I don't think that extension is appropriate for this purpose. Similarly, I don't think the Bands RFC (https://github.com/radiantearth/stac-spec/pull/1254) addresses this problem; it is solving something entirely different.

CC @TomAugspurger: we can handle chunk manifests later; they are ultimately just assets. Similarly, coordinate transforms are separate and probably better left for GeoZarr to standardize.

I'd like to know your thoughts on this proposal, or whether this is perhaps something worth putting into a hypothetical Zarr extension instead. IMO, the only thing that is very Zarr-specific is the codecs property; everything else maps well onto the underlying source files (and even then, the files themselves define codecs too, though they may not call them that).

TomAugspurger commented 3 months ago

Thanks. Having all the information needed to construct a Dataset from a STAC item / list of items would be great.

Some comments on the proposed fields:

  1. The raster extension defines a list of dtypes at https://github.com/stac-extensions/raster?tab=readme-ov-file#data-types. In general, the STAC metadata should probably match values used in STAC extensions, and then particular applications (like this xarray STAC engine) can map from the STAC names to the names they need (NumPy dtypes).
  2. For chunk_shape, there is a proposed ZEP for variable chunking: https://zarr.dev/zeps/draft/ZEP0003.html. Instead of a list[int] with a length equal to the number of dimensions, you give a list[list[int]], where the number of inner lists matches the number of dimensions and the length of each inner list is the number of chunks along that dimension (a sketch of the two forms follows this list). IMO a list[number] is fine and we can generalize if/when that ZEP is accepted.
  3. For the "optional" part of dimensions, do you have a recommendation for how to interpret dimensions being None? Does that mean all the dims in cube:dimensions apply?
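
As a rough illustration of the two forms for the (time: 360, lat: 600, lon: 1440) variable above, assuming a hypothetical chunk size of 120 along time (the second key is only a label for comparison, not a proposed field name):

```json
{
  "chunk_shape": [120, 600, 1440],
  "chunk_shape_zep0003_style": [[120, 120, 120], [600], [1440]]
}
```

Three inner lists for three dimensions; the lengths (3, 1, 1) are the chunk counts along time, lat, and lon.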

ghidalgo3 commented 2 months ago

  1. Agreed, we should re-use the numeric datatypes from the raster extension.
  2. I suppose it doesn't add any complexity to the extension to specify that chunk_shape could be either a list[int] or a list[list[int]]. The first case will be most common, but if/when that ZEP is accepted, the extension will already support variable chunking.
  3. Actually I lost my reasoning for making dimensions optional. The current spec says:

> REQUIRED. The dimensions of the variable. This should refer to keys in the cube:dimensions object or be an empty list if the variable has no dimensions.

Maybe I was trying to avoid a possible ambiguity about which cube:dimensions is meant: the asset-level dimensions or the item-level dimensions? But until I can regain that thread of thought, I take back any changes to dimensions.
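
For reference, the ambiguity could arise in an item shaped roughly like the abridged sketch below, where both the item properties and an asset carry cube:dimensions, so a variable's dimensions list could plausibly refer to either. Required dimension fields such as extent are omitted, and the href is hypothetical:

```json
{
  "type": "Feature",
  "properties": {
    "cube:dimensions": {
      "time": {"type": "temporal"},
      "lat": {"type": "spatial", "axis": "y"},
      "lon": {"type": "spatial", "axis": "x"}
    }
  },
  "assets": {
    "data": {
      "href": "https://example.com/data.nc",
      "cube:dimensions": {
        "time": {"type": "temporal"},
        "lat": {"type": "spatial", "axis": "y"},
        "lon": {"type": "spatial", "axis": "x"}
      },
      "cube:variables": {
        "pr": {"type": "data", "dimensions": ["time", "lat", "lon"]}
      }
    }
  }
}
```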