ghidalgo3 opened this issue 3 months ago
Thanks. Having all the information needed to construct a Dataset from a STAC item / list of items would be great.

Some comments on the proposed fields:

- The `raster` extension defines a list of dtypes at https://github.com/stac-extensions/raster?tab=readme-ov-file#data-types. In general, the STAC metadata should probably match values used in STAC extensions, and then particular applications (like this xarray STAC engine) can map from the STAC names to the names they need (NumPy dtypes).
- `chunk_shape`: there is a proposed ZEP for variable chunking: https://zarr.dev/zeps/draft/ZEP0003.html. Instead of a `list[int]` with a length equal to the number of dimensions, you give a `list[list[int]]`, where the number of inner lists matches the number of dimensions and the length of each inner list is the number of chunks. IMO a `list[number]` is fine, and we can generalize if/when that ZEP is accepted.
- `dimensions` being None? Does that mean all the dims in `cube:dimensions` apply?
- `raster` extension.
- `chunk_shape` could be either a `list[int]` or a `list[list[int]]`. The first case will be most common, but if/when that ZEP is accepted, the extension will already specify variable chunks.
- Making `dimensions` optional: the current spec says:

  > REQUIRED. The dimensions of the variable. This should refer to keys in the cube:dimensions object or be an empty list if the variable has no dimensions.

  Maybe I was trying to avoid a possible ambiguity regarding which `cube:dimensions` applies, the asset-level dimensions or the item-level dimensions? But until I can regain that thread of thought, I take back any changes to `dimensions`.
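The two `chunk_shape` forms discussed above could be reconciled by a consumer along these lines (a hypothetical sketch, not part of the proposal; `normalize_chunks` and its argument names are invented):

```python
# Sketch of how a reader might accept both the simple list[int] form and the
# ZEP0003-style list[list[int]] form of `chunk_shape`, normalizing either one
# into an explicit list of chunk sizes per dimension. `shape` is the
# variable's full shape, derived from `cube:dimensions`.

def normalize_chunks(chunk_shape, shape):
    """Return one list of chunk sizes per dimension."""
    if len(chunk_shape) != len(shape):
        raise ValueError("chunk_shape must have one entry per dimension")
    normalized = []
    for spec, size in zip(chunk_shape, shape):
        if isinstance(spec, list):
            # ZEP0003 variable chunking: the inner list enumerates every chunk.
            if sum(spec) != size:
                raise ValueError("variable chunks must sum to the dimension size")
            normalized.append(spec)
        else:
            # Regular chunking: full chunks plus a possibly smaller remainder.
            full, rem = divmod(size, spec)
            normalized.append([spec] * full + ([rem] if rem else []))
    return normalized

# A 100-element dimension in 30-element chunks -> [[30, 30, 30, 10]]
print(normalize_chunks([30], [100]))
```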
Hello, I am interested in expanding the datacube STAC extension to support more multidimensional array metadata for assets, particularly array metadata found in NetCDF, HDF5, and GRIB2 files. I think I'm caught up on the great discussions of the past:
And the STAC catalog items that I've been working with are all hosted on Microsoft's Planetary Computer platform, specifically:
For context, my goal is to one day be able to do something like this with Xarray:
In that example, assume that the STAC items returned in the search contain `assets` which are the files themselves. I don't want to actually read the assets; I want the STAC item to contain enough information to create a manipulable dataset that Xarray understands. Reading comes after searching, merging, filtering, and projecting away the variables I'm not interested in.

This proposal is heavily based on Zarr v3, though I believe any multidimensional array handling system will care about the same information.
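To make that "read last" workflow concrete, here is a hypothetical, self-contained sketch (not from the proposal itself): the item dict below carries the fields proposed here, with invented variable names and sizes, and a made-up `project` helper drops unwanted variables using metadata alone, before any asset bytes are read.

```python
# A STAC item fragment carrying the proposed cube:variables fields.
# All names, sizes, and values are illustrative.
item = {
    "cube:dimensions": {
        "time": {"type": "temporal"},
        "lat": {"type": "spatial", "axis": "y"},
        "lon": {"type": "spatial", "axis": "x"},
    },
    "cube:variables": {
        "temperature": {
            "dimensions": ["time", "lat", "lon"],
            "data_type": "float32",
            "chunk_shape": [24, 128, 128],
            "fill_value": -9999,
        },
        "pressure": {
            "dimensions": ["time", "lat", "lon"],
            "data_type": "float64",
            "chunk_shape": [24, 128, 128],
            "fill_value": None,
        },
    },
}

def project(item, keep):
    """Keep only the named variables, using STAC metadata alone (no I/O)."""
    return {
        name: meta
        for name, meta in item["cube:variables"].items()
        if name in keep
    }

# Everything needed to plan the read (dims, dtype, chunks) survives projection.
plan = project(item, keep={"temperature"})
```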
I propose the following additional properties on `cube:variables` only:

| Field Name | Type | Description |
| ---------- | ---- | ----------- |
| `data_type` | string | A `numpy`-parseable datatype |
| `chunk_shape` | \[number] | The chunk shape of the variable, one entry per dimension |
| `fill_value` | number \| string \| null | The value used to mark missing data |
| `dimensions` | \[string] | The `cube:dimensions` that index this variable. If not set, all dimensions index this variable. This may happen with single GRIB2 files that contain multiple datacubes. |
| `codecs` | \[object] | The codecs used to encode the variable's data |
And a new property that applies to either `cube:variables` or `cube:dimensions`:

| Field Name | Type | Description |
| ---------- | ---- | ----------- |
| `attrs` | object | Arbitrary additional attributes of the variable or dimension |
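For illustration, a `cube:variables` entry carrying all of the proposed fields might look like this (the variable name, sizes, and codec object are made up; the codec shape is deliberately Zarr-v3-flavored):

```json
"cube:variables": {
  "temperature": {
    "type": "data",
    "dimensions": ["time", "lat", "lon"],
    "data_type": "float32",
    "chunk_shape": [24, 128, 128],
    "fill_value": -9999,
    "codecs": [{"name": "blosc", "configuration": {"cname": "zstd"}}],
    "attrs": {"units": "K"}
  }
}
```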
In the previous discussion on this topic (https://github.com/stac-extensions/datacube/issues/8), a suggestion was made to use the files extension to store chunk metadata, but I don't think that extension is appropriate for this purpose. Similarly, I don't think the Bands RFC (https://github.com/radiantearth/stac-spec/pull/1254) addresses this problem; it is solving something entirely different.
CC @TomAugspurger. We can handle chunk manifests later; they are ultimately just assets. Similarly, coordinate transforms are separate and probably better left for GeoZarr to standardize.
I'd like to know your thoughts on this proposal, or whether this is something worth putting into a hypothetical Zarr extension instead. IMO, the only thing that is very Zarr-specific is the `codecs` property; everything else maps well onto the underlying source files (and even then, the files themselves define codecs too, though they may not call them that).