Open RyanAhola opened 1 month ago
Below is the current write-up from the OGC GitLab. It is based on, and is a short version of, https://openeo.org/documentation/1.0/datacubes.html; if clarifications are needed, that page is the best source to check.
Datacubes are multi-dimensional arrays with additional information about their dimensionality. Datacubes can provide a nice and tidy interface for spatiotemporal data as well as for the operations you may want to execute on them. Although arrays are close to raster data, datacubes can also hold vector data. GeoDataCubes (GDC) are a special case of datacubes in that they have one or more spatial dimensions, e.g. `x` and `y`. GeoDataCubes for raster data often consist of the dimensions x, y, time and bands. Sometimes they also have multiple temporal dimensions. GeoDataCubes for vector data often consist of geometries, time and a variable. Generally, datacubes can consist of any combination of dimensions - the dimensions are unrestricted. The spatial dimensions of GeoDataCubes may get removed during processing.
The following additional information is usually available for datacubes:
This additional information could be provided upfront via metadata.
A dimension refers to a certain axis of a datacube. This includes all variables (e.g. bands), which are represented as dimensions. An exemplary raster datacube could have the spatial dimensions `x` and `y`, and the temporal dimension `t`. Furthermore, it could have a `bands` dimension, extending into the realm of what kind of information is contained in the cube.
The following properties are usually available for dimensions:
Specific implementations of datacubes may prescribe details such as sorting orders or representations of labels. For example, some implementations may always sort temporal labels in their inherent order and encode them in an ISO8601 compliant way.
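As a rough illustration of the dimension concept above, the following Python sketch models a raster datacube with the dimensions x, y, t and bands. The class names `Dimension` and `DataCube`, the `kind` attribute, and all labels and values are hypothetical and chosen only for this example; they are not taken from any GDC specification or implementation.

```python
from dataclasses import dataclass
from itertools import product


@dataclass
class Dimension:
    name: str     # e.g. "x", "t", "bands"
    kind: str     # assumed categories: "spatial", "temporal", "bands", "other"
    labels: list  # one label per position along the axis


@dataclass
class DataCube:
    dimensions: list  # ordered list of Dimension objects
    values: dict      # maps a tuple of labels (one per dimension) to a scalar

    def shape(self):
        return tuple(len(d.labels) for d in self.dimensions)

    def spatial_dimensions(self):
        return [d.name for d in self.dimensions if d.kind == "spatial"]


# Hypothetical labels; temporal labels are ISO 8601 date strings.
dims = [
    Dimension("x", "spatial", [500000.0, 500010.0]),
    Dimension("y", "spatial", [5650000.0, 5650010.0]),
    Dimension("t", "temporal", ["2024-06-01", "2024-06-11", "2024-06-21"]),
    Dimension("bands", "bands", ["red", "nir"]),
]

# Fill every cell with a placeholder value of 0.0.
cells = {key: 0.0 for key in product(*(d.labels for d in dims))}
cube = DataCube(dims, cells)

print(cube.shape())               # (2, 2, 3, 2)
print(cube.spatial_dimensions())  # ['x', 'y']
```

A dictionary keyed by label tuples is used here only to keep the sketch dependency-free; real implementations would typically use a dense array.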
Datacubes contain scalar values (e.g. strings, numbers or boolean values), with all other associated attributes stored in dimensions (e.g. coordinates or timestamps). Attributes such as the CRS or the sensor can also be turned into dimensions. Be advised that in such a case, the uniqueness of pixel coordinates may be affected. While `(x, y)` usually refers to a unique location, that changes to `(x, y, CRS)` when `(x, y)` values are reused in other coordinate reference systems (e.g. two neighboring UTM zones).
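The uniqueness issue can be shown with a small Python sketch. The coordinates and values are made up for illustration; the two EPSG codes refer to the neighboring UTM zones 32N and 33N.

```python
# Two observations that share the same (x, y) numbers but lie in
# neighboring UTM zones (EPSG:32632 and EPSG:32633). Values are invented.
observations = [
    {"x": 500000.0, "y": 5650000.0, "crs": "EPSG:32632", "value": 0.41},
    {"x": 500000.0, "y": 5650000.0, "crs": "EPSG:32633", "value": 0.77},
]

by_xy = {}
by_xy_crs = {}
for obs in observations:
    # Keyed by (x, y) only: the second observation overwrites the first.
    by_xy[(obs["x"], obs["y"])] = obs["value"]
    # Keyed by (x, y, CRS): both observations remain distinct.
    by_xy_crs[(obs["x"], obs["y"], obs["crs"])] = obs["value"]

print(len(by_xy))      # 1
print(len(by_xy_crs))  # 2
```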
A couple of operations are commonly applied to datacubes:
Every operation that returns a subset of the datacube or the complete datacube is considered to be datacube access.
Every operation that is computing new values is considered to be datacube processing.
A coverage is a function which returns values from its range for a direct position within its domain, where the meaning of range and domain follow the usual definitions for a mathematical function. In practice, a data cube is more or less the same as a coverage, depending on the definition of a data cube. The concept of a coverage is agnostic of the mechanisms to generate, observe/measure, store or access data.
The domain of the coverage is made up of all dimensions where the coverage function can return a value (spatial, temporal, pressure levels...). Extra dimensions can be used beyond spatial and temporal, as long as the field values have a homogeneous value along the dimension (e.g., the frequency of hyperspectral data could be considered a dimension).
The individual values of the range can consist of one or more fields, which are the observed/measured properties at each position within the domain. The different fields are not considered a dimension.
The range set is the set of values within the range (the actual values, which can take the form of a multidimensional array in a gridded coverage).
The range type describes what kind of information is contained in each field of the range of the coverage.
The domain set is the description of the domain of the coverage, which in the case of irregular gridded coverages and non-gridded coverages (e.g., point clouds) would contain the set of coordinates where values are available.
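To make the coverage terms above more concrete, here is a minimal Python sketch of a non-gridded coverage. The names `domain_set`, `range_type`, `range_set` and `coverage` mirror the terms in the text, but the structure, field names and all values are assumptions for illustration only, not an encoding defined by any OGC standard.

```python
# Domain set: the direct positions for which values exist
# (here: latitude, longitude, time). All values are invented.
domain_set = [
    (50.0, 7.0, "2024-06-01"),
    (50.0, 7.1, "2024-06-01"),
]

# Range type: describes what each field of the range contains.
range_type = {
    "temperature": {"unit": "K", "type": "float"},
    "humidity": {"unit": "%", "type": "float"},
}

# Range set: the actual values, one record of fields per position.
range_set = [
    {"temperature": 291.4, "humidity": 63.0},
    {"temperature": 290.9, "humidity": 65.5},
]


def coverage(position):
    """Return the range value (a record of fields) at a direct position."""
    return range_set[domain_set.index(position)]


print(coverage((50.0, 7.1, "2024-06-01"))["humidity"])  # 65.5
```

Note that `temperature` and `humidity` are fields of the range here, not labels of a dimension, which is the main structural difference from the `bands` dimension in the datacube description.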
In coverages, there are two types of dimensional subsetting / filtering: slicing, which removes the dimension on which the slicing occurs; and trimming, which preserves the dimensionality of the output coverage.
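The difference between trimming and slicing can be sketched in Python with nested lists. The helper names `trim_t` and `slice_t`, the dimension names and the data are hypothetical and only serve this example.

```python
# A small cube indexed as data[t][y][x]; labels per dimension are invented.
labels = {
    "t": ["2024-06-01", "2024-06-11", "2024-06-21"],
    "y": [0, 1],
    "x": [0, 1],
}
data = [[[t * 100 + y * 10 + x for x in range(2)] for y in range(2)]
        for t in range(3)]


def trim_t(data, labels, low, high):
    """Keep t positions between low and high (inclusive); t stays a dimension."""
    idx = [i for i, lab in enumerate(labels["t"]) if low <= lab <= high]
    new_labels = dict(labels, t=[labels["t"][i] for i in idx])
    return [data[i] for i in idx], new_labels


def slice_t(data, labels, at):
    """Pick a single t position; the t dimension is removed from the result."""
    i = labels["t"].index(at)
    new_labels = {k: v for k, v in labels.items() if k != "t"}
    return data[i], new_labels


trimmed, trimmed_labels = trim_t(data, labels, "2024-06-11", "2024-06-11")
sliced, sliced_labels = slice_t(data, labels, "2024-06-11")

print(list(trimmed_labels))  # ['t', 'y', 'x']
print(list(sliced_labels))   # ['y', 'x']
```

Both calls select the same values, but the trimmed result still has a `t` dimension of length one, whereas the sliced result has only `y` and `x`.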
There is the concept of "range subsetting" (called "field selection" in OGC API - Coverages), which can return a subset of the available fields.
Other operations, such as aggregation on one or more dimensions of the domain and down/upsampling of the coverage, can also be performed on a coverage.
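Range subsetting and aggregation along a domain dimension could look roughly like the following Python sketch for a single location. The helper names `select_fields` and `aggregate_time`, the field names and the values are assumptions made for this example.

```python
from statistics import mean

# One record of fields per time step for a single location (invented values).
records = {
    "2024-06-01": {"red": 0.10, "nir": 0.50},
    "2024-06-11": {"red": 0.12, "nir": 0.54},
    "2024-06-21": {"red": 0.08, "nir": 0.58},
}


def select_fields(records, fields):
    """Range subsetting: keep only the requested fields; the domain is unchanged."""
    return {t: {f: rec[f] for f in fields} for t, rec in records.items()}


def aggregate_time(records, reducer):
    """Aggregate each field over the temporal dimension, which is removed."""
    fields = next(iter(records.values())).keys()
    return {f: reducer([rec[f] for rec in records.values()]) for f in fields}


nir_only = select_fields(records, ["nir"])
temporal_mean = aggregate_time(records, mean)

print(list(nir_only["2024-06-01"]))  # ['nir']
print(sorted(temporal_mean))         # ['nir', 'red']
```

Field selection leaves the set of time steps untouched, while the aggregation returns one value per field and no longer has a temporal dimension.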
Further information: https://github.com/Open-EO/openeo-api/pull/502
A datacube as described here is closely related to the concept of a single xarray DataArray.
A datacube is comparable to a netCDF variable with its dimensions.
Based on what is discussed here and previously in https://gitlab.ogc.org/ogc/T20-GDC/-/issues/14 and #502, I started wondering whether finding a single definition for all these different incarnations of (geo)datacubes is possible at all. Maybe the only commonality is that a ‘(geospatial) Datacube’ stands for the desire to render a multitude of (geospatial) data interoperable and organise them such that working with them as an ensemble is more efficient than working with them individually. This is of course too vague to build a good definition on, one which could help to distinguish what is considered in and what is out. Settling for that type of loose agreement would mean renegotiating the term each time a concrete project is started (as seems to be the case here). This does not sound very efficient either.
A possible way out could be to understand ‘(geo)datacubing’ as a process with several stages which render (geospatial) data increasingly organised and interoperable, thus enhancing the efficiency of dealing with them. Below is what that could look like (6 stages only because of the analogy to cube faces). I would hope that agreeing on certain ‘datacube stages’ might be easier than reserving the name for just one specific stage or for everything from a specific stage onwards.
Curious to hear other opinions, maybe it's just too hot an August afternoon here.
| Stage | Description | Notes |
|---|---|---|
| 1 | Multitude of data which have sufficient metadata to allow ordering them along certain dimensions | |
| 2 | Multitude of data which have declared dimensions to which all single data items can be referenced | |
| 3 | Multitude of data which are referenced to more than one standardized dimension (one of them being a geospatial domain) | At this stage, we have essentially a point cloud in an established CRS |
| 4 | Multitude of data block-wise co-registered (aligned) along at least one identified standardised dimension with all blocks sharing a common geospatial range | This stage marks the forming of layers or coverages which can be ordered and show a geospatial overlap |
| 5 | All layers are co-gridded to a regular grid system | At this stage, all data are organized in layers sharing the same grid or grid system (Q: Are the layers supposed to be gap-free?) |
| 6 | All layers have homologous discretisation (‘gridding’) along all their declared dimensions | At this final (ideal?) stage, the dimensions follow the same algorithmic set of rules, so that operations can equally be applied across all dimensions or domains |
Applicable definitions:
- Data: Value and (usually) uncertainty of a trait of a specific entity
- Dimension: direction or aspect in which a trait can vary or be measured (a single type domain)
- Domain: n-dimensional space created by individual dimensions
- Standardised dimension: Dimension with a standardised (ISO, OGC, SI) reference system (Q: needs to have an axis?)
- Layer: A multitude of data in which all items share at least one metadata value (e.g. being on the earth surface or constant elevation)
- Value: state of a trait within a class or type (domain)
Building on recent discussion in Testbed-20 (https://gitlab.ogc.org/ogc/T20-GDC/-/issues/14), I am setting up a thread to discuss what the definition of a "geodatacube" is. The goal is for the SWG to come up with a definition that can be referenced by OGC.