opendatacube / datacube-core

Open Data Cube analyses continental scale Earth Observation data through time
http://www.opendatacube.org
Apache License 2.0

[proposal] Add support for 3D datasets #672

Closed snowman2 closed 3 years ago

snowman2 commented 5 years ago

There are soil and weather datasets that use a third height/Z dimension for storing data. It would be nice if ODC could optionally support datasets with this dimension.

Is there interest in adding this behavior to ODC?

@alfredoahds

omad commented 5 years ago

ODC is intended to support extra dimensions of data, and there is some support already. As far as I know there hasn't been much use yet, so there could be a few bugs.

The first step is to define a new Metadata Type. See default-metadata-types.yaml for examples. The new type can include extra search_fields for height, which will be of type double-range or integer-range.

The documentation is pretty sparse, but see how you go and we can discuss any problems that come up here.

Kirill888 commented 5 years ago

@omad pretty sure data loading assumes that a raster is a 2D construct. So one would have to add a new dataset for every height/Z slice and then have some custom GroupBy object, and even then you'd lose the time dimension, since group_datasets allows only one non-spatial dimension; see #643.

uchchwhash commented 5 years ago

@snowman2 yes, we have plans to overcome the obstacles that @Kirill888 describes. Like he says, the catch is that one dataset still represents one (2D) raster, so you'd need to have tags in the dataset (like, say, time) that describe the additional non-spatial dimension(s). So it is really Datacube.group_datasets that needs to be updated. See #642 for one (failed) attempt.

alexgleith commented 5 years ago

I understand that CSIRO are looking at, or already using, dimensions to represent bands in a hyperspectral image product. Is this correct, @woodcockr and @petewa? What's your take here?

woodcockr commented 5 years ago

@alexgleith Yes, that is correct. A hyperspectral data cube project just received some seed funding to get the ball rolling. We also have need of 3 spatial dimensions for the reasons that @snowman2 outlined, as well as for going into the solid earth. This n-dimensional aspect is one of several reasons we implemented aspects of NetCDF support and S3AIO. At the moment, though, the ODC data representation seems to be taking a very GeoTIFF-style data structure, so it's a lot more awkward than it needs to be. It would be good to achieve n-dimensional support, especially since xarray supports it.

Kirill888 commented 5 years ago

Example: load 2 3D datasets, one has 10 depth layers for year 2010, another has 17 for year 2018; what does the output look like? We have 2 time axis values, but what about depth? Anywhere between 17 and 27. What information do we need to store per dataset to decide that, or do we decide after inspecting the files? Do we always read all depth layers, or do we allow this type of constraint: -10 < z < 30?

Generalizing to higher dimensions is hard, so it wasn't done. A properly general solution will be hard to come up with and probably hard to use as well. What's the minimum set of features people need?

woodcockr commented 5 years ago

I can't see how that example is any different from loading 2 2D datasets with differing dimension sizes, so I'm not sure what you are getting at with this description.

In the hyperspectral case I currently view n-dimensional support like this: 200+ bands; how do we work with that? In the current ODC implementation a Landsat dataset looks like this:

Dimensions:          (time: 2, x: 492, y: 500)
Coordinates:
  * time             (time) datetime64[ns] 2017-01-07T23:50:29 ...
  * y                (y) float64 -3.953e+06 -3.953e+06 -3.953e+06 -3.953e+06 ...
  * x                (x) float64 1.542e+06 1.542e+06 1.542e+06 1.542e+06 ...
Data variables:
    coastal_aerosol  (time, y, x) float64 1.506e+03 1.506e+03 1.586e+03 ...
    blue             (time, y, x) float64 1.48e+03 1.48e+03 1.565e+03 ...
    green            (time, y, x) float64 1.626e+03 1.626e+03 1.639e+03 ...
    red              (time, y, x) float64 1.625e+03 1.625e+03 1.66e+03 ...
    nir              (time, y, x) float64 2.34e+03 2.34e+03 2.398e+03 ...
    swir1            (time, y, x) float64 1.389e+03 1.389e+03 1.378e+03 ...
    swir2            (time, y, x) float64 1.179e+03 1.179e+03 1.168e+03 ...
    pixel_qa         (time, y, x) float64 992.0 992.0 992.0 992.0 992.0 ...
    sr_aerosol       (time, y, x) float64 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 ...
    radsat_qa        (time, y, x) float64 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ...
Attributes:
    crs:      EPSG:3577

Each of those lines of numbers is truncated for brevity, but each value is associated with a space-time coordinate. As you can see, the data variables list each of the Landsat bands by name (e.g. blue). This is awesome; it's part of what makes ODC so great to use: no more remembering that number “5” is swir1, just use the name. You can then do math on the dataset (call it ds) like this: ndwi = (ds.green - ds.nir) / (ds.green + ds.nir)

With hyperspectral, there would be 200+ of these (near ir, nearly ir, very nearly but not quite ir!), and more importantly, the type of math that can be expressed easily changes. It's now a far better sampling of the spectral response, so there is a tendency to treat it as a single array of wavelengths when using the data. There are a couple of ways this could potentially be done. The first is to add a wavelength dimension, which would look like this (noting I am making up some numbers):

<xarray.Dataset>
Dimensions:          (time: 2, x: 492, y: 500, wavelength: 200)
Coordinates:
  * time             (time) datetime64[ns] 2017-01-07T23:50:29 ...
  * y                (y) float64 -3.953e+06 -3.953e+06 -3.953e+06 -3.953e+06 ...
  * x                (x) float64 1.542e+06 1.542e+06 1.542e+06 1.542e+06 ...
  * wavelength       (wavelength) float64 100 200 300 400 500 ... 2200 ...
Data variables:
    reflectance  (time, y, x, wavelength) float64 1.506e+03 1.506e+03 1.586e+03 ...
Attributes:
    crs:      EPSG:3577

Now each value in the “reflectance” data variable is still a single value, but it is now referenced by space, time AND wavelength.

Alternately, you could represent it without an additional coordinate dimension, but use an array of 200 elements rather than a scalar in the data variable (the change is the float64[200] element type of the reflectance variable below, as it is hard to see otherwise):

<xarray.Dataset>
Dimensions:          (time: 2, x: 492, y: 500)
Coordinates:
  * time             (time) datetime64[ns] 2017-01-07T23:50:29 ...
  * y                (y) float64 -3.953e+06 -3.953e+06 -3.953e+06 -3.953e+06 ...
  * x                (x) float64 1.542e+06 1.542e+06 1.542e+06 1.542e+06 ...
Data variables:
    reflectance  (time, y, x) float64[200] 1.506e+03 1.506e+03 1.586e+03 ...
Attributes:
    crs:      EPSG:3577

Now each value is a 200-element array for each space and time coordinate (though I'm currently not sure if xarray supports this). To my mind at least it makes more sense to have either of the last two structures (most likely the first), rather than a list of 200+ labels, which also makes representing some math awkward. It's also still possible to label the wavelengths 'nir' etc. where it makes sense to do so.
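
To make the wavelength-as-a-dimension option concrete, here is a small sketch against a toy xarray dataset of the same shape as the mock-up above (all names and coordinate values are made up for illustration; this is plain xarray, not an ODC API):

import numpy as np
import xarray as xr

# Toy stand-in for the (time, y, x, wavelength) structure sketched above.
wavelength = np.linspace(400, 2400, 200)
ds = xr.Dataset(
    {"reflectance": (("time", "y", "x", "wavelength"), np.random.rand(2, 5, 5, 200))},
    coords={
        "time": np.array(["2017-01-07", "2017-01-23"], dtype="datetime64[ns]"),
        "y": np.arange(5), "x": np.arange(5), "wavelength": wavelength,
    },
)

# Slice a spectral window instead of naming dozens of individual bands.
visible = ds.reflectance.sel(wavelength=slice(400, 700))

# Spectral math runs across the whole axis in one expression.
mean_spectrum = ds.reflectance.mean(dim=("time", "y", "x"))

# Named aliases are still possible where they make sense.
nir = ds.reflectance.sel(wavelength=860, method="nearest")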

These alternates need to be stored somewhere. The ODC supports several Storage Units, ranging from standard GeoTIFF files through to more sophisticated formats like NetCDF, HDF5 and CSIRO's experimental cloud format called S3AIO (AWS S3 Array Input Output). To my knowledge, of all the Storage Units that ODC supports, only HDF5 and S3AIO can store an arbitrary n-dimensional dataset, as is the case for the two extra forms I've shown. Added to that, the ODC database and storage handling code would also need that n-dimensional capability. This is called dimensional neutrality, in that an arbitrary number of dimensions can exist and they can be treated equally (e.g. you could reorder them and the system would remain performant along any dimension).

The current ODC implementation can handle the hyperspectral data using labelled bands; it's just that I believe it will become awkward very quickly, possibly prohibitively so. The alternates are a better solution in my view and can be done, but it will take some time to work through the necessary changes. Support for other dimensions would of course follow naturally.

That's my thought bubble on the n-dimensional hyperspectral case anyway. It's a topic underway in another CSIRO project. There is a lot I don't know at this stage. Thoughts, collaboration, welcome as always.

Kirill888 commented 5 years ago

@woodcockr Hyperspectral is a bit different from depth data: it's more like "bands", but there are too many to assign meaningful labels like "red". I would expect different datasets from the same hyperspectral sensor to have the same number of "reflectance" bands, so merging two or more datasets into one raster is a fairly obvious operation. This is not the case for depth data, where every dataset might have a different set of depths for which 2D rasters are available. How do you merge those? Should 10m and 11m rasters from two different datasets go into one raster plane or into 2 planes?

woodcockr commented 5 years ago

@Kirill888 Ahh, so your case is merging two different datasets, where mine was one dataset. That said, I would treat the depth example as a sparse array and merge the coordinates, since they are the same data type, so two depths would occur (assuming the same time and space).

pbranson commented 5 years ago

I would add my support for n-dimensional functionality that is based on a coordinated n-dimensional array.

Any math/aggregation/interpolation would be awkward and inefficient if dimensions higher than 3 (time, lat, long) were keys in a dictionary.



snowman2 commented 5 years ago

Example: load 2 3D datasets, one has 10 depth layers for year 2010, another has 17 for year 2018; what does the output look like?

I currently like the idea @woodcockr presented of merging the depth dimension/coordinates together. Similar to how a missing time step is handled currently, ODC could just insert a NaN layer for any depth value a dataset doesn't have. This keeps the data in the "cube" format and makes accessing the data more intuitive, in my opinion.
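
A minimal sketch of that behaviour in plain xarray (toy arrays, not ODC code): combining two depth stacks with different z coordinates pads the missing layer with NaN, which is the effect described above.

import numpy as np
import xarray as xr

# Two depth stacks over the same grid; the second has an extra z=10 layer.
a = xr.DataArray(np.ones((2, 3, 3)), dims=("z", "y", "x"), coords={"z": [0, 5]})
b = xr.DataArray(np.ones((3, 3, 3)), dims=("z", "y", "x"), coords={"z": [0, 5, 10]})

# Concatenating along a new time axis uses an outer join on z, so the first
# time slice gets a NaN layer at z=10 and the result stays a dense cube.
cube = xr.concat([a, b], dim="time")
print(cube.isel(time=0).sel(z=10).values)  # all NaN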

Kirill888 commented 5 years ago

On one hand .load_data supports more than one non-spatial dimension: give it an n-dimensional xarray of dataset tuples and you will get back an (n+2)-dimensional array of pixels, with the extra dimensions being y,x. We can trivially fix group_datasets to support n-dimensional output for the grouping also. But this is not enough to claim higher-dimensional support. This would allow things like an extra dimension for product version or sensor, but it won't help for hyperspectral or depth data.

The fundamental assumption that prevents easy implementation of a true n-dimensional solution is this: (Dataset, band_name) -> single Y,X raster. We assume that a dataset encodes a collection of named 2D rasters, hence all one needs to load a 2D raster from a given dataset is the name of a band. Load needs to operate on 2D rasters at the lowest level, as it does things like projection change, rescaling, and unifying several datasets into one raster plane (mosaicking).

One could say: why not change the assumption to Dataset, band_name -> HyperCube, and then process the last two dimensions like you do now? If you have infinite memory and can afford to wait a while this can work: turn every (Dataset, band_name) into a HyperCube, resample y,x into some normalised space, then use the usual xarray methods for joining a bunch of those into a bigger HyperCube. In practice people have a limited amount of RAM and get annoyed when "nothing happens" for minutes at a time, many sooner than that. So most likely you won't want to load the entire HyperCube; you'll want to select a subset of those 200+ wavelengths, or maybe only depth layers deeper than -100. For depth data people will want to fuzzy-merge depth layers that are "close enough" for their problem domain. Encoding all these needs is not trivial; the solution space is vast.

woodcockr commented 5 years ago

With every challenge there is opportunity, and I for one see we now have a thread that suggests there is demand for ODC to move to Dataset, band_name -> HyperCube. For hyperspectral at least it will be necessary to access all bands, not a subset; it's how a lot of the math works. Memory and speed will be secondary considerations to getting the job done.

The description of how this might work, whilst sensible, seems a little more awkward than I would expect based on our S3AIO and NetCDF work. Storage Units could be n-dimensional as well, directly storing a sparse n-dimensional array. It seems to me this is a logical extension of the current approach, so it would follow a similar path. The drivers for non-n-dimensional storage would clearly need some additional work to construct (and store) an ndarray, but perhaps we could have some drivers that support this and some that don't? xarray seems to do this for some formats it supports. For the drivers that don't support the additional features it could simply pass through. In the end everything is an xarray, and it can do this, so I don't see why the ODC extensions should be any less capable or perform differently.

Kirill888 commented 5 years ago

@woodcockr if you remove (1) X,Y resampling, (2) subsampling and merging of other dimensions, (3) querying for non-X,Y dimensions other than time, and (4) the need to know the size and axis values of the combined HyperCube ahead of time, then sure, things are "easy".

woodcockr commented 5 years ago

@Kirill888 I didn't say it was easy (I reinforced the fact that it was a challenge), nor did I suggest removing the features you list, so I am not sure how you concluded that.

It seemed to be the same path, assuming you can take the n-dimensional structure all the way to the Storage Units (and/or hide it behind a driver) and the performance hit and memory needs are matched with the increased value of getting the job done - which they are for the cases CSIRO are interested in.

FYI, from what I recall of the experiments we've done, your point 4 (the combined size of the HyperCube ahead of time) is known. Mind you, I don't think we implemented all of 1-3 outside of the existing ODC framework's handling of it, which is why things don't perform as well as they could.

CSIRO has a hyperspectral DC project starting up at the moment. I've reached out to them to find out what their planned approach is to see if they might join forces to tackle this together. Might make a good ODC Labs experiment given the number of folks expressing interest.

alfredoahds commented 5 years ago

A bit late, but just for reference: The dataset that sparked this issue from @snowman2 is a soils dataset with data for different depth ranges stored as a netCDF file.

<xarray.Dataset>
Dimensions:      (x: 65, y: 2, z: 11)
Coordinates:
  * z            (z) object '0-5' '5-15' '15-30' '30-45' '45-60' '60-75' ...
  * y            (y) float64 4.7e+06 4.7e+06
  * x            (x) float64 2.597e+05 2.597e+05 2.597e+05 2.597e+05 ...
    time         datetime64[ns] ...
    spatial_ref  int64 ...
Data variables:
    sand         (z, y, x) float32 ...
    silt         (z, y, x) float32 ...
    clay         (z, y, x) float32 ...
    ....
Attributes:
    ...

Our current workaround is appending the range to the attribute name, e.g. sand_5_15, resulting in 100+ attributes per dataset. Without that workaround, since load_data assumes a shape of (len_observations,) + (geobox_shape,), the current setup will just load the first z-index layer.
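
For what it's worth, that flattened naming can be folded back into a z dimension after load with plain xarray. A rough sketch (the sand_5_15-style naming and the soils_ds variable are assumptions based on the comment above; the label ordering would need care in real use):

import xarray as xr

def restack(ds: xr.Dataset, prefix: str) -> xr.DataArray:
    """Stack variables named like sand_0_5, sand_5_15, ... into one array with a z axis."""
    names = [v for v in ds.data_vars if str(v).startswith(prefix + "_")]
    labels = [str(v)[len(prefix) + 1:].replace("_", "-") for v in names]  # '5_15' -> '5-15'
    stacked = xr.concat([ds[v] for v in names], dim="z")
    return stacked.assign_coords(z=labels).rename(prefix)

# sand = restack(soils_ds, "sand")   # -> DataArray with dims (z, y, x)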

woodcockr commented 5 years ago

@alfredoahds Your workaround is similar to the challenge we will face with the hyperspectral data; unwieldy, no doubt. Drop me a line if you are interested in working with us on full n-dimensional support. We have a small amount of resources to look into the issue, and perhaps combined we can find a better solution.

Kirill888 commented 5 years ago

@alfredoahds thanks for the example. How consistent is this data across the time and spatial domains? Are they all the same shape as far as the z dimension goes, and do they all cover the same depth ranges? Do you have many tiles for the same time slice?

I noticed the z dimension axis values are strings, which might complicate some things, particularly if the ranges are not fixed across time or space.

woodcockr commented 5 years ago

For all the examples in my world the structure is consistent. It may be sparse (i.e. no data at a given point in space, time, depth, pick an n-dimension) but it is consistent. It's why it's being treated as a dimension.

Kirill888 commented 5 years ago

Supporting a single fixed-size non-t,x,y dimension is certainly much more feasible than supporting arbitrary N extra dimensions whose shape and axis values are configured per dataset, yet which need multiple datasets to be seamlessly fused together to form a single hypercube on load.

alfredoahds commented 5 years ago

Our use case is the same as @woodcockr with the z-index always consistently defined across the dataset. I'm not particularly attached to the string indices, those can be converted to a more convenient dtype and interpreted by the end users without much fuss.

I'll check with some in our group about our available capacity for working towards full n-dim support. @snowman2 has a research task on the board to look into the effort level/what would be required for support/etc.

snowman2 commented 5 years ago

Alright, so I tinkered around with the code and have a really rough implementation here for the purposes of furthering the discussion towards a sustainable implementation.

I think extra_dims could be renamed to 3d_dim for clarity.

Current usage:

z_coord = numpy.array(
    ['0-5', '5-15', '15-30', '30-45', '45-60', '60-75', '75-90', '90-105', '105-120', '120-150', '150-180']
)
Datacube.load_data(
    ...
    measurements=["om", "sand"],
    extra_dims=(("z", z_coord),)
)

Output:

<xarray.Dataset>
Dimensions:      (time: 1, x: 32, y: 926, z: 11)
Coordinates:
  * time         (time) datetime64[ns] 2019-01-01
  * z            (z) <U7 '0-5' '5-15' '15-30' '30-45' '45-60' '60-75' ...
  * y            (y) float64 4.709e+06 4.709e+06 4.709e+06 4.709e+06 ...
  * x            (x) float64 2.6e+05 2.6e+05 2.6e+05 2.601e+05 2.601e+05 ...
    spatial_ref  int64 0
Data variables:
    sand         (time, z, y, x) float32 ...
    ...
Attributes:
    ...
dataset.sand[0, 0].max()
<xarray.DataArray 'sand' ()>
array(42.056175, dtype=float32)
Coordinates:
    time         datetime64[ns] 2019-01-01
    z            <U7 '0-5'
    spatial_ref  int64 0

Knowing the Z coordinate information beforehand is not great. So, to address this:

  1. Could load it in from the first file
  2. Could load in in from the DB
  3. ...?

Also, do you have ideas on a better implementation? General thoughts?

woodcockr commented 5 years ago

@snowman2 nice workaround to allow further exploration, adding the additional dimensions metadata to the load API call. Getting it from the database, as you point out, would be more in keeping with the ODC architecture.

What happens if you perform a reprojection/resampling? Just wondering about the side effects. @petewa will likely be working on this for the CSIRO hyperspectral DC work which will be hosted in ODC labs in the near future (I hope).

Kirill888 commented 5 years ago

Thanks @snowman2 for the feasibility study. Just want to make sure everyone is aware that the IO code is undergoing significant change to enable concurrent reads. It's great to experiment, just so long as there is no expectation for this to be merged in a hurry. Also, there is a lazy version of load that constructs dask arrays, and that code path will also need equal support in the final version.

woodcockr commented 5 years ago

@Kirill888 Thanks for the heads up. CSIRO is aware, of course, and we discussed yesterday where we should peg our prototype from. We've decided not to wait for the IO refactor and to work off just before that to test the concepts. Once it starts to look feasible we can come back, look at how to do it for real, and put up the necessary proposal to get it into production. That might be a while away, but all things start small.

Kirill888 commented 5 years ago

I think the main issue here is to decide what we will and won't support, then define the metadata structure necessary to implement that. Decisions need to be made along the following axes:

  1. Generic n-d support vs 2d and 3d only
    • Note that this is for the Dataset -> pixel space mapping only; load_data supports n-dimensional output already, as discussed above
  2. Fixed structure of extra dimension(s) vs different for every dataset
    • If going for varying size/axis values per dataset, then we need to decide what level of merging across non-time dimensions to support
    • If extra dimensions vary from dataset to dataset within a product, then this information needs to be defined in the dataset metadata document, in a standard form that needs to be defined.
    • If the structure is fixed per product (hyperspectral, for example), then it still needs to be defined, but this time in the product metadata.
  3. What mechanisms for sub-sampling across extra dimensions at load time do we need to support?

Either way, the following constraint needs to be satisfied by the metadata:

Other things to consider: do we still keep the constraint of one timestamp, one dataset? If a dataset can describe 3D data, do we allow time to be the third dimension? If not, how do we prevent that from happening?

woodcockr commented 5 years ago

I'd be inclined to let time also be >1. This has been done previously and it's a stronger generalisation. It has ramifications for some storage types when being updated and for how provenance is done, but there is nothing to stop an ODC deployment using just 1 timestamp to support different governance needs for their archive (based on what we are seeing happen, it's likely most EO custodians would go with 1 timestamp, and anyone optimising for performance will go with whatever storage chunking makes sense to them).

The trade-off is that this might complicate some code, but last time it wasn't too bad.

Also, some back-end storage formats probably won't support all multidimensional forms (not without some assist from the ODC code anyway). Perhaps the way forward is to have some sensible baseline features but not cut off the facility to do more when it comes to the storage. xarray does something like this, because it supports full n-d but not all of its storage formats do.

Kirill888 commented 5 years ago

The issue with allowing time > 1 per dataset is that it forces you to implement all of the subsampling/merging/grouping behaviour we currently have for the time axis as part of the support for 2d+ datasets, and also forces your choice for point (2) to be the most generic (the time axis is never fixed across all datasets), and strongly suggests a generic solution for (1) as well.

There are also implications for provenance tracking, like you (@woodcockr) mentioned already; provenance for part of a dataset doesn't really make sense. As far as provenance goes, a Dataset is not subdividable; we do not track which bands were used, for example.

I think efficient data access for datasets that share a common storage unit can be achieved by other means, and the current IO refactor will provide the necessary tools for an IO driver to implement those. I'm familiar with this problem since we use stacked NetCDF files, where the penalty on IO performance is quite obvious.

snowman2 commented 5 years ago

Generic n-d support vs 2d and 3d only

I would base this decision on the use cases. Is there an existing need out there for anything other than 2d/3d? If not, I would go with just 2d/3d for simplicity.

Fixed structure of extra dimension(s) vs different for every dataset

Again, is there a use case for "different for every dataset"? If not, I would go with a fixed structure for simplicity's sake.

do we still keep the constraint: one timestamp one dataset

What use cases are there for time >1? What benefit will be gained from this? If minimal benefit, then I would go with one time stamp per dataset.

If a dataset can describe 3D data, do we allow time to be the third dimension? If not, how do we prevent that from happening?

I think that the metadata and defined default behavior can be the solution here.

+1 for storing the information about the dimensions in the metadata.

Anyway, just my initial impression. I think making small steps is a good path forward. Maybe start with the simple implementation of only 2d/3d structured datasets with 1 timestamp per dataset. Then, on the next iteration add nd datasets. Then on the next iteration, add datasets with varying sizes. Etc...

woodcockr commented 5 years ago

Generic n-d support vs 2d and 3d only

I would base this decision on the use cases. Is there an existing need out there for anything other than 2d/3d? If not, I would go with just 2d/3d for simplicity.

We need to take care with what we mean by 2d/3d in this context. So far we have need of 2d geospatial + 1d depth and/or 1d hyperspectral + 1d time (sparse).

I added time as "sparse" since it most commonly is, though of course the others could be.

Fixed structure of extra dimension(s) vs different for every dataset

Again, is there a use case for "different for every dataset"? If not, I would go with a fixed structure for simplicity's sake.

Agreed, I think we are talking about fixed dimensions, though the dataset contents are sparse.

do we still keep the constraint: one timestamp one dataset

What use cases are there for time >1? What benefit will be gained from this? If minimal benefit, then I would go with one time stamp per dataset.

Any storage format that can handle full n-dimensional support (e.g. HDF, NetCDF, Zarr, S3AIO) can provide a performance benefit when performing calculations along the time dimension for certain computations. It's the same as any access-pattern optimisation problem: how you do the chunking determines the benefit you get.

I think making small steps is a good path forward. Maybe start with the simple implementation of only 2d/3d structured datasets with 1 timestamp per dataset. Then, on the next iteration add nd datasets. Then on the next iteration, add datasets with varying sizes. Etc...

We are already multi-dimensional, just not flexible with multiple additional dimensions, so I think the next small step is really n-dimensional with fixed dimensions across datasets. Time's sparsity is an interesting case though, with 1 timestamp per dataset being the current cure.

@Kirill888

I think efficient data access for datasets that share a common storage unit can be achieved by other means, and the current IO refactor will provide the necessary tools for an IO driver to implement those. I'm familiar with this problem since we use stacked NetCDF files, where the penalty on IO performance is quite obvious.

You're not alone with your familiarity; I coded some of the DC's first NetCDF trials. I'm not quite sure which direction you are coming from with "by other means", and which penalties you are referring to.

Could you elaborate on your thinking and experience, please?

Kirill888 commented 5 years ago

I think arbitrary n-d data per dataset is too complex, not just from an implementation but also from an interface point of view. Having time both across Datasets and within a Dataset is too confusing, and has implications for provenance and the other issues described above.

So here is my preferred set of requirements:

  1. Dataset[band] -> 2d | 3d pixels per band
    • All bands share the same number of dimensions: 2d or 3d
    • Individual bands can have different resolution in y,x, but not z. z, if present, is fixed across bands
    • z dimension can not be time dimension, create extra Datasets for that (just like now)
    • The last two dimensions are y,x, so assume: [z, ]y, x
  2. Assume fixed size extra dimension across all datasets and all bands within a product
    • Extra dimension should be defined in the product definition
      • Name for the z axis: str, anything compatible with a Python variable name, with the exception of time|t|x|y|longitude|latitude
      • Values for coordinates of the axis: List[float|str] matching size of the dimension, no duplicates, sorted in the same order as returned by .read
  3. Slicing of the z dimension on load should be supported. People will want RGB mosaics from hyperspectral, and they won't want to wait 20-30+ times longer than needed to construct them.

@woodcockr regarding IO from the same storage unit referenced by multiple datasets being loaded at the same time:

  1. The cost is in re-opening the NetCDF file and possibly reading data several times, depending on the chunking regime within the file. In our case stacked NetCDFs have a chunking of 5 across the time dimension (some silly NCI requirement that was forced on us), so we throw away 80% of the work done due to the current IO driver design.
  2. Way to address that is partially described here: https://github.com/opendatacube/datacube-core/issues/625#issuecomment-454647552
    • The basic idea is to allow the driver to have a cache, and to tell the driver ahead of time what will be loaded (or possibly will be loaded, if using lazy arrays via dask)

snowman2 commented 5 years ago

I like the requirements 👍

Kirill888 commented 5 years ago

@woodcockr also see (4) and (5) in the doco I shared with CSIRO earlier:

https://gist.github.com/Kirill888/be5e61317ae2ea69935a0811fe971f89#guiding-principles

alexgleith commented 5 years ago

Just wanted to point out the datacube extension proposal for STAC, over here: https://github.com/radiantearth/stac-spec/tree/master/extensions/datacube. Could be relevant context for this discussion.

woodcockr commented 5 years ago

@alexgleith thanks for that. That extension proposal is relevant in terms of specification, but not in terms of implementation of course.

Looking over the discussion it seems to me that two things are being maintained:

  1. spatial dimensions are grouped together
  2. attribute dimensions precede time+spatial dimensions

This results in the indexing of an attribute+time slice being a contiguous nd-array which can be treated as a uniform type. This is a consistent extension of the current DC bands+time+space ordering; it's just that the bands are labelled attributes rather than an explicit dimension. For hyperspectral this could be made a preceding dimension to prevent the awkward label explosion. Let's call this Type A.

The alternative ordering of time, space, attribute would result in the contiguous nd-array being the attributes (time+space sliced). Call this Type B.

Both could work but have implications for storage and for use in mathematical expressions. There are times when a hyperspectral signal is treated as a whole, not just as individual spectra as is the case for multi-spectral. Type B is more natural for the "treat it as a whole" case.

Type A seems to suit the provenance case and is more incremental to the current ODC implementation. But the "treat it as a whole" case for the hyperspectral bands is going to be interesting. Numpy et al. do some funky things these days to speed up that kind of thing, so we'll see I guess.

Just thinking out loud trying to get my head around the implications for use.

Kirill888 commented 5 years ago

@woodcockr I haven't really specified axis order other than that y,x are the last two, to make it easy for the reprojecting case. Given the prototype code by @snowman2, it seems the most likely axis order is t,[z],y,x, where z might be omitted for datasets that don't have a 3D dimension.

This order makes sense to me, since datacube collates datasets along time dimension, and each dataset is either a single 2d raster, or a fixed number (>1) of 2d rasters.

One can always (RAM permitting) transpose post-load for a particular data access pattern. I'd rather avoid trying to be over-flexible with the load-time axis order; yes, this would allow the most efficient way to load data, but we simply can not afford that level of complexity and the corresponding test load.
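
For what it's worth, a post-load transpose is a one-liner in xarray; a toy illustration (made-up sizes, plain xarray rather than ODC code):

import numpy as np
import xarray as xr

# Toy cube in the proposed load order: time, wavelength, y, x.
cube = xr.DataArray(np.random.rand(2, 200, 5, 5),
                    dims=("time", "wavelength", "y", "x"))

# Reorder post-load for a "treat the spectrum as a whole" (Type B) access pattern.
type_b = cube.transpose("time", "y", "x", "wavelength")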

woodcockr commented 5 years ago

@Kirill888 If you include the existing attribute access, I think it's band, t, y, x currently, where the last three are xarray dimensions. For hyperspectral that would extend to wavelength, t, y, x. Then if you add a spatial depth it would be wavelength, t, z, y, x OR wavelength, z, t, y, x, where all are xarray dimensions.

I think that is all the options. Which ordering is preferable? Which reduces code changes?

So long as y,x is at the end you get the easy reprojecting case. What it does for how you slice and dice to do the math varies, though. With hyperspectral wavelengths as a dimension you can at least slice it readily, making the math less awkward (same for depth).

we simply can not afford that level of complexity....

Why? No need to answer that; just highlighting that the ODC and EO needs are currently narrow, and it's clear from other DC-related initiatives globally that dimensional neutrality is a much sought-after feature, and for good reason. I'm happy to see it evolve in ODC whilst keeping all the simple things ODC makes easy and fast, and not leap to dimensional neutrality. We do have a call for hyperspectral and depth so far, and judging by this discussion it looks like we can keep this incremental.

Kirill888 commented 5 years ago

@woodcockr I thought we agreed to limit support to 2d|3d only per dataset, so you have to choose hyperspectral or depth; you can not have both at the dataset level. We still keep band, it doesn't go anywhere, but for every band the dimensions are t,{optional_extra_dim},y,x or t,y,x.

Note that you can still have more dimensions at load time; you just have to create extra dataset objects for every value along that extra dimension, which you can do right now by implementing a custom group-by. Then it's your choice whether to put t,depth or depth,t before wavelength,y,x. The extra dimension per dataset will always go right before y,x; all other dimensions (including time) are decided by the group_datasets code, which has user-customisable hooks.

Why? No need to answer that just highlighting that the ODC and EO needs are currently narrow

Because the complexity budget is finite, if hard to assess, and more people will benefit from this complexity being spent on concurrent data loading into a fixed structure than being spent on arbitrarily complex rules for how to structure dimensions during loading.

woodcockr commented 5 years ago

@Kirill888

I thought we agreed to limit support to 2d|3d only per dataset,

No, looking back at my comments I see only that you proposed this as a preferred set. I've then started a discussion on what this means. I can't see a complete set of agreed comments.

What I don't understand is this:

We still keep band, it doesn't go anywhere, but for every band, dimensions are t,{optional_extra_dim},y,x or t,y,x.

In the hyperspectral case the bands need to be a dimension, not just labelled attributes. That way they can be sliced. To some extent I'm not overly bothered by the order of dimensions but by the ability to express a slice across bands. If "we still keep band" in its current form, then the proposal doesn't address the hyperspectral need for a wavelength dimension. Assuming we didn't require the z dimension, if I've understood you correctly we'll end up semantically with "band" as an attribute label AND wavelength as a dimension, which is not particularly helpful.

Did I misunderstand something?

I'm keen to add a z spatial dimension; the earth is after all 3D, and it opens up some geophysics-related opportunities for the DC. If we do both hyperspectral wavelength and 3D spatial, what is the logical extension? Could we not generalise "bands" as a dimension to support multiple additional observations/attributes? It is actually a dimension now; it's just not accessed that way. I don't think using group_by is a good option for this; it should be more natural if this is indeed a Data Cube.

Clearly there are some specific limitations being placed here by the ODC code, since xarray and some of the storage formats can support arbitrary dimensions. I'm all for a set of sensible limits to reduce complexity and ease use; we just need to make sure we can readily support the quite logical extensions: observations as dimensions (not just labelled attributes) and spatial depth being the most logical.

Because the complexity budget is finite, if hard to assess, and more people will benefit from this complexity being spent on concurrent data loading into a fixed structure than being spent on arbitrarily complex rules for how to structure dimensions during loading.

That is a perspective, of course. Evidence-based reviews of requirements conducted by other groups show that there is demand for dimensional neutrality, aka "more people" want that, and in some groups and projects people are prepared to pay for it. It may be beyond the bounds of existing funding, but it certainly isn't out of the question. Right now, though, we have a need for some observations as a dimension and z depth.

Kirill888 commented 5 years ago

@woodcockr the output of .load is an xarray.Dataset, which is a collection of labelled xarray.DataArrays; it's just a fancy dictionary mapping str -> ndarray with labelled axes. We can't change .load to sometimes return an xarray.DataArray for the hyperspectral case, so you'll have to have a name for your hyperspectral "band" that has wavelength,y,x axes (in that order).

That is a perspective, of course. Evidence-based reviews of requirements conducted by other groups

Fine. The scientists I support, and the people who pay MY salary, are more interested in increasing throughput and decreasing latency for the normal 2D case than in supporting an arbitrary number of dimensions on load for data sources they don't even have access to, because increasing throughput and decreasing latency pays off in reduced cloud costs and increased productivity of researchers.

Surely you can not argue against 2D datasets being more available, and more used, than 3D or higher, particularly looking at publicly accessible data.

Nothing prevents you from using zarr or netcdf libs to store and load arbitrary nd-data, just not through dc.load; use dc.find_datasets for the query, then do whatever you want to turn that into an nd-raster.
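
A rough sketch of that escape hatch, assuming a locally stored NetCDF-backed product (the product name, time range and per-file handling here are hypothetical and illustrative, not a real IO driver):

import datacube
import xarray as xr

dc = datacube.Datacube()

# Query the index only; no pixel data is touched here.
datasets = dc.find_datasets(product="hyperspectral_example",
                            time=("2017-01-01", "2017-02-01"))

# Each indexed Dataset knows where its file lives, so open the files directly
# and assemble whatever n-dimensional structure the analysis needs.
cubes = [xr.open_dataset(ds.local_path) for ds in datasets]
combined = xr.concat(cubes, dim="time")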

woodcockr commented 5 years ago

@Kirill888 This is an ODC community, not a "my organisation" project. I am simply advocating for a feature set and being explicit as to the drivers for that. The ODC community as a whole may choose to do something else, which is fine. It is important though that these other community views be considered as others may well contribute and/or come up with one that does it without penalty. All core contributors are faced with the same dilemma that you face in terms of what they are supported to do by their organisation, but that is not a reason to not support an ODC proposal. The ODC Partners recently agreed to jointly support some new features, so perhaps you will get to work on some ODC extension in the future with broader need than those in your immediate organisation.

So back to the primary topic, which is getting dc.load to do what we, the ODC community, would like (my interest is ODC, not other tools).

If I have understood you correctly, you're advocating for this form of data output:

<xarray.Dataset>
Dimensions:          (time: 2, wavelength: 200, x: 492, y: 500)
Coordinates:
  * time             (time) datetime64[ns] 2017-01-07T23:50:29 ...
  * y                (y) float64 -3.953e+06 -3.953e+06 -3.953e+06 -3.953e+06 ...
  * x                (x) float64 1.542e+06 1.542e+06 1.542e+06 1.542e+06 ...
  * wavelength       (wavelength) float64 100 200 300 400 500 ... 2200 ...
Data variables:
    reflectance  (time, y, x, wavelength) float64 1.506e+03 1.506e+03 1.586e+03 ...
Attributes:
    crs:      EPSG:3577

I've hopefully just restated what you said earlier, but in my own words to ensure I have understood. If this is correct then...

Yes, I believe the hyperspectral case can be met by this construct being supported in dc.load(), as can the depth case (but not both at once).

I believe the hyperspectral team will be content with this approach being supported. I have the opportunity to influence how it might be implemented, so if you have a specific implementation approach in mind let me know and I'll pass on the interest. @snowman2 same for you if you agree with this approach (which you seemed to earlier); I am sure they would welcome collaboration.

Kirill888 commented 5 years ago

@woodcockr your understanding looks correct, with the minor caveat that it is possible to have more dimensions, just not described by one single dataset. So if you have hyperspectral data across several depths you can model that, but you will need to generate extra datasets along the depth dimension and you will have to use a custom group_by construct.

pbranson commented 5 years ago

I would like to raise a (perhaps esoteric) use case I have wondered about representing in the ODC. It involves deriving a 2D spectrum of the wind waves from fragments of Sentinel-2 data (as published here: http://doi.org/10.1002/2016JC012426).

The dataset would conceptually look something like this:

Dimensions:    (time: 2, frequency: 50, direction: 50, x: 492, y: 500)
Coordinates:
  * time       (time) datetime64[ns] 2017-01-07T23:50:29 ...
  * y          (y) float64 -3.953e+06 -3.953e+06 -3.953e+06 -3.953e+06 ...
  * x          (x) float64 1.542e+06 1.542e+06 1.542e+06 1.542e+06 ...
  * frequency  (frequency) float64 100 200 300 400 500 ... 2200 ...
  * direction  (direction) float64 100 200 300 400 500 ... 2200 ...
Data variables:
    variance   (time, y, x, frequency, direction) float64 1.506e+03 1.506e+03 1.586e+03 ...
Attributes:
    crs:      EPSG:3577

So with the current 'proposal' I would model it by decomposing one of the frequency or direction dimensions into 'bands'. So dc.load would hypothetically return:

Dimensions:    (time: 2, frequency: 50, x: 492, y: 500)
Coordinates:
  * time       (time) datetime64[ns] 2017-01-07T23:50:29 ...
  * y          (y) float64 -3.953e+06 -3.953e+06 -3.953e+06 -3.953e+06 ...
  * x          (x) float64 1.542e+06 1.542e+06 1.542e+06 1.542e+06 ...
  * frequency  (frequency) float64 100 200 300 400 500 ... 2200 ...
Data variables:
    direction1 (time, y, x, frequency) float64 1.506e+03 1.506e+03 1.586e+03 ...
    direction2 (time, y, x, frequency) float64 1.506e+03 1.506e+03 1.586e+03 ...
    direction3 (time, y, x, frequency) float64 1.506e+03 1.506e+03 1.586e+03 ...
    directionN (time, y, x, frequency) float64 1.506e+03 1.506e+03 1.586e+03 ...
Attributes:
    crs:      EPSG:3577

After dc.load I could then use xr.concat([direction1, direction2, ...]) to recompose these into a single DataArray? And the data could be stored on disk in some arbitrary format supported by the new futures IO proposal, so the dc.load and xr.concat (with dask) should return without actually pulling the data?

Kirill888 commented 5 years ago

@pbranson if you have more than one non-time-y-x dimension axis, things will still be annoying to work with under the current proposal. The approach you describe will work, but it is only marginally better than what you can already do (you will still need 50 variables, instead of 2500).

Another approach is to generate one dataset for every directionN (so 50 datasets covering the same region/time in your example), then use a custom group_by to arrange those datasets into a hypercube.

Yet another approach is to have a fake axis containing 2500 fake points and then reshape that into 50x50. This could be an OK escape route for when you do need more than one extra dimension. Again, this only works when your extra dimensions have a fixed shape across all datasets (and bands) within a product.
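
A sketch of that reshape trick in plain xarray (sizes follow the example above; the fake axis is assumed to be ordered direction-within-frequency, and newer xarray versions may prefer xr.Coordinates.from_pandas_multiindex for the labelling step):

import numpy as np
import pandas as pd
import xarray as xr

# Toy cube with a single fake axis of 2500 points standing in for 50x50.
fake = xr.DataArray(np.random.rand(2, 2500, 5, 5),
                    dims=("time", "fake", "y", "x"))

# Label the fake axis with a (frequency, direction) MultiIndex, then unstack
# it into two real dimensions.
idx = pd.MultiIndex.from_product([np.arange(50), np.arange(50)],
                                 names=["frequency", "direction"])
cube = fake.assign_coords(fake=idx).unstack("fake")
# cube now has dims (time, y, x, frequency, direction)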

Kirill888 commented 5 years ago

@alexgleith as discussed this morning, I have created a wiki page for this proposal:

https://github.com/opendatacube/datacube-core/wiki/ODC-EP-001---Add-Support-for-3D-Datasets

alexgleith commented 5 years ago

Excellent. Thanks, @Kirill888

woodcockr commented 5 years ago

@pbranson I don't believe your case is esoteric, though it does currently sit just outside the bounds of the ODC code base and its conveniences for handling data. Check your email, I have a suggestion for you.

snowman2 commented 4 years ago

Any more thoughts on this?

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

snowman2 commented 4 years ago

Just adding a note here that I plan to make an attempt based on the current version of ODC. My current thought is to add something like this to the Dataset class:

    @property
    def additional_coordinates(self) -> List[Dict[str, Any]]:
        """
        This is for n >= 4 dimensional rasters.

        The order for adding coordinates:
          (time, *additional_coordinates, y, x)

        Requirements:
        - Returns an ordered list with the order and contents being consistent across all datasets.
        - The values should be scalars.

        Example::

        [
            {
                "name": "depth",
                "description": "Depth from surface.",
                "units": "meter",
                "value": 10,
            }
        ]

        """
        try:
            return self.metadata.additional_coordinates
        except AttributeError:
            return []

And then using this information to update the groupby and sorting logic when loading the datasets.
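
A very rough illustration of that last step (assuming the additional_coordinates property above and that the grouping/sorting code can take a plain sort-key function; none of this is existing ODC API):

def dataset_sort_key(dataset):
    """Order datasets by time first, then by each extra coordinate value.

    Assumes `dataset` is a datacube Dataset exposing the additional_coordinates
    property sketched above, following its (time, *additional_coordinates, y, x) ordering.
    """
    return (dataset.center_time,) + tuple(
        coord["value"] for coord in dataset.additional_coordinates
    )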