opendatacube / datacube-core

Open Data Cube analyses continental scale Earth Observation data through time
http://www.opendatacube.org
Apache License 2.0
510 stars · 177 forks

[proposal] Add support for 3D datasets #672

Closed snowman2 closed 3 years ago

snowman2 commented 5 years ago

There are soil and weather datasets that use a third height/Z dimension for storing data. It would be nice to be able to have ODC optionally support datasets with this dimension.

Is there interest in adding this behavior to ODC?

@alfredoahds

snowman2 commented 4 years ago

Initial mockup: https://github.com/opendatacube/datacube-core/compare/develop...corteva:3dim. I plan to test on something we have internally to check if it works. If you notice any issues or have suggestions, please share. Thanks!

Kirill888 commented 4 years ago

@petewa :arrow_up:

Kirill888 commented 4 years ago

@snowman2 we cannot have additional coordinates defined on a Dataset; they have to be defined on the Product, and all datasets within a product must share common depth dimensions. This is what we agreed to in the proposal discussions. Supporting variably sized depth dimensions is out of scope: it's too hard to define and implement merging of those dimensions during load.

snowman2 commented 4 years ago

@Kirill888, thanks for taking a look - I am hoping this implementation will get some momentum going to get an initial version in datacube.

we can not have additional coordinates defined on a Dataset

The coordinates defined on the Dataset contain the values of the coordinates. This is needed in order to group and sort the datasets so they are put into the correct locations.

all datasets within a product must share common depth dimensions

With the current implementation, this is still the case. It expects all the datasets within a product loaded over the query to have all of the values of the dimension present. If this is not the case, I would expect incorrect behavior. I think it would be helpful to have the added dimension values defined on the Product as well, in order to validate that all of the expected values are represented by the collection of Datasets. Does that make sense/sound like a good idea?
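To make the idea concrete, here is a minimal sketch of that validation step. The field names (`extra_dims`, `depth`) are hypothetical, not datacube's actual schema; the point is simply that depth values declared on the Product can be checked against the values recorded on the Datasets:

```python
# Hypothetical schema: the Product declares the expected depth values,
# and each Dataset records the single depth value it covers.
product = {"name": "soil_moisture", "extra_dims": {"depth": [5.0, 15.0, 30.0]}}

datasets = [
    {"id": "a", "time": "2020-01-01", "depth": 5.0},
    {"id": "b", "time": "2020-01-01", "depth": 15.0},
    {"id": "c", "time": "2020-01-01", "depth": 30.0},
]

def validate_depths(product, datasets):
    """Check that the datasets collectively cover exactly the declared depths."""
    expected = set(product["extra_dims"]["depth"])
    found = {ds["depth"] for ds in datasets}
    missing = expected - found       # declared on the Product but not loaded
    unexpected = found - expected    # present on a Dataset but never declared
    return missing, unexpected

missing, unexpected = validate_depths(product, datasets)
assert not missing and not unexpected
```

A failed check (non-empty `missing` or `unexpected`) would signal exactly the "incorrect behavior" case described above, before any pixels are loaded.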

Kirill888 commented 4 years ago

The coordinates defined on the Dataset contain the values of the coordinates. This is needed in order to group and sort the datasets so they are put into the correct locations.

That means that different datasets can have different coordinate values/sizes and hence would require fancy merging.

If those are fixed for all datasets, then that should be defined on a product, including coordinate values.

snowman2 commented 4 years ago

Sounds like there are more use cases to be considered. I will need to re-read the proposal later.

The current implementation assumes that a single dataset represents a single time slice and a single value for the additional coordinate(s). It requires each coordinate on each dataset to be a scalar.

Kirill888 commented 4 years ago

I see. What I meant by "3d support" was extending Dataset to encode not just a bunch of named 2D rasters (measurements/bands), but rather an ND stack of 2D rasters (dim1, dim2, ..., dimN, Y, X), with the caveat that all datasets in such an ND product encode a hyper-cube of the same shape except for the Y, X dimensions. Y/X can be differently sized and cover a different region per dataset.

What you seem to be proposing is a "common convention" for encoding coordinate values other than time, and extending the groupby mechanism to properly support those. No changes to the IO driver are needed, as the current model of Dataset + band -> single YX pixel plane remains as is. This approach was deemed untenable for the hyper-spectral use-case, as it would blow up the number of datasets too much for sensors with hundreds of spectral bands.

What use-case are you targeting Alan?

snowman2 commented 4 years ago

What use-case are you targeting Alan?

Looks like I am making the simple modifications mentioned at the top of the thread and in the main section of the proposal. I missed (or likely forgot, since it has been a year) the changes mentioned later about supporting loading an n-d dataset. That sounds like quite a few changes would be needed in the IO drivers before it could happen, as reading datasets with more than three dimensions isn't well supported by rasterio.

I assumed the issue was dying due to the stale bot, so I was thinking the simpler the better in order to get something added. The current implementation solves our needs, as we only have the depth dimension on our datasets. But even with that, a netCDF file with multiple measurements and multiple depths was getting pretty large. So it made more sense from our perspective to break it up into smaller chunks by depth, and even by measurement, rather than storing it as a single large file.

Benefits of the existing implementation from our perspective:

I haven't heard much on the discussion here recently. Is there a timeline for developing the alternate method for loading n-d datasets mentioned in the modified section of the proposal?

snowman2 commented 4 years ago

After talking about it with @alfredoahds, we found that the measurements within the Product have depths that vary across measurements. So neither the solution I proposed nor the other solutions mentioned here will resolve our needs. What we currently have is the depth built into the measurement name: though it is hacky, it allows us to query for the depth of interest and have measurements with varying depths all within the same dataset. Once loaded, the dataset can be manipulated as needed. So I think we will likely discontinue our pursuit of 3D dataset loading in datacube for now.
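For readers curious what the depth-in-the-name workaround looks like in practice, here is a sketch under an assumed naming convention (the `_<N>cm` suffix and measurement names below are hypothetical, not from the actual product):

```python
import re

# Hypothetical naming convention: the depth in cm is suffixed onto each
# measurement name, so measurements with different depth sets coexist.
measurements = ["soil_moisture_10cm", "soil_moisture_30cm",
                "soil_temp_5cm", "soil_temp_50cm", "soil_temp_100cm"]

def measurements_at_depth(names, depth_cm):
    """Select measurement names encoding the requested depth; return base names."""
    pattern = re.compile(rf"^(?P<base>.+)_{depth_cm}cm$")
    return [m.group("base") for name in names if (m := pattern.match(name))]

# Query for the depth of interest; measurements at other depths are ignored.
assert measurements_at_depth(measurements, 10) == ["soil_moisture"]
```

The obvious drawback, as noted above, is that the depth dimension is invisible to datacube itself and has to be reconstructed by string parsing after load.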

Kirill888 commented 4 years ago

It looks like parts of the groupby logic were designed with the idea that any number of non-spatial dimensions is allowed, not just time. But then time was the only one that was supported/used/tested, so other parts of the code now assume a single non-spatial dimension; see #643.

So, to support your use-case:

  1. Fix issue #643
  2. Generate separate Datasets for each non-time dimension and record the coordinate value on each
  3. Use a custom GroupBy (see step 1)

If that works fine, then we can move on to making (3) more generic by agreeing on a schema for recording coordinate values other than time in the Dataset document, and then using that with a generic GroupBy that understands dimensions other than time.

xx = dc.load(..,
             group_by=multi_coord_groupby(['time', 'depth']))
assert xx.dims == ('time', 'depth', 'y', 'x')
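As a pure-Python illustration of step 3, a multi-coordinate group-by could work roughly like this. The `multi_coord_groupby` helper and the dataset records below are hypothetical sketches, not datacube's actual `GroupBy` API:

```python
from collections import defaultdict

# Hypothetical dataset records: each is a single (time, depth) slice,
# per step 2 above (one Dataset per non-time coordinate value).
datasets = [
    {"time": "2020-01-01", "depth": d, "path": f"t0_d{d}.tif"} for d in (5, 15)
] + [
    {"time": "2020-01-02", "depth": d, "path": f"t1_d{d}.tif"} for d in (5, 15)
]

def multi_coord_groupby(dims):
    """Return a grouping function keyed on the given non-spatial dimensions."""
    def group(datasets):
        groups = defaultdict(list)
        for ds in datasets:
            groups[tuple(ds[d] for d in dims)].append(ds)
        # Sorted keys give the coordinate order along each output dimension.
        return dict(sorted(groups.items()))
    return group

grouped = multi_coord_groupby(["time", "depth"])(datasets)
# Each key is one (time, depth) cell of the output hyper-cube.
assert ("2020-01-01", 5) in grouped
```

Each group would then be loaded into its own (y, x) plane and stacked along the extra dimensions to produce the `('time', 'depth', 'y', 'x')` result sketched above.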
snowman2 commented 4 years ago

@Kirill888, those steps would definitely be helpful for supporting 3D loading, and I would like to see that added.

Our use case is messier than I originally thought.

For example:

Our current workaround to enable loading them together is to add measurements for each depth. For example:

This allows for the measurements to be selected by depth and loaded together without conflicts.

What you describe will allow for loading each of the measurements separately in a 3d/4d manner. However, since they don't share the same z coordinate values, loading them together won't work out of the box. A workaround would be to load each measurement separately and then merge them with custom logic.
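One possible shape for that post-load merge logic, sketched with plain NumPy (the arrays and depth values here are made up for illustration): align each (z, y, x) array onto the union of the z coordinates, NaN-filling the levels a measurement doesn't have.

```python
import numpy as np

# Hypothetical post-load state: two measurements loaded separately,
# each a (z, y, x) array with its own depth coordinate values.
m2_z, m2 = [1, 2, 3], np.ones((3, 2, 2))
m3_z, m3 = [2, 4, 6, 8], np.full((4, 2, 2), 2.0)

def merge_on_z(arrays):
    """Align (z, y, x) arrays onto the union of their z coords, NaN-filled."""
    union = sorted(set().union(*(z for z, _ in arrays)))
    out = []
    for z, arr in arrays:
        merged = np.full((len(union),) + arr.shape[1:], np.nan)
        for i, zv in enumerate(z):
            merged[union.index(zv)] = arr[i]
        out.append(merged)
    return union, out

union, (m2_full, m3_full) = merge_on_z([(m2_z, m2), (m3_z, m3)])
assert union == [1, 2, 3, 4, 6, 8]
assert m2_full.shape == m3_full.shape == (6, 2, 2)
```

After alignment the measurements share a common z axis and can sit in one dataset, at the cost of NaN padding wherever a measurement has no value for a given depth.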

petewa commented 4 years ago

I very much like the idea of a generic group-by to n-dimensions, but probably out of scope for this proposal.

As part of this proposal we are working on an ODC Zarr driver to support nD storage, with the relevant changes to ODC core. One visible change to ODC core is extending dc.load to be able to query on all dimensions.

If this is your desired format: (we are aiming for something similar to this)

    measurement1 - coords: t, y, x
    measurement2 - coords: t, z, y, x (z: 1, 2, 3)
    measurement3 - coords: t, z1, y, x (z1: 2, 4, 6, 8)

The 3D Zarr driver should support this (assuming you are ok with the Zarr format; if not, a different storage driver will have to be written).

measurement1, measurement2, and measurement3 will be three separate products, because of the differences in their non-(y, x) dimensions.

You can query by any dimension (t, z, y, x) extent.
You can do a dc.load for each of measurement1, measurement2, measurement3.
Group-by will still be on time (for now, until #643).

If you want all 3 measurements as a single product, you'll need to do some custom merging post-load.

We are aiming to release the 2D Zarr driver by the end of August, and extending the driver to support 3D in early Sep.

Maybe wait until you see the 2D Zarr driver and we can discuss further? It may answer some of your questions.

Once the 3D Zarr driver is done, it shouldn't be too hard to extend it to support more storage formats if required. It mostly needs to be able to store a numpy nD array style object.
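To illustrate the "numpy nD array style object" point, here is a toy in-memory stand-in for what such a storage backend minimally has to provide; the class and method names are invented for this sketch and are not the actual driver interface:

```python
import numpy as np

class InMemoryNDStore:
    """Toy stand-in for an nD storage backend (Zarr, netCDF, ...): all a
    driver fundamentally needs is windowed reads from a labelled nD array."""

    def __init__(self, data, dims):
        self.data = np.asarray(data)
        self.dims = tuple(dims)  # e.g. ("time", "z", "y", "x")

    def read(self, **selection):
        """Read a hyper-slab, e.g. read(z=slice(0, 2)); unspecified dims are whole."""
        index = tuple(selection.get(d, slice(None)) for d in self.dims)
        return self.data[index]

store = InMemoryNDStore(np.arange(2 * 3 * 4 * 4).reshape(2, 3, 4, 4),
                        dims=("time", "z", "y", "x"))
block = store.read(z=slice(0, 2))
assert block.shape == (2, 2, 4, 4)
```

Any format that can serve such labelled windowed reads (chunked, ideally) could slot in behind the same driver interface, which is why extending beyond Zarr is described above as not too hard.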

snowman2 commented 4 years ago

Interesting :thinking: . I haven't used Zarr yet. Sounds like it might be worthwhile to give it a try.

petewa commented 4 years ago

Hi @snowman2

2D Zarr driver is released: https://dev.azure.com/csiro-easi/easi-hub-public/_git/datacube-zarr

Please let me know what you think and if it fits or doesn't fit with your use case.

We are starting work on the 3D Zarr driver and aiming to complete it by December 2020.

ghansham commented 4 years ago

Recently we wrote netCDF files using the common data model for hyperspectral datasets, which are compatible with xarray and GDAL. I can share a sample NcML (an XML extension for netCDF) for such a dataset. I also have satellite sounding datasets with a pressure dimension. Even model outputs could act as a good starting point. Unidata THREDDS supports these kinds of datasets, and they can be accessed via OPeNDAP URLs. For Zarr we may need to work something out.

snowman2 commented 4 years ago

@petewa, thanks for sharing. It will be a while before I can try it out, but I definitely would like to 👍

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.