Generalise GroupBy to n-dimensions

Kirill888 commented 5 years ago

There is a step between find_datasets and load_data that transforms an un-ordered sequence of datasets into a structured container of datasets (xr.DataArray )

https://github.com/opendatacube/datacube-core/blob/ca72ac52849c54fc96a1e481a40549bfb151f413/datacube/api/core.py#L339-L352

group_datasets is "configured/parameterised" by this data structure:

https://github.com/opendatacube/datacube-core/blob/ca72ac52849c54fc96a1e481a40549bfb151f413/datacube/api/query.py#L36

Datacube.group_datasets assumes that there is one and only one non-spatial dimension, as a result one can not have "no non-spatial dimensions" or multiple non-spatial dimensions. Note that load_data doesn't make an assumption that there is only one non-spatial dimension.

Also there is a problem with the way group_by_func and sort_key are used, it seems to make an assumption that sort_key will return a higher fidelity version of whatever group_by_func returns (time and solar_day, or time and month).

If, say, one returned ds.center_time from sort_key (which also doubles as "give me value for non-spatial axis") and returned id(ds) from group_by_func things will not give expected result.

Kirill888 commented 5 years ago

Another common need is to "not group at all" (see #615), supporting that requires separating

Grouping of datasets
Assigning non-spatial axis value for a group

I think a better split would be

group_key: Dataset -> Any function that returns the same value for all datasets belonging to the same "group"
axis_value: List[Dataset] -> Tuple[...] that takes a group of datasets and computes axis value for a group

With this there is a problem when axis_value returns the same value for two different groups. So for example if you want to have time as non-spatial dimension, but don't want to group, how do you achieve this? One option is to add "dummy" dimension, other is to support duplicate entries, not sure how well xarray supports that.

emmaai commented 5 years ago

Also there is a problem with the way group_by_func and sort_key are used, it seems to make an assumption that sort_key will return a higher fidelity version of whatever group_by_func returns (time and solar_day, or time and month).

If, say, one returned ds.center_time from sort_key (which also doubles as "give me value for non-spatial axis") and returned id(ds) from group_by_func things will not give expected result.

We need sort_key to be something other than time or uuid, e.g., granule_id for Sentinel 2
Sort rule should return definite results rather than random ones for the same product. Currently uuid is generated randomly, if the product is regenerated, the chance is that the sorted results will be different.
We might need some more control over the fuse func, such as datasets with larger geo area (more valid data) are always precedence, etc.

Kirill888 commented 5 years ago

@emmaai I'm starting to think that maybe order of datasets as returned by group_datasets should not be meaningful. Ultimately it's a load-time concern. If I write a different kind of fuser that overwrites previously loaded pixels instead of current behaviour then I would want group_dataset to sort differently. Or maybe fuser interface is fixed to allow arbitrary reductions like average, then order is irrelevant. Or maybe my datasets are true tiles and there are no overlaps.

Currently order within group is coupled to "axis value", which is arbitrary limitation arising from implementation specifics not from some fundamental property. Also grouping and load order are separate enough, order is more closely related to fuser implementation, it should be fuser property really. I'll create new issue for this.

Regarding your option (3), ultimately order needs to be decided before load starts, maybe not in group_datasets but rather first step of load_data for a given time slice, so you will either need to keep track of more metadata or have to do a double pass over data (slow).

Remember that this is used not just to deduplicate overlapping pixels, but also when mosaicing, you can have thousands of files being merged into one, sometimes without any overlap whatsoever.

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

opendatacube / datacube-core

Generalise GroupBy to n-dimensions #643