Open Kirill888 opened 5 years ago
Another common need is to "not group at all" (see #615), supporting that requires separating
I think a better split would be:

- `group_key: Dataset -> Any`, a function that returns the same value for all datasets belonging to the same "group"
- `axis_value: List[Dataset] -> Tuple[...]`, a function that takes a group of datasets and computes the axis value for that group

With this there is a problem when `axis_value` returns the same value for two different groups. For example, if you want to have time as a non-spatial dimension but don't want to group, how do you achieve this? One option is to add a "dummy" dimension; another is to support duplicate entries, though I'm not sure how well xarray supports that.
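To make the proposed split concrete, here is a minimal sketch. The `Dataset` class and this `group_datasets` function are hypothetical stand-ins written for illustration, not the datacube API; only the names `group_key` and `axis_value` come from the proposal above. It also shows the duplicate axis value problem in the "don't group" case.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime
from typing import Any, Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Dataset:
    """Hypothetical stand-in for a datacube Dataset (illustration only)."""
    id: str
    center_time: datetime


def group_datasets(
    datasets: List[Dataset],
    group_key: Callable[[Dataset], Any],
    axis_value: Callable[[List[Dataset]], Tuple[Any, ...]],
) -> List[Tuple[Tuple[Any, ...], List[Dataset]]]:
    """Bucket datasets by group_key, then label each bucket with axis_value."""
    buckets: Dict[Any, List[Dataset]] = defaultdict(list)
    for ds in datasets:
        buckets[group_key(ds)].append(ds)
    labelled = [(axis_value(group), group) for group in buckets.values()]
    # Order groups along the axis; the sort is stable, so ties keep input order.
    return sorted(labelled, key=lambda pair: pair[0])


datasets = [
    Dataset("a", datetime(2020, 1, 1, 10, 0)),
    Dataset("b", datetime(2020, 1, 1, 10, 0)),  # same time as "a"
    Dataset("c", datetime(2020, 1, 2, 10, 0)),
]

# Group by calendar day: the axis value is the earliest time in each group.
by_day = group_datasets(
    datasets,
    group_key=lambda ds: ds.center_time.date(),
    axis_value=lambda group: (min(ds.center_time for ds in group),),
)

# "Do not group at all": every dataset is its own group, the axis is still time.
ungrouped = group_datasets(
    datasets,
    group_key=lambda ds: ds.id,
    axis_value=lambda group: (group[0].center_time,),
)
axis = [value for value, _ in ungrouped]
has_duplicates = len(set(axis)) < len(axis)  # "a" and "b" share one axis value
```

In the ungrouped case `a` and `b` end up as separate groups with identical axis values, which is exactly the situation where either a dummy dimension or duplicate coordinate entries would be needed.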
Also there is a problem with the way `group_by_func` and `sort_key` are used: it seems to assume that `sort_key` will return a higher fidelity version of whatever `group_by_func` returns (time and solar_day, or time and month).

If, say, one returned `ds.center_time` from `sort_key` (which also doubles as "give me value for non-spatial axis") and returned `id(ds)` from `group_by_func`, things will not give the expected result.
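The coupling can be illustrated with a simplified model. This sketch assumes grouping works as "sort by `sort_key`, group adjacent items with `group_by_func`, label each group with the `sort_key` of its first member"; that is an assumption made for illustration, not a copy of the datacube implementation, and the `Dataset` tuple and `group_sorted` helper are hypothetical.

```python
from collections import namedtuple
from datetime import datetime
from itertools import groupby

# Hypothetical stand-in for a dataset record, for illustration only.
Dataset = namedtuple("Dataset", ["id", "center_time", "region"])


def group_sorted(datasets, group_by_func, sort_key):
    """Assumed model: sort, group adjacent items, label with first item's sort_key."""
    ordered = sorted(datasets, key=sort_key)
    return [
        (sort_key(members[0]), [ds.id for ds in members])
        for members in (list(g) for _, g in groupby(ordered, key=group_by_func))
    ]


datasets = [
    Dataset("a", datetime(2020, 1, 1), "east"),
    Dataset("b", datetime(2020, 1, 2), "west"),
    Dataset("c", datetime(2020, 1, 3), "east"),
]

# Consistent pair: the group key (month) is a coarser version of the sort key (time),
# so sorting by time keeps members of the same group adjacent.
by_month = group_sorted(
    datasets,
    group_by_func=lambda ds: (ds.center_time.year, ds.center_time.month),
    sort_key=lambda ds: ds.center_time,
)

# Inconsistent pair: the group key is unrelated to the sort key, so sorting by time
# interleaves the groups and "east" is split into two separate groups.
by_region = group_sorted(
    datasets,
    group_by_func=lambda ds: ds.region,
    sort_key=lambda ds: ds.center_time,
)
```

Under this model `by_month` contains a single group, while `by_region` contains three groups instead of the two one would expect. Similarly, grouping by `id(ds)` gives one group per dataset, and any two datasets with equal `center_time` then receive the same axis value.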
`sort_key` to be something other than `time` or `uuid`, e.g., `granule_id` for Sentinel 2

`uuid` is generated randomly; if the product is regenerated, chances are that the sorted results will be different.

@emmaai I'm starting to think that maybe the order of datasets as returned by `group_datasets` should not be meaningful. Ultimately it's a load-time concern. If I write a different kind of `fuser` that overwrites previously loaded pixels instead of the current behaviour, then I would want `group_datasets` to sort differently. Or maybe the fuser interface is fixed to allow arbitrary reductions like average, and then order is irrelevant. Or maybe my datasets are true tiles and there are no overlaps.
Currently order within a group is coupled to "axis value", which is an arbitrary limitation arising from implementation specifics, not from some fundamental property. Grouping and load order are also separate concerns: order is more closely related to the fuser implementation, so it should really be a property of the fuser. I'll create a new issue for this.
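A small sketch of why ordering belongs with the fuser rather than with grouping. The three fusers below are hypothetical pure-Python illustrations (they are not the datacube fuser interface); pixels are plain lists with `None` marking "no data".

```python
from typing import List, Optional, Sequence

Pixels = List[Optional[float]]  # None marks "no data"


def fuse_first_valid(layers: Sequence[Pixels]) -> Pixels:
    """Keep the first valid value per pixel; later layers only fill gaps."""
    out: Pixels = [None] * len(layers[0])
    for layer in layers:
        for i, value in enumerate(layer):
            if out[i] is None and value is not None:
                out[i] = value
    return out


def fuse_last_valid(layers: Sequence[Pixels]) -> Pixels:
    """Overwrite previously loaded pixels with any later valid value."""
    out: Pixels = [None] * len(layers[0])
    for layer in layers:
        for i, value in enumerate(layer):
            if value is not None:
                out[i] = value
    return out


def fuse_mean(layers: Sequence[Pixels]) -> Pixels:
    """Average all valid values per pixel; an order-independent reduction."""
    out: Pixels = []
    for values in zip(*layers):
        valid = [v for v in values if v is not None]
        out.append(sum(valid) / len(valid) if valid else None)
    return out


tile_a: Pixels = [1.0, 2.0, None]
tile_b: Pixels = [None, 4.0, 6.0]

first_ab = fuse_first_valid([tile_a, tile_b])  # overlap pixel comes from tile_a
first_ba = fuse_first_valid([tile_b, tile_a])  # overlap pixel comes from tile_b
last_ab = fuse_last_valid([tile_a, tile_b])
mean_ab = fuse_mean([tile_a, tile_b])
mean_ba = fuse_mean([tile_b, tile_a])
```

The "first valid" and "last valid" fusers want opposite orderings to produce the same result, while the mean does not care about order at all, and with no overlap every fuser agrees regardless of order.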
Regarding your option (3): ultimately order needs to be decided before the load starts, maybe not in `group_datasets` but rather as the first step of `load_data` for a given time slice, so you will either need to keep track of more metadata or do a double pass over the data (slow).

Remember that this is used not just to deduplicate overlapping pixels but also when mosaicing: you can have thousands of files being merged into one, sometimes without any overlap whatsoever.
There is a step between `find_datasets` and `load_data` that transforms an un-ordered sequence of datasets into a structured container of datasets (`xr.DataArray`):

https://github.com/opendatacube/datacube-core/blob/ca72ac52849c54fc96a1e481a40549bfb151f413/datacube/api/core.py#L339-L352

`group_datasets` is "configured/parameterised" by this data structure:

https://github.com/opendatacube/datacube-core/blob/ca72ac52849c54fc96a1e481a40549bfb151f413/datacube/api/query.py#L36
`Datacube.group_datasets` assumes that there is one and only one non-spatial dimension; as a result one cannot have "no non-spatial dimensions" or multiple non-spatial dimensions. Note that `load_data` doesn't make an assumption that there is only one non-spatial dimension.