Need more control over dataset load precedence

Kirill888 commented 5 years ago

Introduction

Question:

Say I have two datasets with some area of overlap and I load them into one raster (using group_by='solar_day' for example).

Which pixel values should I expect to see in the area of overlap?
What mechanisms of control do I have over which dataset takes precedence, or how pixels are combined?

Answer:

Currently this depends on both:

- group_datasets behaviour, parameterised by the GroupBy object
- load_data behaviour, parameterised by fuser

First, group_datasets doesn't just group datasets, it also orders the datasets within each group. load_data then uses this order to "fuse" one dataset at a time into the final raster. The default fuser never changes an output pixel once a "valid" value has been observed, so in effect dataset order within a group is pixel precedence order. But one can just as easily implement a fuser that overwrites previous pixels with new ones (as long as they are valid), in which case dataset order is the reverse of precedence order.
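To make the order dependence concrete, here is a minimal sketch of the two fuser styles described above. It assumes a fuser is a two-argument function that updates `dest` in place; `NODATA`, `fuse_first_valid`, `fuse_last_valid` and `load_group` are hypothetical names used only for illustration, not datacube API.

```python
import numpy as np

NODATA = -999  # hypothetical nodata value for this sketch


def fuse_first_valid(dest, src):
    """Keep pixels already set in dest; fill only where dest is still nodata."""
    fill = (dest == NODATA) & (src != NODATA)
    np.copyto(dest, src, where=fill)


def fuse_last_valid(dest, src):
    """Overwrite dest wherever src has a valid pixel."""
    np.copyto(dest, src, where=(src != NODATA))


def load_group(rasters, fuser):
    """Fuse rasters one at a time, in the given order, into a new array."""
    out = np.full_like(rasters[0], NODATA)
    for src in rasters:
        fuser(out, src)
    return out


# Two "datasets" that overlap only in the middle pixel.
a = np.array([1, 1, NODATA])
b = np.array([NODATA, 2, 2])

first = load_group([a, b], fuse_first_valid)  # a wins in the overlap
last = load_group([a, b], fuse_last_valid)    # b wins in the overlap
```

With the same input order, the overlapping pixel comes from `a` under the first-valid fuser and from `b` under the last-valid fuser, so precedence can only be reasoned about when fuser and order are considered together.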

So what is the order of datasets returned by group_datasets, and can I control it? Well, kinda, but not really. Say you wanted to order datasets by some metadata, like gqa score or granule_id.

Can I instruct group_datasets to group by solar day, but order by gqa? No. Order within a group is tightly coupled to axis value.

Can I write a custom fuser that prefers pixels from one dataset over another? No. The fuser only sees pixels; it doesn't have access to dataset metadata.

You can write code that modifies the order of datasets after group_datasets has been called, but that's about it. This forces you to use find_datasets -> group_datasets -> custom step -> load_data instead of just parameterising .load.
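The "custom step" in that workaround could look roughly like the sketch below. It is only an illustration: the groups are modelled as a plain dict of tuples rather than the xarray structure that group_datasets returns, and `reorder_groups` plus the `gqa` attribute on the stand-in dataset objects are hypothetical.

```python
from types import SimpleNamespace


def reorder_groups(groups, key, reverse=False):
    """Return a copy of groups with each group's datasets sorted by key."""
    return {
        label: tuple(sorted(datasets, key=key, reverse=reverse))
        for label, datasets in groups.items()
    }


# Hypothetical stand-ins for dataset objects carrying some metadata.
ds_a = SimpleNamespace(id="a", gqa=0.8)
ds_b = SimpleNamespace(id="b", gqa=0.3)

# One solar-day group, in whatever order the grouping step produced.
groups = {"2019-03-01": (ds_a, ds_b)}

# Custom step: put the dataset with the lowest gqa value first.
by_gqa = reorder_groups(groups, key=lambda ds: ds.gqa)
```

The reordered groups would then be handed to the loading step, which is exactly the extra plumbing this issue is trying to avoid.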

Change Proposal

1. Remove ordering from group_datasets area of responsibility
   - Group is just an unordered collection of dataset objects
2. Configure dataset load precedence at load_data time rather than group_datasets time
3. Allow a "don't care about order" option, for when you know you don't have overlaps in the data, or just don't care about each run being exactly the same.

This allows the fuser and the dataset order to be configured together, which is important since they are related.
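One possible shape for this, purely as a sketch: the load step receives an unordered group, a fuser, and an optional precedence key, and `None` means "don't care about order". Everything here (`Dataset`, `load_group`, `first_valid`, the `precedence` parameter) is hypothetical and not an existing or agreed datacube interface; pixels are plain tuples with `None` standing in for nodata.

```python
from typing import NamedTuple, Optional, Tuple


class Dataset(NamedTuple):
    """Hypothetical stand-in for a dataset: metadata plus its pixels."""
    name: str
    gqa: float
    pixels: Tuple[Optional[int], ...]


def first_valid(dest, src):
    """Fuser that keeps existing valid pixels and fills the gaps from src."""
    return [s if d is None else d for d, s in zip(dest, src)]


def load_group(datasets, read, fuser, precedence=None):
    """Fuse an unordered collection of datasets into one raster.

    precedence is an optional sort key; None means any order is acceptable.
    """
    ordered = list(datasets) if precedence is None else sorted(datasets, key=precedence)
    out = None
    for ds in ordered:
        pixels = read(ds)
        out = list(pixels) if out is None else fuser(out, pixels)
    return out


# The group itself carries no order: it is just a set of datasets.
group = frozenset({
    Dataset("a", 0.8, (1, 1, None)),
    Dataset("b", 0.3, (None, 2, 2)),
})

# Precedence and fuser are chosen together at load time.
low_gqa_first = load_group(
    group, read=lambda ds: ds.pixels, fuser=first_valid,
    precedence=lambda ds: ds.gqa,
)
high_gqa_first = load_group(
    group, read=lambda ds: ds.pixels, fuser=first_valid,
    precedence=lambda ds: -ds.gqa,
)
```

Because the fuser and the precedence key arrive in the same call, the caller can state intent ("lowest gqa wins") without having to know how the grouping step happened to order things.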

Related issues: #643 #646 #615

emmaai commented 5 years ago

> Can I instruct group_datasets to group by solar day, but order by gqa? No. Order within a group is tightly coupled to axis value.

I don't understand this part. group_datasets returns an xarray of datasets indexed by time. Could the data of that xarray be an ordered dictionary indexed by particular metadata, say gqa or granule_id?

> 2. Configure dataset load precedence at `load_data` time rather than `group_datasets` time

Can live with this.

Kirill888 commented 5 years ago

@emmaai xarray axis ordering will remain as is, so your time axis will still be ordered by time. What changes is the interpretation of the "value" within this xarray.DataArray. Currently the value is a tuple of dataset objects, where order within the tuple has meaning as far as data loading goes. I want it to be just a tuple of dataset objects where order is meaningless, or we can change it to a set of dataset objects to clearly communicate that this is an unordered collection.

Kirill888 commented 5 years ago

Look at, say, the pandas docs for groupby, which I assume group_datasets was originally based on:

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html

Groupby preserves the order of rows within each group.

Order within a group is the order of the input, so if you had A0, B1, A1, B2 and grouped by letter, you would end up with A0, A1, B1, B2, not because 0 < 1 but because A0 came before A1 in the original list. We should probably copy that behaviour for group_datasets, and re-order items within a group at load time.
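The order-preserving grouping described here can be sketched in a few lines of plain Python. `group_preserving_order` is a hypothetical helper for illustration, not pandas or datacube code.

```python
def group_preserving_order(items, key):
    """Group items by key(item), keeping input order within each group."""
    groups = {}
    for item in items:
        groups.setdefault(key(item), []).append(item)
    return groups


items = ["A0", "B1", "A1", "B2"]
groups = group_preserving_order(items, key=lambda s: s[0])

# Reversing the input shows that order comes from position, not from sorting.
reversed_groups = group_preserving_order(items[::-1], key=lambda s: s[0])
```

Grouping the original list gives A0 before A1, while grouping the reversed list gives A1 before A0, which is the "order of input" behaviour rather than a sort on the item values.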

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

robbibt commented 4 years ago

Better control over data load precedence would still be a valuable feature to have in ODC core, particularly when combining multiple products using a list of datasets.

opendatacube / datacube-core

Need more control over dataset load precedence #671
