opendatacube / datacube-core

Open Data Cube analyses continental scale Earth Observation data through time
http://www.opendatacube.org
Apache License 2.0

Average resampling in dc.load produces misleading results and lost data at coarse resolutions #779

Open robbibt opened 4 years ago

robbibt commented 4 years ago

Expected behaviour

When using dc.load() to load high-resolution data (e.g. 25 m) at a coarse resolution (e.g. 5000 m), I expect to be able to use the resampling='average' parameter to return a low-resolution output where each pixel has the average value of all contributing pixels in the native high-res data.

For example, I would like hi-res data that looks like this: [screenshot: high-resolution frequency raster]

...to be resampled using resampling='average' to look like the image below. This is what is generated on datacube version 1.6.1: [screenshot: averaged low-resolution raster, datacube 1.6.1]

Actual behaviour

However, on datacube version 1.7, the aggregated low-res map that is returned is patchy, with a large amount of lost data. This result is non-intuitive for anyone expecting a raster resampled in a similar way to GDAL/Rasterio/common GIS software, and limits the usability of the resampled data.

[screenshot: patchy low-resolution raster with large areas of missing data, datacube 1.7]

Steps to reproduce the behaviour

import datacube
dc = datacube.Datacube(app='Agg test')

# Load a 25 m product at a much coarser 5000 m resolution, requesting
# 'average' resampling so each output pixel should aggregate all
# contributing native pixels.
query = {'x': (143.229842, 145.027985),
         'y': (-30.376446, -31.800706),
         'time': ('2011', '2011'),
         'output_crs': 'EPSG:3577',
         'resampling': 'average',
         'resolution': (-5000, 5000)}  # metres, in the output CRS

wofs_ds = dc.load(product='wofs_annual_summary', 
                  measurements=['frequency'], 
                  **query)

wofs_ds.isel(time=0).frequency.plot(size=6, vmin=0, vmax=0.2)
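One possible workaround (not from the thread; a sketch that assumes the product's native resolution is 25 m, so one 5000 m output pixel covers a 200 x 200 block of native pixels, and that the query above fits in memory) is to load at native resolution and average in memory with xarray's coarsen():

native = dc.load(product='wofs_annual_summary',
                 measurements=['frequency'],
                 x=(143.229842, 145.027985),
                 y=(-30.376446, -31.800706),
                 time=('2011', '2011'),
                 output_crs='EPSG:3577',
                 resolution=(-25, 25))  # native 25 m pixels

# Average every 200 x 200 block of native pixels into one 5000 m pixel;
# mask nodata first if the product uses a fill value rather than NaN.
coarse = native.frequency.coarsen(x=200, y=200, boundary='trim').mean()
coarse.isel(time=0).plot(size=6, vmin=0, vmax=0.2)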

Environment information

datacube --version: 1.7 (on the NCI)

Kirill888 commented 4 years ago

This is due to a change, starting from the 1.7 release, to use a decimating read followed by a re-projection; the feature was added to support reads from overview images.

Data loading happens roughly like this within datacube (this is for one time slice, ignoring multi-source fusing for simplicity):

  1. Datacube figures out what region of the source image to read and at what scale.
  2. Datacube asks the IO driver to read that part (native projection, some cropping with a scale change):
    • the IO driver either reads an overview image, followed by further sub-sampling if needed,
    • OR reads native-resolution data followed by sub-sampling (the problem is here; see the sketch below).
  3. Datacube performs a memory-to-memory projection/rotation/translation change as needed, with the user-configured resampling strategy, to deliver the final requested image.
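Roughly, the two stages look like the following plain-rasterio sketch (an illustration only, not the actual datacube driver code; the file name and shrink factor are made up):

import numpy as np
import rasterio
from affine import Affine
from rasterio.enums import Resampling
from rasterio.warp import calculate_default_transform, reproject

with rasterio.open('wofs_tile.nc') as src:  # hypothetical source file
    factor = 200  # e.g. 25 m native pixels -> 5000 m requested pixels

    # Stage 2: decimated read in the native projection. With no overviews and
    # no resampling argument, rasterio/GDAL decimates with nearest neighbour,
    # silently dropping most of the input pixels.
    decimated = src.read(1, out_shape=(src.height // factor, src.width // factor))
    decimated_transform = src.transform * Affine.scale(
        src.width / decimated.shape[1], src.height / decimated.shape[0])

    # Stage 3: memory-to-memory reprojection onto the requested grid, using the
    # user-configured resampling ('average' here). By this point most of the
    # information has already been discarded by the nearest-neighbour read.
    dst_crs = 'EPSG:3577'
    dst_transform, dst_w, dst_h = calculate_default_transform(
        src.crs, dst_crs, decimated.shape[1], decimated.shape[0], *src.bounds)
    dst = np.empty((dst_h, dst_w), dtype=decimated.dtype)
    reproject(decimated, dst,
              src_transform=decimated_transform, src_crs=src.crs,
              dst_transform=dst_transform, dst_crs=dst_crs,
              resampling=Resampling.average)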

The problem is in step 2, when the image contains no overviews (which is the case for NetCDF files). Downsampling by the rasterio driver uses nearest by default, but the most reasonable approach when downsampling by a large amount is to use average for numeric data and mode for categorical data.
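For reference, rasterio already supports this at the read stage; in the sketch above the fix would just be a resampling argument on the decimated read (again with a made-up file name):

import rasterio
from rasterio.enums import Resampling

with rasterio.open('wofs_tile.nc') as src:  # hypothetical file
    # Decimating read that averages all contributing pixels instead of the
    # default nearest-neighbour pick; Resampling.mode would be the
    # categorical analogue.
    data = src.read(1,
                    out_shape=(src.height // 200, src.width // 200),
                    resampling=Resampling.average)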

Right now this is not possible, since the Datacube <> IO driver interface does not include a resampling mode parameter as part of the read interface. This is a deficiency of the interface that requires a breaking change to address.

Ideally the user should be able to configure both (2) and (3); currently only the final reprojection is configured with a resampling mode.

Note that average and mode are the only two methods that operate on the entire input image; other modes like cubic|bilinear|lanczos only sample a few pixels around the re-projected point, so you usually don't want those when the shrink factor is large (>>2), yet they make perfect sense for scale changes <2. So it is important to decouple the shrinking resampling configuration from the final resampling configuration, as the toy example below illustrates.
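A toy NumPy illustration (not from the thread) of why point-sampling resamplers lose sparse data at large shrink factors: reducing each 200 x 200 block to one pixel by nearest keeps a single arbitrary value, while averaging keeps the block statistic.

import numpy as np

rng = np.random.default_rng(0)

# A 1000 x 1000 "water frequency" raster that is mostly zero with a few
# small wet patches, roughly how a WOfS summary looks at native resolution.
hires = np.zeros((1000, 1000))
for _ in range(20):
    r, c = rng.integers(0, 950, size=2)
    hires[r:r + 30, c:c + 30] = 1.0

factor = 200  # 25 m -> 5000 m

# Nearest-style decimation: keep one pixel per 200 x 200 block.
nearest = hires[::factor, ::factor]

# Average-style aggregation: mean of each 200 x 200 block.
blocks = hires.reshape(1000 // factor, factor, 1000 // factor, factor)
average = blocks.mean(axis=(1, 3))

print("wet fraction preserved by nearest:", nearest.mean())
print("wet fraction preserved by average:", average.mean())
# With sparse features, nearest frequently misses them entirely (the "patchy"
# result in this issue), while average preserves the overall wet fraction.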

robbibt commented 4 years ago

I just came across this problem again recently after getting some very unexpected results when resampling coarsely. Is this potentially an issue that could be within the scope of ODC 2.0? I still think the current actual behaviour is very unintuitive for anyone familiar with resampling data in GDAL/Rasterio/common GIS software.

Kirill888 commented 4 years ago

@robbibt at the moment there are no "clean" mechanisms to make any of this happen; 2.0 never starts and gets de-scoped, so... We don't want to start making "dirty workarounds" because 2.0 is "just about to start", but it never does and doesn't look like it ever will.

woodcockr commented 4 years ago

A hint of frustration there me thinks...

As ODC SC Chair I will be helping to herd the ODC community cats towards ODC 2.0, and have flagged this issue as belonging to the ODC 2.0 project triage process. The ODC 1.8 release needs to be finalised first, and we'll also keep the ODC 2.0 updates nice and tight, but since this issue is in the core IO, which is where ODC 2.0 is focussed, it's worth flagging.

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

robbibt commented 3 years ago

This issue is still a key blocker to using coarsely resampled data loaded from ODC; commenting to remove the stale status.