opendatacube / odc-geo

GeoBox and geometry utilities extracted from datacube-core
https://odc-geo.readthedocs.io/en/latest/
Apache License 2.0
83 stars 15 forks source link

Issue with cropped `xarray` data causing `xarray`'s `.where()` to behave differently #115

Closed robbibt closed 10 months ago

robbibt commented 11 months ago

While testing #114, I've been running into a really confusing issue that seems to cause xarray's .where to change the output size of my arrays.

Usually, a.where(b) simply masks out values from a that are False in b, replacing them with NaN. This normally doesn't have any effect on the shape of a (assuming the default drop=False) - you get back an array of the orginal shape after applying the mask.

For example, I might want to rasterize a polygon, then apply this to my original data as a mask. This works fine - my masked data comes out the same size as my original data:

import odc.geo.xr
from odc.geo.geobox import GeoBox
from odc.geo import geom

# Create xarray
geobox = GeoBox.from_bbox((-10, -2, 5, 4), "EPSG:4326", resolution=0.2)
xx = odc.geo.xr.xr_zeros(geobox)
print(f"Original shape is: {xx.odc.geobox.shape}")

# Create polygon overlapping xarray
poly = geom.Geometry(
    {"type": "Polygon", "coordinates": (((-5, 3), (0, 2), (-2.5, -1)),)}, "EPSG:4326"
)

# Rasterize polygon
rasterized = odc.geo.xr.rasterize(
    poly=poly.to_crs(xx.odc.geobox.crs), how=xx.odc.geobox
)
print(f"Rasterized shape is: {rasterized.odc.geobox.shape}")

# Apply mask
xx_masked = xx.where(rasterized)
print(f"Masked shape is: {xx_masked.odc.geobox.shape}")

# Verify that original data and cropped data have same geobox
assert xx.odc.geobox == xx_masked.odc.geobox

image

Problem

However, if I modify the shape of my array using .isel() or similar, .where() starts to behave differently. This example is the same, except I select a subset of my array using .isel() before rasterizing my polygon into its new geobox.

The rasterization works fine - my rasterized polygon and clipped array have the same geobox and shape. However, when I attempt to combine them with xx_cropped.where(rasterized), I get out an array with a different geobox, including different resolution and shape:

import odc.geo.xr
from odc.geo.geobox import GeoBox
from odc.geo import geom

# Create xarray
geobox = GeoBox.from_bbox((-10, -2, 5, 4), "EPSG:4326", resolution=0.2)
xx = odc.geo.xr.xr_zeros(geobox)
print(f"Original shape is: {xx.odc.geobox.shape}")

# Create polygon overlapping xarray
poly = geom.Geometry(
    {"type": "Polygon", "coordinates": (((-5, 3), (0, 2), (-2.5, -1)),)}, "EPSG:4326"
)

# Select subset
xx_cropped = xx.isel(latitude=slice(10, 20), longitude=slice(20, 50))
print(f"Cropped shape is: {xx_cropped.odc.geobox.shape}")

# Rasterize polygon
rasterized = odc.geo.xr.rasterize(
    poly=poly.to_crs(xx_cropped.odc.geobox.crs), how=xx_cropped.odc.geobox
)
print(f"Rasterized shape is: {rasterized.odc.geobox.shape}")

# Verify that cropped and rasterized data have the same geobox
assert rasterized.odc.geobox == xx_cropped.odc.geobox

# Apply mask
xx_masked = xx_cropped.where(rasterized)
print(f"Masked shape is: {xx_masked.odc.geobox.shape}")

# Verify that masked data and cropped data have same geobox
assert xx_cropped.odc.geobox == xx_masked.odc.geobox

image

Am I doing something silly here? Or is there some interaction going on between xarray and odc-geo that is causing .where to behave differently on data that has previously subsetted with .isel()?

The confusing thing is that xx_cropped and rasterized seem to have completely identical pixel grids, i.e. same geobox and even the same spatial dim coordinates...

robbibt commented 11 months ago

Comparison of the arrays... somehow xx_cropped and rasterized combine to make xx_masked... image

Kirill888 commented 11 months ago

that would be due to floating point precision most likely.

In theory we should have an invariant like this

xr_coords(gbox)['y'][3:7].data == xr_coord(gbox[3:7, :])['y'].data

but this might not be the case due to numeric precision.

Kirill888 commented 11 months ago

changing

- xx_masked = xx_cropped.where(rasterized)
+ xx_masked = xx_cropped.where(rasterized.data)

gives expected result as it ignores tiny differences in lon/lat:

rasterized.latitude.data - xx_cropped.latitude.data
>>> array([ 0.00000000e+00,  2.22044605e-16,  2.22044605e-16,  0.00000000e+00,
        0.00000000e+00, -2.22044605e-16,  0.00000000e+00,  0.00000000e+00,
       -2.22044605e-16,  0.00000000e+00])
Kirill888 commented 11 months ago

in general, when using xarray.where() use plain numpy array that matches data dimensions to be sure to bypass coordinate matching logic.

robbibt commented 10 months ago

Thanks Kirill!