mathause opened this issue 9 months ago
https://github.com/xarray-contrib/cf-xarray/pull/478/files has some tricks :)
I think a way to tremendously speed this up is to first create a mask with the all_touched=True option and then only check grid cells that "touch" a region. E.g. the ar6.all regions have a density of about 1.8% for a global 1° grid, so instead of checking 3.7M grid cell-region combinations we'd only need to check 70k. This only works for equally-spaced grids, though (maybe test shapely.overlaps for non-regular grids).
As always I am totally over-engineering this. I guess most users would actually be ok with a ~20 s wait...
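Roughly, the two-stage idea could look like the sketch below (not working regionmask code; it assumes we already have a shapely polygon per grid cell and a rasterio affine transform for the grid, and the helper name is made up):

```python
import numpy as np
import rasterio.features as rio_features
import shapely


def fractional_mask_sparse(region_polygon, cell_polygons, transform):
    """Exact area fractions, computed only for cells that touch the region.

    cell_polygons: 2D object array of shapely Polygons, one per grid cell (assumed).
    transform: rasterio affine transform of the grid (assumed).
    """
    ny, nx = cell_polygons.shape

    # stage 1 (cheap): rasterize with all_touched=True to find candidate cells
    touched = rio_features.geometry_mask(
        [region_polygon],
        out_shape=(ny, nx),
        transform=transform,
        all_touched=True,
        invert=True,
    )

    # stage 2 (expensive): exact intersection, but only for the ~2% of touched cells
    frac = np.zeros((ny, nx))
    cells = cell_polygons[touched]
    frac[touched] = shapely.area(shapely.intersection(cells, region_polygon)) / shapely.area(cells)
    return frac
```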
I did something like that in flox recently: https://github.com/xarray-contrib/flox/blob/bf936f526d7b9cd10c6632f9bf4ef98f18055d6f/flox/core.py#L253-L273
But I was also assuming that shapely.overlaps had a fast path for that case :P. I've had good feedback from Martin F., so if you opened a PR branch with an example, we might get good suggestions.
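For non-regular grids, a spatial index might serve as the cheap candidate filter instead. A minimal sketch using shapely's STRtree (just an illustration, not regionmask API; the helper name is made up):

```python
import shapely


def candidate_cells(region_polygon, cell_polygons):
    """Indices of grid-cell polygons that intersect the region.

    cell_polygons: 1D array/list of shapely Polygons (assumed). Only these
    candidates then need the exact (and expensive) area-of-intersection step.
    """
    tree = shapely.STRtree(cell_polygons)
    return tree.query(region_polygon, predicate="intersects")
```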
Proof-of-concept code (not nice yet, and not yet fully functional); requires xarray-contrib/cf-xarray#478.
```python
import cf_xarray
import matplotlib.pyplot as plt
import numpy as np
import shapely
import xarray as xr

import regionmask


def mask_3D_frac_exact(ds, region_polygons, numbers):
    bounds_polygons = cf_xarray.geometry.grid_to_polygons(ds)
    shapely.prepare(bounds_polygons.values)

    intersection = shapely.intersection(
        bounds_polygons.values.flatten(), np.atleast_2d(region_polygons).T
    )

    bounds_area = shapely.area(bounds_polygons.values.flatten())
    area_fraction = shapely.area(intersection) / bounds_area
    area_fraction = area_fraction.reshape(-1, *bounds_polygons.shape)

    mask = regionmask.core.mask.mask_to_dataarray(area_fraction, ds.lon, ds.lat)
    mask = regionmask.core.mask._3D_to_3D_mask(mask, numbers, drop=False)

    return mask


# ========================================================
# example

air = xr.tutorial.open_dataset("air_temperature")
air = air.cf.add_bounds(["longitude", "latitude"])

region_polygons = np.array(regionmask.defined_regions.ar6.land.polygons)
numbers = regionmask.defined_regions.ar6.land.numbers

mask = mask_3D_frac_exact(air, region_polygons, numbers)

# for plotting: wrap longitudes to [-180, 180)
lon = mask.lon.values
lon[lon >= 180] -= 360
mask = mask.assign_coords(lon=lon)

fg = mask.isel(region=slice(None, 6)).plot(col="region", col_wrap=3, cmap="Reds")

for ax in fg.axs.flatten():
    regionmask.defined_regions.ar6.land.plot_regions(ax=ax)

plt.savefig("exact_overlap.png", dpi=300, bbox_inches="tight")
```
FWIW I was playing around with
```python
import pandas as pd
from geopandas import GeoDataFrame


def estimate_fractional_weights(source: GeoDataFrame, dest: GeoDataFrame):
    import xvec

    # assert dest.columns == ["geometry"]  # ???
    columns = set(dest.columns)
    assert len(columns) == 2
    assert "geometry" in columns
    # other is a nice string label for the geometry.
    (other,) = columns - {"geometry"}
    assert source.crs == dest.crs

    # union any duplicate geometries
    dest = dest.dissolve(other).reset_index()
    # TODO: always a good idea?
    dest[other] = pd.Categorical(dest[other])

    # calculate overlap geometries
    overlay = source.overlay(dest, make_valid=False)
    other_series = overlay[other]
    # display(overlay)

    grid_cell_fraction = overlay.geometry.area.groupby(other_series).transform(
        lambda x: x / x.sum()
    )
    # print(f"sparsity={len(overlay) / (len(grid_df) * len(regions_df))}")
    # display(grid_cell_fraction.groupby(other_series).sum())

    # note: the "latitude"/"longitude" column names of `source` are hard-coded here
    multi_index = pd.MultiIndex.from_frame(overlay[["latitude", "longitude", other]])
    df_weights = GeoDataFrame(
        {"weights": grid_cell_fraction.values, "geometry": overlay.geometry.values},
        index=multi_index,
    )
    df_weights = df_weights.set_crs(dest.crs)

    import xarray as xr

    ds_weights = xr.Dataset(df_weights)
    sparse_weights = ds_weights["weights"].unstack(sparse=True, fill_value=0.0)
    # It would be nice if geometries had names
    sparse_weights.coords["geometry"] = (other, dest.geometry.values)
    # This doesn't pass alignment later :(
    # https://github.com/pydata/xarray/issues/7695
    # sparse_weights = sparse_weights.set_xindex(["geometry"], xvec.GeometryIndex)
    return sparse_weights
```
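A hypothetical usage sketch (grid_df, regions_df, the "name" column, and the "air" variable are made-up names, not anything defined above; grid_df would hold one box polygon per grid cell plus matching "latitude"/"longitude" columns):

```python
import xarray as xr

# grid_df: GeoDataFrame with one box polygon per grid cell and "latitude"/"longitude" columns
# regions_df: GeoDataFrame with the region geometries and a "name" column, same CRS as grid_df
weights = estimate_fractional_weights(grid_df, regions_df)

# contracting the shared latitude/longitude dims gives one weighted value per region
regional_mean = xr.dot(ds["air"], weights)
```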
Based on the discussion here, I wrote a function that I think is worth sharing:
```python
import geopandas as gpd
import numpy as np
import rasterio.features as rio_features
import rioxarray  # noqa: F401 -- registers the .rio accessor
import shapely.geometry as sgeom
import xarray as xr


def area_weighted_agg(
    ds: xr.Dataset | xr.DataArray,
    gdf: gpd.GeoDataFrame | gpd.GeoSeries,
    x_dim: str,
    y_dim: str,
    feature_col: str,
) -> xr.Dataset:
    """Aggregate a dataset within geometries using area-weighted averaging.

    Parameters
    ----------
    ds : xarray.Dataset or xarray.DataArray
        The dataset to aggregate.
    gdf : geopandas.GeoDataFrame or geopandas.GeoSeries
        The geometries to aggregate the dataset within.
    x_dim : str
        The x dimension name in ``ds``.
    y_dim : str
        The y dimension name in ``ds``.
    feature_col : str
        The column name in ``gdf`` that contains the feature ids
        of the geometries. This will be used as a new dimension in
        the resulting dataset.

    Returns
    -------
    xarray.Dataset
        The aggregated dataset.
    """
    if isinstance(ds, xr.DataArray):
        ds = ds.to_dataset(name=ds.name or "value")
    if not isinstance(ds, xr.Dataset):
        raise ValueError("ds must be an xarray.Dataset")
    if not ds.rio.crs:
        raise ValueError("ds must have a valid CRS")
    ds = ds.rio.set_spatial_dims(x_dim=x_dim, y_dim=y_dim)

    if isinstance(gdf, gpd.GeoSeries):
        gdf = gdf.to_frame("geometry")
    if not isinstance(gdf, gpd.GeoDataFrame):
        raise ValueError("gdf must be a GeoDataFrame")
    if not gdf.crs:
        raise ValueError("gdf must have a valid CRS")
    if not gdf.crs.is_projected:
        raise ValueError("gdf must be in a projected CRS")

    # Create a grid of the same shape as the dataset with box geometries
    # and the dataset dims as two columns
    shapes = rio_features.shapes(
        source=np.arange(ds.rio.shape[0] * ds.rio.shape[1])
        .reshape(ds.rio.shape)
        .astype("int32"),
        transform=ds.rio.transform(),
    )
    grid = gpd.GeoDataFrame(geometry=[sgeom.shape(g) for g, _ in shapes], crs=ds.rio.crs)
    y_idx, x_idx = np.unravel_index(grid.index, ds.rio.shape)
    grid[x_dim] = ds.isel({x_dim: x_idx})[x_dim].to_numpy()
    grid[y_dim] = ds.isel({y_dim: y_idx})[y_dim].to_numpy()
    grid = grid.to_crs(gdf.crs)

    # Overlay the grid with the geometries and calculate the fraction of
    # each grid cell that is within each geometry
    overlay = grid.overlay(gdf, make_valid=False)
    frac = (
        overlay.geometry.area.groupby(overlay[feature_col])
        .transform(lambda x: x / x.sum())
        .to_numpy()
    )

    # Create a multi-index from the geometry ids and the dataset coordinates
    multi_index = xr.Coordinates.from_pandas_multiindex(
        overlay.set_index([y_dim, x_dim, feature_col]).index, "dim"
    )
    # Create a sparse array with the fractions and the multi-index
    weights = xr.DataArray(frac, coords=multi_index).unstack(sparse=True, fill_value=0.0)

    # Apply the weights to the dataset and aggregate using dot product
    ds_agg = []
    for v in ds.data_vars:
        ds_agg.append(xr.dot(ds[v], weights))
        ds_agg[-1].name = v
    agg = xr.merge(ds_agg)

    # Remove the grid mapping variable if it exists
    if ds.rio.grid_mapping:
        agg = agg.drop_vars(ds.rio.grid_mapping, errors="ignore")
    return agg
```
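A hypothetical usage sketch (the file names, the EPSG codes, the dimension names, and the "NAME" column are made up for illustration):

```python
import geopandas as gpd
import rioxarray  # noqa: F401
import xarray as xr

# a gridded dataset with a CRS, and polygons in a projected CRS
ds = xr.open_dataset("precip.nc").rio.write_crs("EPSG:4326")
basins = gpd.read_file("basins.gpkg").to_crs("EPSG:3857")

# returns a Dataset with the spatial dims replaced by a "NAME" dimension
agg = area_weighted_agg(ds, basins, x_dim="lon", y_dim="lat", feature_col="NAME")
```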
I just discovered regionmask 👍 while discussing similar applications in wradlib, as shown here. I have some similar implementations in poligrain which are focused on calculating the intersection weights of a line through a grid.

I am adding my info here, not for immediate action, but maybe for a discussion at some point, to see if and how there is potential for merging parts of the implementation.

In my application (using gridded weather radar data and path-averaged rainfall information from CMLs), I typically have large grids (900 * 900 pixels) and some thousands of short lines (covering at most 30 or 40 pixels). Hence, I get very sparse intersection weight matrices. In addition, I typically want to derive long time series for each intersecting line. I have a solution that uses sparse matrices and sparse.tensordot to easily handle large datasets.

Here is the code that calculates the sparse intersection weights. And here is the code that uses sparse.tensordot to calculate the resulting time series. Here is a notebook that shows the applications of the functions.

Note that this implementation is tied to the data dimensions I use. It is not generic, but probably could be turned into something more generic with some effort.
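A minimal sketch of that sparse-weights pattern (not the poligrain code itself; the shapes are arbitrary and the weights are random placeholders for the real per-line intersection weights):

```python
import numpy as np
import sparse

ny, nx, n_times = 900, 900, 24
n_lines = 2000

# placeholder: in practice these are the per-pixel intersection weights of each line,
# already normalised so that each line's weights sum to 1
weights = sparse.random((n_lines, ny, nx), density=30 / (ny * nx))

# gridded radar rainfall with dimensions (time, y, x)
radar = np.random.rand(n_times, ny, nx)

# contract the two spatial dims; the result has shape (n_lines, n_times),
# i.e. one path-averaged time series per line
line_series = sparse.tensordot(weights, radar, axes=([1, 2], [1, 2]))
```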
Consider calculating exact overlaps using shapely/geopandas. I am opening a new issue because I now have a PR that will close #38. I was always of the opinion that this is prohibitively slow - but I should actually test it :wink:
Originally posted by @dcherian in https://github.com/regionmask/regionmask/issues/38#issuecomment-1778507295