exact overlaps using shapely/ geopandas

mathause commented 9 months ago

Consider calculating exact overlaps using shapely/ geopandas. I am opening a new issue because I now have a PR that will close #38. I was always of the opinion that this is prohibitively slow - but I should actually test it :wink:

Alternatively, https://notebooksharing.space/view/c6c1f3a7d0c260724115eaa2bf78f3738b275f7f633c1558639e7bbd75b31456#displayOptions= and xagg convert the regular grid to a list of polygons (one per grid cell) and use geopandas overlay to figure out the weights.

Originally posted by @dcherian in https://github.com/regionmask/regionmask/issues/38#issuecomment-1778507295

dcherian commented 9 months ago

that this is prohibitively slow

https://github.com/xarray-contrib/cf-xarray/pull/478/files has some tricks :)

mathause commented 9 months ago

I think a way to tremendously speed this up is to first create a mask with the all_touched=True option and then only check grid cells that "touch" a region. E.g. the ar6.all regions has a density of about 1.8% for a global 1° grid. So instead of checking 3.7M grid cells-region combinations we'd only need to check 70k. This only works for equally-spaced grids, though. (Maybe test shapely.overlaps for non-regular grids).

As always I am totally over-engineering this. I guess most users would actually be ok with a ~20 s wait...

dcherian commented 9 months ago

I did something like that inflox recently: https://github.com/xarray-contrib/flox/blob/bf936f526d7b9cd10c6632f9bf4ef98f18055d6f/flox/core.py#L253-L273

But I was also assuming that shapely.overlaps had a fast path for that case :P . I've had good feedback from Martin F. so if you opened a PR branch with an example, we might get good suggestions

mathause commented 9 months ago

Proof of concept code (not nice yet, and not yet fully functional), requires xarray-contrib/cf-xarray#478.

import cf_xarray
import matplotlib.pyplot as plt
import numpy as np
import shapely
import xarray as xr

import regionmask

def mask_3D_frac_exact(ds, region_polygons, numbers):

    bounds_polygons = cf_xarray.geometry.grid_to_polygons(ds)

    shapely.prepare(bounds_polygons.values)

    intersection = shapely.intersection(
        bounds_polygons.values.flatten(), np.atleast_2d(region_polygons).T
    )

    bounds_area = shapely.area(bounds_polygons.values.flatten())

    area_fraction = shapely.area(intersection) / bounds_area

    area_fraction = area_fraction.reshape(-1, *bounds_polygons.shape)
    mask = regionmask.core.mask.mask_to_dataarray(area_fraction, ds.lon, ds.lat)

    mask = regionmask.core.mask._3D_to_3D_mask(mask, numbers, drop=False)

    return mask

# ========================================================
# example

air = xr.tutorial.open_dataset("air_temperature")
air = air.cf.add_bounds(["longitude", "latitude"])

region_polygons = np.array(regionmask.defined_regions.ar6.land.polygons)
numbers = regionmask.defined_regions.ar6.land.numbers

mask = mask_3D_frac_exact(air, region_polygons, numbers)

# for plotting
lon = mask.lon.values
lon[lon >= 180] = lon - 360
mask = mask.assign_coords(lon=lon)

fg = mask.isel(region=slice(None, 6)).plot(col="region", col_wrap=3, cmap="Reds")

for ax in fg.axs.flatten():
    regionmask.defined_regions.ar6.land.plot_regions(ax=ax)

plt.savefig("exact_overlap.png", dpi=300, bbox_inches="tight")

exact_overlap

dcherian commented 9 months ago

FWIW I was playing around with

from geopandas import GeoDataFrame

def estimate_fractional_weights(source: GeoDataFrame, dest: GeoDataFrame):
    import xvec

    # assert dest.columns == ["geometry"]  # ???

    columns = set(dest.columns)
    assert len(columns) == 2
    assert "geometry" in columns

    # other is a nice string label for the geometry.
    (other,) = set(columns) - {"geometry"}
    assert source.crs == dest.crs

    # union any duplicate geometries
    dest = dest.dissolve(other).reset_index()

    # TODO: always a good idea?
    dest[other] = pd.Categorical(dest[other])

    # calculate overlap geometries
    overlay = source.overlay(dest, make_valid=False)
    other_series = overlay[other]

    # display(overlay)

    grid_cell_fraction = overlay.geometry.area.groupby(other_series).transform(
        lambda x: x / x.sum()
    )

    # print(f"sparsity={len(overlay) / (len(grid_df) * len(regions_df))}")
    # display(grid_cell_fraction.groupby(other_series).sum())

    multi_index = pd.MultiIndex.from_frame(
        overlay[["latitude", "longitude", "SOVEREIGNT"]]
    )
    df_weights = GeoDataFrame(
        {"weights": grid_cell_fraction.values, "geometry": overlay.geometry.values},
        index=multi_index,
    )
    df_weights = df_weights.set_crs(dest.crs)

    import xarray as xr

    ds_weights = xr.Dataset(df_weights)
    sparse_weights = ds_weights["weights"].unstack(sparse=True, fill_value=0.0)

    # It would be nice if geometries had names
    sparse_weights.coords["geometry"] = (other, dest.geometry.values)

    # This doesn't pass alignment later :(
    # https://github.com/pydata/xarray/issues/7695
    # sparse_weights = sparse_weights.set_xindex(["geometry"], xvec.GeometryIndex)
    return sparse_weights

cheginit commented 7 months ago

Based on the discussion here, I wrote a function that I think worth sharing:

import xarray as xr
import numpy as np
import geopandas as gpd
import rasterio.features as rio_features

def area_weighted_agg(ds: xr.Dataset | xr.DataArray, gdf: gpd.GeoDataFrame | gpd.GeoSeries, x_dim: str, y_dim: str, feature_col: str) -> xr.Dataset:
    """Aggregate a dataset within geometries using area-weighted averaging.

    Parameters
    ----------
    ds : xarray.Dataset or xarray.DataArray
        The dataset to aggregate.
    gdf : geopandas.GeoDataFrame or geopandas.GeoSeries
        The geometries to aggregate the dataset within.
    x_dim : str
        The x dimension name in `ds`.
    y_dim : str
        The y dimension name in `ds`.
    feature_col : str
        The column name in `gdf` that contains the feature ids
        of the geometries. This will be used as a new dimension in
        the resulting dataset.

    Returns
    -------
    xarray.Dataset
        The aggregated dataset.
    """
    if isinstance(ds, xr.DataArray):
        ds = ds.to_dataset(name=ds.name or "value")
    if not isinstance(ds, xr.Dataset):
        raise ValueError("ds must be an xarray.Dataset")
    if not ds.rio.crs:
        raise ValueError("ds must have a valid CRS")
    ds = ds.rio.set_spatial_dims(x_dim=x_dim, y_dim=y_dim)

    if isinstance(gdf, gpd.GeoSeries):
        gdf = gdf.to_frame("geometry")
    if not isinstance(gdf, gpd.GeoDataFrame):
        raise ValueError("gdf must be a GeoDataFrame")
    if not gdf.crs:
        raise ValueError("gdf must have a valid CRS")
    if not gdf.crs.is_projected:
        raise ValueError("gdf must be in a projected CRS")

    # Create a grid of the same shape as the dataset with box geometries
    # and the dataset dims as two columns
    shapes = rio_features.shapes(
        source=np.arange(ds.rio.shape[0] * ds.rio.shape[1]).reshape(ds.rio.shape).astype("int32"),
        transform=ds.rio.transform(),
    )
    grid = gpd.GeoDataFrame(geometry=[sgeom.shape(g) for g, _ in shapes], crs=ds.rio.crs)
    y_idx, x_idx = np.unravel_index(grid.index, ds.rio.shape)
    grid[x_dim] = ds.isel({x_dim: x_idx})[x_dim].to_numpy()
    grid[y_dim] = ds.isel({y_dim: y_idx})[y_dim].to_numpy()
    grid = grid.to_crs(gdf.crs)

    # Overlay the grid with the geometries and calculate the fraction of
    # each grid cell that is within each geometry
    overlay = grid.overlay(gdf, make_valid=False)
    frac = overlay.geometry.area.groupby(overlay[feature_col]).transform(lambda x: x / x.sum()).to_numpy()

    # Create a multi-index from the geometry ids and the dataset coordinates
    multi_index = xr.Coordinates.from_pandas_multiindex(overlay.set_index([y_dim, x_dim, feature_col]).index, 'dim')

    # Create a sparse array with the fractions and the multi-index
    weights = xr.DataArray(frac, coords=multi_index).unstack(sparse=True, fill_value=0.0)

    # Apply the weights to the dataset and aggregate using dot product
    ds_agg = []
    for v in ds.data_vars:
        ds_agg.append(xr.dot(ds[v], weights))
        ds_agg[-1].name = v
    agg = xr.merge(ds_agg)

    # Remove the grid mapping variable if it exists
    if clm.rio.grid_mapping:
        agg = agg.drop_vars(clm.rio.grid_mapping)
    return agg

cchwala commented 2 months ago

I just discovered regionmask 👍 while discussing similar applications in wradlib as shown here. I have some similar implementations in poligrain which are focused on calculating the intersection weights of a line through a grid.

I am adding my info here, not for immediate action, but maybe for a discussion at some point, to see if and how there is potential for merging parts of the implementation.

In my application (using gridded weather radar data and path-averaged rainfall information from CMLs), I typically have large grids (900 * 900 pixels) and some thousands of short lines (covering max. 30 or 40 pixels). Hence, I get very sparse intersection weight matrices. In addition I typically want to derive long time series for each intersecting line. I have a solution that uses sparse matrices and sparse.tensordot to easily handle large datasets.

Here is the code that calculates the sparse intersection weights.

And here is the code that uses sparse.tensordot to calculate the resulting time series.

Here is a notebook that shows the applications of the functions.

Note that this implementation is tied to the data dimension I use. It is not generic, but probably could be turned into something more generic with some effort.

regionmask / regionmask

exact overlaps using shapely/ geopandas #499