thinkingmachines / geowrangler

🌏 A python package for wrangling geospatial datasets
https://geowrangler.thinkingmachin.es/
MIT License
48 stars 15 forks source link

Spatial join of polygons with highest intersection #164

Closed alronlam closed 2 years ago

alronlam commented 2 years ago

This is from @joshuacortez:

Context: When spatial joining 2 geodataframes with polygons, we can have duplicate matches. One way this has been addressed in the past was to convert one of the gdfs to points by getting the centroid. That can work but some polygons can be unmatched because it depends heavily on the position of the centroid.

A more comprehensive approach (but also more computationally expensive) is to join polygons based on the highest intersection.

Inputs: 2 gdfs, and columns for the unique ID per gdf, and proj_crs (not sure if we can leave this out entirely) Output: gdf matched IDs.

Reference implementation (this still has to be refactored and generalized):

def get_highest_intersection_area_name(
    gdf: gpd.GeoDataFrame,
    admin_bounds: gpd.GeoDataFrame,
    id_name: str,
    proj_crs: str,
    area_names: List[str],
) -> gpd.GeoDataFrame:

    overlay = gdf.overlay(admin_bounds, how="intersection")
    overlay["geometry"] = overlay["geometry"].to_crs(proj_crs)
    overlay["area"] = overlay.geometry.area
    overlay = overlay.sort_values(by="area", ascending=True)
    overlay = overlay.drop_duplicates(subset=id_name, keep="last")
    assert not overlay[id_name].duplicated().any()

    usecols = [id_name] + area_names
    id_to_name = overlay.loc[:, usecols].sort_values(by=id_name, ascending=True)

    gdf = pd.merge(
        left=gdf, right=id_to_name, on=id_name, how="left", validate="one_to_one"
    )

    return gdf