pysal / tobler

Spatial interpolation, Dasymetric Mapping, & Change of Support
https://pysal.org/tobler
BSD 3-Clause "New" or "Revised" License
146 stars 30 forks source link

Pick attribute based on largest intersection #101

Closed martinfleis closed 3 years ago

martinfleis commented 3 years ago

I have a function which links variables from one df to the other based on the largest intersection (which of the geometries in source_df has the largest area when intersected with geometry in target_df). Though it is not as fancy as the rest of Tobler, it may be useful in some cases (like when you have almost the same geometries in both - like tessellation vs parcels).

Do you think it would make sense to have it in Tobler? Happy to do a PR.

cc @darribas

def area_max(source_df, target_df, variables):
    """
    Join attributes from source based on the largest intersection. In case of a tie it picks the first one.

    Parameters
    ----------
    source_df : GeoDataFrame
        GeoDataFrame containing source values
    target_df : GeoDataFrame
        GeoDataFrame containing source values
    variables : string or list-like
        column(s) in dataframes for variable(s)

    Returns
    -------
    GeoDataFrame

    Notes
    -----
    TODO: preserve dtype where possible

    """    
    target_df = target_df.copy()
    target_ix, source_ix = source_df.sindex.query_bulk(target_df.geometry, predicate='intersects')
    areas = target_df.geometry.values[target_ix].intersection(source_df.geometry.values[source_ix]).area

    main = []
    for i in range(len(target_df)):
        mask = target_ix == i
        if np.any(mask):
            main.append(source_ix[mask][np.argmax(areas[mask])])
        else:
            main.append(np.nan)

    main = np.array(main)
    mask = ~np.isnan(main)
    if pd.api.types.is_list_like(variables):
        for v in variables:
            arr = np.empty(len(main), dtype=object)
            arr[:] = np.nan
            arr[mask] = source_df[v].values[main[mask].astype(int)]
            target_df[v] = arr
    else:
        arr = np.empty(len(main), dtype=object)
        arr[:] = np.nan
        arr[mask] = source_df[variables].values[main[mask].astype(int)]
        target_df[variables] = arr

    return target_df
knaaptime commented 3 years ago

i can also imagine an analogous case where you have two datasets digitized from different sources, where one version is 'correct' but the other has useful data that you need to transfer to the other polys without needing to reapportion it. (I remember hand-digitizing some school boundary maps and manually entering some data points when i worked at a non-profit a few years ago, and now that data could be transferred to the new canonical boundaries released through other channels). Right now this is often done crudely through, e.g. the representative_point, but this max area method would often be preferable in these situations

all that to say i see the value in this and i think it makes sense in tobler, though i'm not 100% on where exactly, and what it should be called

sjsrey commented 3 years ago

I think this would nicely expand the functionality available to the users. So +1.