thinkingmachines / geowrangler

🌏 A python package for wrangling geospatial datasets
https://geowrangler.thinkingmachin.es/
MIT License
48 stars, 15 forks

Grid Tile Clustering #165

Closed alronlam closed 2 years ago

alronlam commented 2 years ago

This is from @joshuacortez:

In geospatial projects that involve scoring grid tiles, sometimes apart from the individual tiles, we’re also interested in the clusters of contiguous tiles that share the same attribute.

For example, we have tiles classified as urban or rural and want to find the urban clusters.

Input: GeoDataFrame (grid_x, grid_y)
Output: GeoDataFrame (grid_x, grid_y, cluster_id)
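To make the idea concrete, here is a minimal sketch of labeling contiguous tiles with a four-way flood fill over (grid_x, grid_y) pairs. It is only an illustration in plain Python, not the reference implementation; the function name `label_contiguous_tiles` and the sample coordinates are made up for this example.

```python
from collections import deque


def label_contiguous_tiles(tiles):
    """Map each (grid_x, grid_y) tile to an integer cluster ID (four-way adjacency)."""
    tile_set = set(tiles)
    labels = {}
    next_id = 0
    # Visit tiles in sorted order so the cluster IDs are deterministic
    for start in sorted(tile_set):
        if start in labels:
            continue  # already reached from an earlier tile
        labels[start] = next_id
        queue = deque([start])
        while queue:
            x, y = queue.popleft()
            # Only tiles sharing an edge count as neighbors here
            for neighbor in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                if neighbor in tile_set and neighbor not in labels:
                    labels[neighbor] = next_id
                    queue.append(neighbor)
        next_id += 1
    return labels


# Hypothetical urban tiles: an L-shaped group, a pair, and an isolated tile
urban_tiles = [(0, 0), (0, 1), (1, 1), (3, 3), (4, 3), (6, 0)]
labels = label_contiguous_tiles(urban_tiles)
```

With these sample tiles, the L-shaped group, the pair, and the isolated tile each get their own cluster ID.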

Important Considerations:

alronlam commented 2 years ago

@joshuacortez Just a clarification on the intended use case, to get a clearer picture of how it'll be used.

Sample scenario: I have a gdf containing: grid_x, grid_y, geometry, urban_class

If I want to get urban clusters, I would do something like this (roughly based on the reference implementation):

urban_gdf = gdf[gdf["urban_class"] == "urban"]
urban_gdf_with_cluster_ids = cluster_tiles(urban_gdf)

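# One way to place the cluster IDs back on the original GDF: left-join on the grid coordinates
cols = ["grid_x", "grid_y"]
gdf = gdf.merge(urban_gdf_with_cluster_ids[cols + ["cluster_id"]], on=cols, how="left")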

Is this right?

For consistency with the other GeoWrangler functions, I was thinking of something like this instead:

connectable_tiles = gdf[gdf["urban_class"] == "urban"]
gdf = cluster_tiles(gdf, connectable_tiles)
# gdf would then have a new column called "cluster_id" where each urban row has its corresponding cluster ID, while the others (e.g. rural, sub-urban, etc.) have NaNs.

What do you think?

joshuacortez commented 2 years ago

Yep sounds good!

The function inputs can look like this:

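from typing import List, Optional

import geopandas as gpd


def cluster_tiles(
    gdf: gpd.GeoDataFrame,
    grid_x_col: str = "x",
    grid_y_col: str = "y",
    category_col: Optional[str] = None,
    categories_used: Optional[List[str]] = None,
    connectivity_type: str = "four-way",
) -> gpd.GeoDataFrame:
    ...  # body omitted; only the inputs are being proposed here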

The output is the same as the original gdf but with an appended cluster_id column (a string, but it can be NULL):

tile_id, x, y, <other cols>, cluster_id
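For illustration, here is a rough sketch of how a function with these inputs could behave. It is not the code from the PR. It uses a plain pandas DataFrame so it runs without geopandas, the name `cluster_tiles_sketch` is hypothetical, "eight-way" is an assumed second connectivity option, and it assumes that all selected categories form one connectable set.

```python
from typing import List, Optional

import pandas as pd

# Neighbor offsets per connectivity type; "eight-way" is an assumption for this sketch
OFFSETS = {
    "four-way": [(1, 0), (-1, 0), (0, 1), (0, -1)],
    "eight-way": [
        (dx, dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1) if (dx, dy) != (0, 0)
    ],
}


def cluster_tiles_sketch(
    df: pd.DataFrame,
    grid_x_col: str = "x",
    grid_y_col: str = "y",
    category_col: Optional[str] = None,
    categories_used: Optional[List[str]] = None,
    connectivity_type: str = "four-way",
) -> pd.DataFrame:
    offsets = OFFSETS[connectivity_type]
    # Select the rows that are allowed to join a cluster
    if category_col is not None and categories_used is not None:
        mask = df[category_col].isin(categories_used)
    else:
        mask = pd.Series(True, index=df.index)
    tiles = set(zip(df.loc[mask, grid_x_col], df.loc[mask, grid_y_col]))

    # Depth-first flood fill over the selected tiles
    labels = {}
    next_id = 0
    for start in sorted(tiles):
        if start in labels:
            continue
        labels[start] = str(next_id)
        stack = [start]
        while stack:
            x, y = stack.pop()
            for dx, dy in offsets:
                neighbor = (x + dx, y + dy)
                if neighbor in tiles and neighbor not in labels:
                    labels[neighbor] = str(next_id)
                    stack.append(neighbor)
        next_id += 1

    # Return a copy with cluster_id appended; unselected rows stay null
    out = df.copy()
    out["cluster_id"] = [
        labels.get((x, y)) if keep else None
        for x, y, keep in zip(out[grid_x_col], out[grid_y_col], mask)
    ]
    return out


# Hypothetical sample: two urban pairs that touch only diagonally
df = pd.DataFrame(
    {
        "x": [0, 1, 1, 2, 3],
        "y": [0, 0, 1, 1, 1],
        "urban_class": ["urban", "urban", "rural", "urban", "urban"],
    }
)
result = cluster_tiles_sketch(df, category_col="urban_class", categories_used=["urban"])
diag = cluster_tiles_sketch(
    df,
    category_col="urban_class",
    categories_used=["urban"],
    connectivity_type="eight-way",
)
```

In this sample the rural tile ends up with a null cluster_id. The urban tiles form two clusters under "four-way", and a single cluster under the assumed "eight-way" option because (1, 0) and (2, 1) touch diagonally.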

joshuacortez commented 2 years ago

Addressed in PR #178.