Grid Tile Clustering - Githubissues

alronlam commented 2 years ago

This is from @joshuacortez:

In geospatial projects that involve scoring grid tiles, sometimes apart from the individual tiles, we’re also interested in the clusters of contiguous tiles that share the same attribute.

For example, we have tiles classified as urban or rural and want to find the urban clusters.

Input: GeoDataFrame (grid_x, grid_y)
Output: GeoDataFrame (grid_x, grid_y, cluster_id)

Important Considerations:

Scalability: Can handle many tiles and tiles of varying resolution
Runtime: Should run faster than the alternative spatial join approach
Tile format constraint: We don’t aim to cluster polygons of varying shapes (e.g. hexagons with squares). This clustering follows directly from the “Grid Tile Generation” feature request, where the tiles all have the same square shape.
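
To make the idea concrete, here is a rough sketch of clustering contiguous tiles from their integer grid indices alone. It is not the GeoWrangler implementation: the function name `label_contiguous_tiles`, the plain `(grid_x, grid_y)` tuples, and the integer cluster ids are all assumptions made for illustration. It uses a breadth-first flood fill with edge-sharing neighbors only. Because neighbors are found with set lookups rather than geometry comparisons, the work grows roughly linearly with the number of tiles (plus an initial sort), which is presumably where the speedup over a spatial join would come from.

```python
from collections import deque
from typing import Dict, Iterable, Tuple

Cell = Tuple[int, int]  # hypothetical (grid_x, grid_y) integer tile indices


def label_contiguous_tiles(cells: Iterable[Cell]) -> Dict[Cell, int]:
    """Assign a cluster id to every tile; edge-sharing tiles get the same id."""
    remaining = set(cells)
    labels: Dict[Cell, int] = {}
    next_id = 0
    # Visit tiles in sorted order so the cluster ids are deterministic.
    for start in sorted(remaining):
        if start in labels:
            continue  # already reached from an earlier tile
        labels[start] = next_id
        queue = deque([start])
        while queue:
            x, y = queue.popleft()
            # Four-way connectivity: right, left, up, down.
            for neighbor in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                if neighbor in remaining and neighbor not in labels:
                    labels[neighbor] = next_id
                    queue.append(neighbor)
        next_id += 1
    return labels


# Example: three urban tiles touching each other, two more touching, one isolated.
urban_tiles = [(0, 0), (1, 0), (1, 1), (5, 5), (5, 6), (9, 9)]
clusters = label_contiguous_tiles(urban_tiles)
```

In this example the tiles fall into three clusters: the first three tiles, the pair at (5, 5) and (5, 6), and the lone tile at (9, 9).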

alronlam commented 2 years ago

@joshuacortez Just a clarification on the intended use case, to get a clearer picture of how it'll be used.

Sample scenario: I have a gdf containing: grid_x, grid_y, geometry, urban_class

If I want to get urban clusters, I would do something like this (roughly based on the reference implementation):

urban_gdf = gdf[gdf["urban_class"] == "urban"]
urban_gdf_with_cluster_ids = cluster_tiles(urban_gdf)

# Join logic goes here to place the cluster IDs back onto the original gdf.

Is this right?

For consistency with the other GeoWrangler functions, was thinking of something like this instead:

connectable_tiles = gdf[gdf["urban_class"] == "urban"]
gdf = cluster_tiles(gdf, connectable_tiles)
# gdf would then have a new column called "cluster_id" where each urban row has its corresponding cluster ID, while the others (e.g. rural, sub-urban, etc.) have NaNs.

This follows our current paradigm of input/output generally being the GDF representing the AOIs you care about.
Having the connectable_tiles param still keeps it flexible for any kind of logic you have on which tiles can be connected if adjacent (say, maybe you want urban and sub-urban).
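
A minimal sketch of that input/output paradigm, under stated assumptions: rows are plain dicts rather than a GeoDataFrame, `cluster_tiles_sketch` and `find` are hypothetical names, and `None` stands in for the NaN that a real GDF column would hold. This version uses a small union-find instead of a flood fill, which is just one possible way to do the grouping.

```python
from typing import Dict, List, Set, Tuple

Cell = Tuple[int, int]  # hypothetical (grid_x, grid_y) integer tile indices


def find(parent: Dict[Cell, Cell], cell: Cell) -> Cell:
    """Return the representative of the set containing `cell` (path halving)."""
    while parent[cell] != cell:
        parent[cell] = parent[parent[cell]]
        cell = parent[cell]
    return cell


def cluster_tiles_sketch(rows: List[dict], connectable: Set[Cell]) -> List[dict]:
    """Return every input row with a cluster_id; None if the tile is not connectable."""
    parent = {cell: cell for cell in connectable}
    for x, y in connectable:
        # Checking only the right and upper neighbors visits each shared edge once.
        for neighbor in ((x + 1, y), (x, y + 1)):
            if neighbor in parent:
                parent[find(parent, neighbor)] = find(parent, (x, y))
    # Map each set representative to a small integer id.
    roots = sorted({find(parent, cell) for cell in connectable})
    root_to_id = {root: idx for idx, root in enumerate(roots)}
    out = []
    for row in rows:
        cell = (row["grid_x"], row["grid_y"])
        cluster_id = root_to_id[find(parent, cell)] if cell in parent else None
        out.append({**row, "cluster_id": cluster_id})
    return out


rows = [
    {"grid_x": 0, "grid_y": 0, "urban_class": "urban"},
    {"grid_x": 1, "grid_y": 0, "urban_class": "urban"},
    {"grid_x": 2, "grid_y": 0, "urban_class": "rural"},
    {"grid_x": 3, "grid_y": 0, "urban_class": "urban"},
]
connectable = {(r["grid_x"], r["grid_y"]) for r in rows if r["urban_class"] == "urban"}
result = cluster_tiles_sketch(rows, connectable)
```

The output has the same rows as the input. The first two urban tiles share a cluster ID, the rural tile gets `None`, and the last urban tile ends up in its own cluster because the rural tile separates it from the others.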

What do you think?

joshuacortez commented 2 years ago

Yep sounds good!

The function inputs can look like this:

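from typing import List, Optional

import geopandas as gpd

def cluster_tiles(
    gdf: gpd.GeoDataFrame,
    grid_x_col: str = "x",
    grid_y_col: str = "y",
    category_col: Optional[str] = None,
    categories_used: Optional[List[str]] = None,
    connectivity_type: str = "four-way",
) -> gpd.GeoDataFrame:
    ...  # body omitted; only the proposed inputs are shown here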

The output is the same as the original gdf but with an appended cluster_id column (string, but can be NULL):

tile_id, x, y, <other cols>, cluster_id
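
For illustration, here is one way the connectivity_type input could translate into neighbor checks. Only "four-way" appears in the signature above; the "eight-way" value, the offset tables, and the helper names `neighbor_offsets` and `are_connected` are assumptions for this sketch and may not match what the PR actually does.

```python
from typing import List, Tuple

Cell = Tuple[int, int]  # hypothetical (x, y) integer tile indices

# Edge-sharing neighbors only.
FOUR_WAY: List[Cell] = [(1, 0), (-1, 0), (0, 1), (0, -1)]
# Edge-sharing plus corner-touching neighbors (assumed option, not in the thread).
EIGHT_WAY: List[Cell] = FOUR_WAY + [(1, 1), (1, -1), (-1, 1), (-1, -1)]


def neighbor_offsets(connectivity_type: str = "four-way") -> List[Cell]:
    """Return the (dx, dy) offsets treated as adjacent for a connectivity type."""
    if connectivity_type == "four-way":
        return FOUR_WAY
    if connectivity_type == "eight-way":
        return EIGHT_WAY
    raise ValueError(f"Unknown connectivity_type: {connectivity_type!r}")


def are_connected(a: Cell, b: Cell, connectivity_type: str = "four-way") -> bool:
    """Check whether tile b is a direct neighbor of tile a."""
    offset = (b[0] - a[0], b[1] - a[1])
    return offset in neighbor_offsets(connectivity_type)


side_by_side = are_connected((0, 0), (1, 0))
diagonal_four = are_connected((0, 0), (1, 1), "four-way")
diagonal_eight = are_connected((0, 0), (1, 1), "eight-way")
```

Under these assumptions, diagonal tiles are separate under "four-way" but would merge into one cluster under "eight-way", so the choice can noticeably change how many clusters come out.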

joshuacortez commented 2 years ago

Addressed in PR #178.

thinkingmachines / geowrangler

Grid Tile Clustering #165