[Feature] Dissolve Boundaries

JosiahParry commented 1 month ago

A common GIS task is to dissolve boundaries based on shared boundaries of polygons. Doing this with R always bends my head a little bit.

I think having a utility function in {sf} to do this would be very handy.

Here is a minimal example of how that function might work. Using spdep is probably the best for this task because it already has functions for identifying contiguous polygons as well as the connected subgraphs.

This is inspired by @cgmossa's PhD work

 dissolve_boundaries <- function(x) {
  if (!requireNamespace("spdep")) {
    stop("spdep is needed to use this function")
  }

  crs <- sf::st_crs(x)
  nbs <- spdep::poly2nb(x)
  subgraphs <- spdep::n.comp.nb(nbs)
  res <- lapply(
    split(x, subgraphs[["comp.id"]]),
    sf::st_union,
    is_coverage = TRUE
  )

  sf::st_sfc(
    unlist(res, recursive = FALSE),
    crs = crs
  )

}

loreabad6 commented 1 month ago

It took me a few minutes to understand what exactly this function would do in comparison to st_cast(st_union(x), "POLYGON") so I made a little reprex to show it.

library(sf)
nc = read_sf(system.file("shape/nc.shp", package="sf"))

set.seed(241)
test = nc[sample(nrow(nc), 18),] |> st_geometry()

a = test |> st_union() |> st_as_sf()
a$id = 1:nrow(a)
b = a |> st_cast("POLYGON") |> st_as_sf()
#> Warning in st_cast.sf(a, "POLYGON"): repeating attributes for all
#> sub-geometries for which they may not be constant
b$id = 1:nrow(b)
c = test |> dissolve_boundaries() |> st_as_sf() 
#> Loading required namespace: spdep
c$id = 1:nrow(c)

b
#> Simple feature collection with 11 features and 1 field
#> Geometry type: POLYGON
#> Dimension:     XY
#> Bounding box:  xmin: -83.73952 ymin: 34.30505 xmax: -75.77316 ymax: 36.55716
#> Geodetic CRS:  NAD27
#> First 10 features:
#>     id                              x
#> 1    1 POLYGON ((-83.1615 35.05922...
#> 1.1  2 POLYGON ((-76.00897 36.3196...
#> 1.2  3 POLYGON ((-75.97629 36.5179...
#> 1.3  4 POLYGON ((-75.78317 36.2251...
#> 1.4  5 POLYGON ((-82.74389 35.4180...
#> 1.5  6 POLYGON ((-78.95108 36.2338...
#> 1.6  7 POLYGON ((-76.70538 35.4119...
#> 1.7  8 POLYGON ((-76.6949 35.35043...
#> 1.8  9 POLYGON ((-81.659 36.11759,...
#> 1.9 10 POLYGON ((-80.45065 35.7648...
nrow(b)
#> [1] 11

c
#> Simple feature collection with 8 features and 1 field
#> Geometry type: GEOMETRY
#> Dimension:     XY
#> Bounding box:  xmin: -83.73952 ymin: 34.30505 xmax: -75.77316 ymax: 36.55716
#> Geodetic CRS:  NAD27
#>                                x id
#> 1 MULTIPOLYGON (((-75.97629 3...  1
#> 2 POLYGON ((-77.98668 34.3399...  2
#> 3 POLYGON ((-78.95108 36.2338...  3
#> 4 MULTIPOLYGON (((-76.70538 3...  4
#> 5 POLYGON ((-82.24016 35.4681...  5
#> 6 POLYGON ((-81.659 36.11759,...  6
#> 7 POLYGON ((-80.50825 36.0708...  7
#> 8 POLYGON ((-83.1615 35.05922...  8
nrow(c)
#> [1] 8

cols = paletteer::paletteer_d("tvthemes::gravityFalls")
par(mfrow = c(2,2), mar = c(0,0,1,0))
plot(test, main = "Original geoms")
plot(a, main = "st_union()",
     key.pos = NULL, reset = FALSE, pal = cols)
plot(b, main = "st_union |> st_cast('POLYGON')",
     key.pos = NULL, reset = FALSE, pal = cols)
plot(c, main = "dissolve_boundaries()",
     key.pos = NULL, reset = FALSE, pal = cols)

It would be cool to have this in sf! Would this fit better as a possible st_dissolve() function or as a special case of st_union()?

edzer commented 2 weeks ago

Thanks @loreabad6 that clarifies! My guess is that the only difference between the bottom two plots is a single queen neighbour, and I'm not sure that is what you want with "dissolve".

ar-puuk commented 1 week ago

Forgive me if this is not the right thread to discuss this, but I have always wondered why {sf} didn't have an existing st_dissolve() function that does something similar to the following:

object_dissolved <- object %>%
  group_by(attribute) %>%
  summarize(
    geometry = sf::st_union(geometry), # or SHAPE in case of data loaded from GDB
    across(everything(), first),
    .groups = "drop"
  )

Here it is easy since I have single aggregation for all fields (i.e. first), however, it might be complicated to use different aggregate functions for different columns. For instance, I might want to sum() an attribute called area, take a mean() of population, and calculate median() of median_household_income. In that case, it might be much easier to group_by() and st_union() and specify aggregate function separately for different columns. While it would be cool to have a st_dissolve() function, I am unsure how difficult would it be to implement multiple aggregation within the same function.

rCarto commented 1 week ago

@ar-puuk , I may have a small tidyverse-free function for what you describe here. Not bullet proof, though...

rsbivand commented 1 week ago

Could I ask why the aggregate method for sf objects is not adequate for these needs?

ar-puuk commented 6 days ago

Could I ask why the aggregate method for sf objects is not adequate for these needs?

I do not want to hijack the initial issue, but the aggregate function is much less intuitive as is, which is not consistent with other SF functions. If I were to want to take the mean of numeric columns, sum the integer column, and take the first values of the character column, things would get messy really fast. I ended up using group_by() and summarize().

# Select only the numeric columns
numeric_cols <- sapply(wfrc_taz, is.numeric)

# Perform the aggregation only on numeric columns
wfrc_taz_agg <- aggregate(
  wfrc_taz[, numeric_cols],                   # Subset to numeric columns
  by = list(CO_NAME = wfrc_taz$CO_NAME),      # Aggregate by CO_NAME
  FUN = sum
)

However, the function provided by @rCarto seems to be the exact one I was looking for. Thank you so much. The only wish/question I have is if it would work with logical filters such as in the code below. Also, maybe the fun argument could take in the actual function (again for consistency), rather than a character of the function name.

r <- st_aggregate(nc, "dummy", c(is.character(), is.integer(), is.numeric()), c(first, sum, mean))

In that case, we would expand the by argument to take in values such as r colnames %in% c("BIR74", "NWBIR74") and so on to apply different aggregation functions to different sets of columns.

r-spatial / sf

[Feature] Dissolve Boundaries #2422