`s2_union_agg()` is slow

paleolimbot commented 3 years ago

The strategy of accumulating a union is correct but slow! I assume there is a faster way to go about this that I haven't found in the documentation yet.

library(s2)
library(sf)
#> Warning: package 'sf' was built under R version 4.0.5
#> Linking to GEOS 3.8.1, GDAL 3.2.0, PROJ 7.2.0

countries <- s2_data_countries()
countries_sf <- st_as_sf(countries)

bench::mark(
  s2_coverage_union_agg(countries),
  s2_union_agg(countries),
  st_union(countries_sf),
  check = F
)
#> # A tibble: 3 x 6
#>   expression                            min   median `itr/sec` mem_alloc
#>   <bch:expr>                       <bch:tm> <bch:tm>     <dbl> <bch:byt>
#> 1 s2_coverage_union_agg(countries) 100.89ms  101.7ms     9.59     61.7KB
#> 2 s2_union_agg(countries)             2.82s    2.82s     0.355    43.7KB
#> 3 st_union(countries_sf)           141.69ms 144.13ms     6.90    666.1KB
#> # … with 1 more variable: gc/sec <dbl>

^{Created on 2021-04-28 by the reprex package (v0.3.0)}

vwmaus commented 3 years ago

This is not a solution for the speed of s2_union_agg. I am posting a wrapper function that might be useful to someone landing on this issue, such as I did while looking for faster processing using s2_union_agg.

The function below s2_union_split_agg is a wrapper function to speed s2 union transformation of large data sets. For small data sets that are highly intersecting (e.g. countries) this workaround does not have much effect and might even be slower. However, on large data sets with sparse features, s2_union_split_agg is extremely fast because it splits the features into intersecting groups of features and applies s2_union_agg only to these groups independently (potentially in parallel). It solves internal boundaries first for each intersecting group then combines all results into one object.

s2_union_split_agg needed about 70 seconds to perform the union on a personal global data set with 47,069 polygon features, including about 4,000 intersecting polygons. I did not measure the time using s2_union_agg on this data set because I had to kill the process after about 1.5h. Below is an example using the countries data set. Hope this is useful to someone.

library(s2)
library(sf)
#> Linking to GEOS 3.9.0, GDAL 3.2.2, PROJ 7.2.1
library(tibble)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

s2_union_split_agg <- function(x, options = s2_options()){

  # check for intersects 
  idx_intersects <- s2::s2_intersects_matrix(x, x, options = options)

  # prapare groups of intersects 
  idx_list <- lapply(seq_along(idx_intersects)[lengths(idx_intersects) > 1], function(i) c(i, idx_intersects[[i]]))
  group_list <- tibble::tibble(fid = integer(), gid = integer())
  for(i in idx_list){
    # check if intersecting fids already exist in the groups
    gids <- dplyr::filter(group_list, fid %in% unique(i))
    if(nrow(gids) > 0){
      # get for new fids 
      new_ids <- unique(i)[!unique(i) %in% group_list$fid]
      # get existing groups 
      new_gid <- unique(gids$gid)[1]
      # add new fids to existing groups
      group_list <- tibble::add_row(group_list, fid = new_ids, gid = new_gid) 
      # merge groups if new fids overlap several groups
      group_list <- dplyr::mutate(group_list, gid = ifelse(gid %in% gids$gid, new_gid, gid))
    } else {
      # create a new group id
      gid <- ifelse(nrow(group_list) == 0, 0, max(group_list$gid)) + 1
      # add all new fids to the new group
      group_list <- dplyr::bind_rows(tibble::tibble(fid = unique(i), gid = gid), group_list)
    }
  }

  # split features based on groups of intersects 
  x_split <- split(x[group_list$fid], group_list$gid)

  # spare features that do not intersect other features
  x <- x[-group_list$fid]

  # apply s2_union_agg to groups 
  x_union <- lapply(x_split, s2_union_agg, options = options) 

  # gather results into single object 
  result <- s2_rebuild_agg(c(x, do.call("c", x_union)), options = options)

  return(result)

}

countries <- s2_data_countries()

system.time(res1 <- s2_union_split_agg(countries, options = s2_options(model = "closed")))
#>    user  system elapsed 
#>   4.067   0.028   4.105
system.time(res2 <- s2_union_agg(countries, options = s2_options(model = "closed")))
#>    user  system elapsed 
#>   4.527   0.002   4.530

# the output geometries have no difference 
s2_difference(res1, res2)
#> <s2_geography[1]>
#> [1] <GEOMETRYCOLLECTION EMPTY>

# but the objects are not identical 
identical(res1, res2)
#> [1] FALSE

plot(st_as_sf(res1))

plot(st_as_sf(res2))

^{Created on 2021-09-17 by the reprex package (v2.0.1)}

paleolimbot commented 2 years ago

I never got around to thanking you for this example! I can and will implement something like this at the C++ level to speed up this type of union...hopefully for the next release!

paleolimbot commented 2 years ago

I still can and will make better use of this idea of intersecting groups, but I'm including a slight improvement in #165 that will make this a bit faster (although it's still O(n)).

# remotes::install_github("r-spatial/s2#165")
library(s2)
library(ggplot2)

res <- bench::press(
  n = c(1:5),
  bench::mark(s2_union_agg(rep(s2_data_countries(), n)))
)
#> Running with:
#>       n
#> 1     1
#> 2     2
#> 3     3
#> 4     4
#> 5     5

ggplot(res, aes(n, median)) +
  geom_point()

# install.packages("s2")
library(s2)
library(ggplot2)

res <- bench::press(
  n = c(1:5),
  bench::mark(s2_union_agg(rep(s2_data_countries(), n)))
)
#> Running with:
#>       n
#> 1     1
#> 2     2
#> 3     3
#> 4     4
#> 5     5

ggplot(res, aes(n, median)) +
  geom_point()

^{Created on 2022-03-05 by the reprex package (v2.0.1)}

r-spatial / s2

`s2_union_agg()` is slow #103