r-spatial / sf

Simple Features for R
https://r-spatial.github.io/sf/
Other
1.31k stars 292 forks source link

Speeding up st_intersection #801

Closed ekarsten closed 5 years ago

ekarsten commented 6 years ago

I noticed while doing some work with st_intersection that I could speed it up by applying st_intersects first and then only calling st_intersection on the pairs of geometries that actually intersect. In light of this, I tried implementing a fork of sf that builds this check into the st_intersection function.

I am a novice C++ programmer, so I'm sure my implementation isn't ideal, but I have linked it here around line 730.

Below is some reproducible example code to test the speed of the new version vs. the old version. The speed improvement isn't insane (~ 20%), but it's nontrivial. I am intersecting North Carolina with some 3mi by 3mi grids.

## Checking to see if new version of library is faster

# devtools is broken so you have to do this, sorry :/
library(devtools)
install_github("r-lib/devtools")
library(devtools)
install_github("ekarsten/sf", force = T)
library(sf)
library(dplyr)

utm_crs <- 32614
mile_2_meter <- 1609.344

nc <-
  system.file("shape/nc.shp", package="sf") %>%
  st_read() %>%
  st_transform(utm_crs) %>%
  select()

grids <- 
  nc %>%
  st_make_grid(cellsize = 3*c(mile_2_meter, mile_2_meter)) %>%
  st_sf(seq(1, length(.)), ., crs = utm_crs) %>%
  select(Grid = starts_with("X1"),
         geometry = ".")

# check-time
system.time({new = st_intersection(nc, grids)})

remove.packages("sf")
install.packages("sf")
library(sf)

# check-time
system.time({old = st_intersection(nc, grids)})

I am using GEOS 3.6.1, GDAL 2.2.3, proj.4 4.9.3 Session info:

R version 3.5.1 (2018-07-02)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.11.4    mapview_2.4.0        dplyr_0.7.6          sf_0.6-4             usethis_1.3.0        devtools_1.13.6.9000

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.17       lattice_0.20-35    png_0.1-7          class_7.3-14       assertthat_0.2.0   rprojroot_1.3-2    digest_0.6.15     
 [8] foreach_1.4.4      mime_0.5           R6_2.2.2           plyr_1.8.4         backports_1.1.2    stats4_3.5.1       e1071_1.6-8       
[15] httr_1.3.1         pillar_1.2.3       rlang_0.2.1        curl_3.2           rstudioapi_0.7     callr_2.0.4        raster_2.6-7      
[22] R.utils_2.6.0      R.oo_1.22.0        desc_1.2.0         webshot_0.5.0      rgdal_1.3-3        htmlwidgets_1.2    munsell_0.5.0     
[29] shiny_1.1.0        compiler_3.5.1     httpuv_1.4.4.2     pkgconfig_2.0.1    base64enc_0.1-3    pkgbuild_1.0.0     htmltools_0.3.6   
[36] tidyselect_0.2.4   tibble_1.4.2       codetools_0.2-15   viridisLite_0.3.0  crayon_1.3.4       withr_2.1.2        later_0.7.3       
[43] R.methodsS3_1.7.1  grid_3.5.1         jsonlite_1.5       spData_0.2.9.0     satellite_1.0.1    xtable_1.8-2       DBI_1.0.0         
[50] git2r_0.22.1       magrittr_1.5       units_0.6-0        scales_0.5.0       cli_1.0.0          promises_1.0.1     leaflet_2.0.1     
[57] bindrcpp_0.2.2     sp_1.3-1           testthat_2.0.0     iterators_1.0.9    tools_3.5.1        gdalUtils_2.0.1.14 glue_1.2.0        
[64] purrr_0.2.5        crosstalk_1.0.0    processx_3.1.0     pkgload_1.0.0      yaml_2.1.19        colorspace_1.3-2   classInt_0.2-3    
[71] memoise_1.1.0      knitr_1.20         bindr_0.1.1 
etiennebr commented 6 years ago

Thanks, this makes a lot of sense; postgis also benefits from a st_intersects before an st_intersection.

I think to make it easier for @edzer to review your changes would be to squash the commits and create a PR.

ekarsten commented 6 years ago

Thanks for the feedback. I squashed into one commit and submitted PR #803.

I haven't had time to investigate, but I suspect similar speed gains could be achieved in st_difference by performing similar checks (but maybe not) if two objects don't intersect, then st_difference should just return the first, if they do, maybe call st_within to see if st_difference shouldn't return output and then finally call st_difference if a new geometry needs to be returned.

etiennebr commented 6 years ago

I think you're right. It looks like another layer, after indexing, to speedup computations. I wonder when or if it slows it down in certain cases. I guess looking at the postgis execution plans for these operations should be pretty informative.

Robinlovelace commented 5 years ago

Quick follow-up on this, did this get implemented @ekarsten? It seems from #803 that the PR didn't go in. Could be very useful for my work at the moment! Thanks.

gdmcdonald commented 1 year ago

Found this thread years later. I checked for st_intersects(), subset out only those that do intersect, then used st_intersection() afterwards. The workaround sped things up heaps, from an overnight task that never finished to a 15s job:

st_intersection_faster <- function(x,y,...){
#faster replacement for st_intersection(x, y,...)

  y_subset <-
    st_intersects(x, y) %>%
    unlist() %>%
    unique() %>%
    sort() %>%
    {y[.,]}

  st_intersection(x, y_subset,...)
}
Robinlovelace commented 1 year ago

Looks like that belongs in an {sfextras} type package.

rsbivand commented 1 year ago

Please do provide full sessionInfo() output for platform type and package versions, at least a summary of the objects (is s2 involved or GEOS), GEOS version (NG topology engine used or not), and access to an input object showing these characteristics. Also memory use while running, feels like memory exceeded. sf does use STRtree planar indexing anyway for topological operations, discussed in ASDAR II edition ch. 5 https://asdar-book.org/ p. 138-9, https://asdar-book.org/book2ed/cm2_mod.R chunks 41-42. The STRtree finds overlapping bounding boxes, so a superset of intersections, but performance depends on the shapes of the input geometries.

rsbivand commented 1 year ago

Also @gdmcdonald is your case binary or n-ary (self-intersections)? The latter can be very hard to handle with multiple overlapping geometries, where the STRtree approach may be defeated.

kadyb commented 1 year ago

I did this synthetic benchmark and I can see there is a big problem with the {sf} - {s2} interaction. This operation on this small dataset took milliseconds in {sf/GEOS} and {s2} (using s2_intersects_matrix()), but in {sf/s2} it took more than 3 minutes.

Code ```r library("s2") library("sf") set.seed(123) n = 10000 k = 500 idx = sample(n, size = k) ### sf - GEOS ### g = st_as_sf(data.frame(x = runif(n), y = runif(n)), coords = 1:2, crs = 3857) buffer = st_buffer(g[idx, ], 0.01) t = bench::mark( iterations = 5, check = FALSE, st_intersection(g, buffer), st_intersection_faster(g, buffer) ) t[, 1:5] #> expression min median `itr/sec` mem_alloc #> 1 st_intersection(g, buffer) 59.2ms 62.9ms 16.0 2.2MB #> 2 st_intersection_faster(g, buffer) 150.6ms 157.3ms 6.27 6.35MB ### sf - s2 ### g = st_transform(g, 4326) buffer = st_transform(buffer, 4326) system.time(st_intersection(g, buffer)) #> user system elapsed #> 190.22 3.97 194.19 system.time(st_intersection_faster(g, buffer)) #> user system elapsed #> 174.02 3.21 177.26 ### s2 ### system.time({ sel = unlist(s2_intersects_matrix(buffer, g)) g[sel, ] }) #> user system elapsed #> 0.2 0.0 0.2 ```
kadyb commented 1 year ago

Rplot

One more test on the same example. Overall, geos_intersects_matrix() appears to be faster than s2_intersects_matrix() and sf::st_intersects() is faster than sf::st_intersection(). But I think the exact case of @gdmcdonald would be more interesting to check.

Code ```r library("s2") library("sf") library("geos") set.seed(123) n = c(10000, 50000, 100000, 150000, 200000, 300000) k = c(500, 2500, 5000, 10000, 20000, 30000) idx = mapply(n, FUN = sample, size = k) t1 = t2 = t3 = t4 = numeric(length(n)) for (i in seq_along(n)) { g = st_as_sf(data.frame(x = runif(n[i]), y = runif(n[i])), coords = 1:2, crs = 3857) buffer = st_buffer(g[idx[[i]], ], 0.01) ## st_intersection t1[i] = system.time({ st_intersection(g, buffer) })[[3]] ## st_intersects t2[i] = system.time({ sel = unlist(st_intersects(buffer, g)) g[sel, ] })[[3]] ## geos t3[i] = system.time({ sel = unlist(geos_intersects_matrix(buffer, g)) g[sel, ] })[[3]] ## s2 g = st_transform(g, 4326) buffer = st_transform(buffer, 4326) t4[i] = system.time({ sel = unlist(s2_intersects_matrix(buffer, g)) g[sel, ] })[[3]] } plot(NULL, xlim = c(0, max(n)), ylim = c(0, 1.1 * max(t4)), ylab = "Time [s]", xlab = "Number of points") lines(n, t1, type = "b", col = 1) lines(n, t2, type = "b", col = 2) lines(n, t3, type = "b", col = 3) lines(n, t4, type = "b", col = 4) legend("topleft", horiz = TRUE, lty = 1, col = 1:4, bty = "n", legend = c("st_intersection", "st_intersects", "geos", "s2")) ```
edzer commented 1 year ago

There's still this note here:

https://github.com/r-spatial/sf/blob/b54923714a994bd2a16b46342a20889f857fb51d/R/geom-transformers.R#L695

that essentially indicates we don't yet have a s2_intersection_matrix, but use a blunt loop. How hard would it be to create the _matrix version, @paleolimbot ?

rsbivand commented 1 year ago

And the second code chunk in https://r-spatial.org/book/12-Interpolation.html#a-population-grid ...

paleolimbot commented 1 year ago

Probably not hard!

EhrmannS commented 1 year ago

Was there any fix to this already? I've been running a st_intersection after pre-selecting with st_intersects, and I got a very significant speed-up (that presumably depends on the ratio of non-intersecting features).

For anybody else that also lands here, this is the snippet I am using:

library(sf)
library(dplyr)
library(purrr)
library(progress)

intersections <- st_intersects(x = xFeatures, y = yFeatures)

pb <- progress_bar$new(format = "[:bar] :current/:total (:percent)", total = dim(xFeatures)[1])

intersectFeatures <- map_dfr(1:dim(xFeatures)[1], function(ix){
  pb$tick()
  st_intersection(x = xFeatures[ix,], y = yFeatures[intersections[[ix]],])
})
KatieMurenbeeld commented 1 month ago

I know this is a year later, but to follow up on @EhrmannS comment. I used their code example, but when I do so I run into funny errors if I then try to use other st_*() functions. For example, if I try to run st_area(intersectFeatures) I get

Error in `stopifnot()`:
ℹ In argument: `area = st_area(.)`.
Caused by error:
! Not compatible with requested type: [type=list; target=double].

but if I run st_area(head(intersectFeatures)) it runs no problem. I am trying to figure out what happens to the geometries when I use head(intersectFeatures)?