r-spatial / sf

Simple Features for R
https://r-spatial.github.io/sf/
Other
1.35k stars 299 forks source link

st_buffer very slow for complex linestrings #2039

Closed jniedballa closed 1 year ago

jniedballa commented 2 years ago

st_buffer() can be very slow with complex linestrings, e.g. from GPS tracklogs. Especially clusters of points (e.g. when a GPS device didn't move but kept recording points, as shown below) can take very long to process:

image

In the track shown above st_buffer() took 1250 seconds, but in QGIS gdal:buffervectors took about 50 seconds.

Is there any way to improve the performance of st_buffer() in such clustered points?

As a reproducible example, below is a function for creating random linestrings, either linear or clustered, with adjustable number of points, and some benchmark results to illustrate the differences between linear and clustered linestrings.

library(sf)
library(ggplot2)

simulate_linestring <- function(n_pts, type, dist = 1) {
  if(type == "cluster") {
    coords_tmp <- matrix(rnorm(n_pts * 2), n_pts, 2)
    coords_tmp[(n_pts/2) : n_pts,1] <- coords_tmp[(n_pts/2) : n_pts,1] + 10
  }

  if(type == "linear") {
    coords_tmp <- matrix(rnorm(n_pts * 2), n_pts, 2)
    coords_tmp[,1] <- coords_tmp[,1] + seq(1, length.out = nrow(coords_tmp), by = 1)
  }

  linestr <- st_linestring(x = coords_tmp, dim = "XY")

  timing <- system.time(
    linestr_buff <- st_buffer(linestr, dist = dist)
  )

  timing_df <- cbind(as.data.frame(t(c(timing, 
                                       n_pts = n_pts))),
                     type = type)

  return(list(linestring = linestr,
              buffer = linestr_buff,
              timing = timing_df))
}

Simulate linetrings with varying number of points:

number_points <- c(10, 100, 500, 1000, 2000)
out_lin <- lapply(number_points, 
               simulate_linestring,
               type = "linear")

out_clu <- lapply(number_points, 
                  simulate_linestring,
                  type = "cluster")

names(out_lin) <- names(out_clu) <- number_points

Difference between linear and clustered linestrings (both with their buffer):

par(mfrow = c(1,2))
plot(out_lin$`500`$buffer, col = "grey", main = "linear")
plot(out_lin$`500`$linestring, add = T, col = "red")
axis(1); axis(2)

plot(out_clu$`500`$buffer, col = "grey", main = "cluster")
plot(out_clu$`500`$linestring, add = T, col = "red")
axis(1); axis(2)

image

st_buffer takes much longer to process the clustered points (~300x longer for 2000 vertices):

df_timings_lin <- do.call("rbind", lapply(out_lin, FUN = function(x) x$timing))
df_timings_clu <- do.call("rbind", lapply(out_clu, FUN = function(x) x$timing))

df_timings <- rbind(df_timings_lin, df_timings_clu)

ggplot(df_timings, aes(x = n_pts, y = elapsed, color = type)) +
  geom_line(linewidth = 1) +
  geom_point(size = 2) +
  xlab ("Number of vertices") + ylab("Time elapsed [s]")

image

> sessionInfo()
R version 4.2.1 (2022-06-23 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 17763)

Matrix products: default

locale:
[1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252    LC_MONETARY=German_Germany.1252
[4] LC_NUMERIC=C                    LC_TIME=German_Germany.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] ggplot2_3.4.0 sf_1.0-8     

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.9         rstudioapi_0.14    magrittr_2.0.3     units_0.8-0        munsell_0.5.0      tidyselect_1.2.0  
 [7] colorspace_2.0-3   R6_2.5.1           rlang_1.0.6        fansi_1.0.3        dplyr_1.0.10       tools_4.2.1       
[13] grid_4.2.1         gtable_0.3.1       KernSmooth_2.23-20 utf8_1.2.2         cli_3.4.1          e1071_1.7-12      
[19] DBI_1.1.3          withr_2.5.0        class_7.3-20       assertthat_0.2.1   tibble_3.1.8       lifecycle_1.0.3   
[25] farver_2.1.1       vctrs_0.5.0        glue_1.6.2         labeling_0.4.2     proxy_0.4-27       compiler_4.2.1    
[31] pillar_1.8.1       scales_1.2.1       generics_0.1.3     classInt_0.4-8     pkgconfig_2.0.3 
edzer commented 2 years ago

Buffering happens in the upstream libraries, GEOS or s2geometry, see for instance here, for performance issues with these libraries this is the wrong place. What you could do in sf is for instance simplifying the line geometry, and buffering that.

kadyb commented 2 years ago

If you are using a planar coordinate system, you can try geos directly -- it's a bit faster. You can also limit the number of segments (nQuadSegs, quad_segs), but it doesn't make a significant difference. As Edzer wrote, it's probably best to simplify the geometry.

library("sf")
library("geos")
n = 1000
x = simulate_linestring(n, "cluster") # this returns `linestr`

t = bench::mark(
  check = FALSE,
  st_buffer(x, dist = 1, nQuadSegs = 30),
  st_buffer(x, dist = 1, nQuadSegs = 5),
  geos_buffer(x, distance = 1)
)
t[, 1:4]
#>   expression                                  min   median `itr/sec`
#> 1 st_buffer(x, dist = 1, nQuadSegs = 30)    3.77s    3.77s     0.265
#> 2 st_buffer(x, dist = 1, nQuadSegs = 5)     3.62s    3.62s     0.276
#> 3 geos_buffer(x, distance = 1)              2.75s    2.75s     0.364
jniedballa commented 2 years ago

Thank you both, and apologies for the out-of-scope question. Good to know about geos_buffer, that's helpful. st_simplify unfortunately doesn't seem to make a big difference (in terms of processing time of subsequent calls to st_buffer) on these point/line clusters.

For what it's worth, calling the GDAL buffervectors tool from R via the OSGEO4W shell with something like the code below is fast (even with the original data) and may be useful in some cases:

system("C:/OSGeo4W/OSGeo4W.bat qgis_process-qgis run gdal:buffervectors --distance_units=meters --area_units=m2 --ellipsoid=EPSG:7030 --INPUT=C:/path/to/file/input.shp --GEOMETRY=geometry --DISTANCE=20 --FIELD= --DISSOLVE=false --EXPLODE_COLLECTIONS=false --OPTIONS= --OUTPUT=C:/path/to/file/output.shp")

kadyb commented 2 years ago

st_simplify unfortunately doesn't seem to make a big difference

By removing 40% of vertices I see ~3x speedup on this example and the results look very similar.

n = 2000
set.seed(123)
x = simulate_linestring(n, "cluster")
system.time(t1 <- st_buffer(x, dist = 1))
#>  user  system elapsed 
#> 23.26    0.81   24.12

x = st_simplify(x, , dTolerance = 1) # 1177 vertices
system.time(t2 <- st_buffer(x, dist = 1))
#> user  system elapsed 
#> 8.64    0.16    8.83

par(mfrow = c(1, 2))
plot(t1, main = "Original data")
plot(t2, main = "Simplified data")

Rplot

jniedballa commented 2 years ago

Thank you, this is good. Using the attached GPS tracklog, I see no improvement in processing time of st_buffer when simplifying with dTolerance = 0 (despite a 33% reduction in vertices). dTolerance = 1 and 5 (in meters) however bring massive improvements. geos_buffer is also much faster then st_buffer, even without simplification.

linestr <- st_read(dsn = "...",    
                   layer = "gps_tracklog_with_clusters")  # 5594 vertices

linestr_simp0 <- st_simplify(linestr, dTolerance = 0)  # 3780 vertices
linestr_simp1 <- st_simplify(linestr, dTolerance = 1)  # 1623 vertices 
linestr_simp5 <- st_simplify(linestr, dTolerance = 5)  # 433 vertices

system.time(
st_buffer(linestr,  dist = 20)
)
       User      System elapsed
    1201.70        1.00     1204.06 

system.time(
st_buffer(linestr_simp0, dist = 20)
)
       User      System elapsed
    1202.75        0.47     1204.00 

system.time(
 st_buffer(linestr_simp1, dist = 20)
)
       User      System elapsed
       3.97        0.05        4.01 

system.time(
st_buffer(linestr_simp5, dist = 20)
)
       User      System elapsed
       0.50        0.06        0.56 

system.time(
 geos_buffer(linestr, dist = 20)
)
       User      System elapsed
      18.64        0.71       19.36 

The results all look very similar, even the buffer around the highly simplified track is close to the others.

So st_simplify + st_buffer and geos_buffer both are viable solutions for buffering heavily clustered linestrings, and much faster than st_buffer on the original data. Thank you very much for your help, problem solved!

gps_tracklog_with_clusters.zip

kadyb commented 2 years ago
system.time(sf::st_buffer(linestr,  dist = 20))
#>    User  System  elapsed
#> 1201.70    1.00  1204.06 

system.time(geos::geos_buffer(linestr,  dist = 20))
#>  User  System  elapsed
#> 18.64    0.71    19.36

Some performance differences between {sf} and {geos} are expected but it is very strange why there is such difference (~60 times on this dataset). But it is good that you managed to solve this issue.

edzer commented 2 years ago

Are the results identical?

kadyb commented 2 years ago

The results are identical, but on my PC (I use Windows too) the timings are similar. Strange.

linestr = sf::read_sf("gps_tracklog_with_clusters.shp")
system.time(t1 <- sf::st_buffer(linestr, dist = 20))
#>  user  system elapsed 
#> 21.00    0.88   21.89

system.time(t2 <- geos::geos_buffer(linestr,  distance = 20))
#>  user  system elapsed 
#> 15.73    0.56   16.30

identical(sf::st_as_sfc(t1), sf::st_as_sfc(t2))
#> TRUE

sf::sf_extSoftVersion()
#> GEOS           GDAL         proj.4 GDAL_with_GEOS     USE_PROJ_H           PROJ
#> "3.9.3"        "3.5.2"        "8.2.1"         "true"         "true"        "8.2.1"
jniedballa commented 2 years ago

A fresh installation of sf fixed it by updating the external libraries from

> sf::sf_extSoftVersion()   # sf v1.0-8
          GEOS           GDAL         proj.4 GDAL_with_GEOS     USE_PROJ_H           PROJ 
       "3.9.1"        "3.4.3"        "7.2.1"         "true"         "true"        "7.2.1" 

to

> sf::sf_extSoftVersion()    # sf v1.0-9
          GEOS           GDAL         proj.4 GDAL_with_GEOS     USE_PROJ_H           PROJ 
       "3.9.3"        "3.5.2"        "8.2.1"         "true"         "true"        "8.2.1" 

Now the timing is similar to geos_buffer:

system.time(
st_buffer(linestr, dist = 20)
)
       User      System elapsed
      23.70        1.35       25.08

So it was something in the external libraries. Thank you very much!

dr-jts commented 1 year ago

Highly complex linestrings do cause the standard buffer algorithm to be slow, due to the number of buffer line segments that are generated and processed internally.

An approach which can be much faster is to split the line into sections (say, 10 vertices long), buffer the sections, and then union the results. This can be tens of times faster.

Ideally this heuristic improvement could be provided automatically by the buffer code, but it's hard to detect when it should be applied. It could easily be supplied as a separate buffer function, for use at user's discretion (see JTS prototype implementation).