Closed jniedballa closed 1 year ago
Buffering happens in the upstream libraries, GEOS or s2geometry (see for instance here), so this is the wrong place to report performance issues with these libraries. What you could do in sf is, for instance, simplify the line geometry and buffer that.
If you are using a planar coordinate system, you can try geos directly; it's a bit faster. You can also limit the number of segments (nQuadSegs in sf, quad_segs in geos), but it doesn't make a significant difference. As Edzer wrote, it's probably best to simplify the geometry.
library("sf")
library("geos")
n = 1000
x = simulate_linestring(n, "cluster") # this returns `linestr`
t = bench::mark(
check = FALSE,
st_buffer(x, dist = 1, nQuadSegs = 30),
st_buffer(x, dist = 1, nQuadSegs = 5),
geos_buffer(x, distance = 1)
)
t[, 1:4]
#> expression min median `itr/sec`
#> 1 st_buffer(x, dist = 1, nQuadSegs = 30) 3.77s 3.77s 0.265
#> 2 st_buffer(x, dist = 1, nQuadSegs = 5) 3.62s 3.62s 0.276
#> 3 geos_buffer(x, distance = 1) 2.75s 2.75s 0.364
Thank you both, and apologies for the out-of-scope question. Good to know about geos_buffer, that's helpful. st_simplify unfortunately doesn't seem to make a big difference (in terms of processing time of subsequent calls to st_buffer) on these point/line clusters.
For what it's worth, calling the GDAL buffervectors tool from R via the OSGEO4W shell with something like the code below is fast (even with the original data) and may be useful in some cases:
system("C:/OSGeo4W/OSGeo4W.bat qgis_process-qgis run gdal:buffervectors --distance_units=meters --area_units=m2 --ellipsoid=EPSG:7030 --INPUT=C:/path/to/file/input.shp --GEOMETRY=geometry --DISTANCE=20 --FIELD= --DISSOLVE=false --EXPLODE_COLLECTIONS=false --OPTIONS= --OUTPUT=C:/path/to/file/output.shp")
st_simplify unfortunately doesn't seem to make a big difference
By removing 40% of vertices I see ~3x speedup on this example and the results look very similar.
n = 2000
set.seed(123)
x = simulate_linestring(n, "cluster")
system.time(t1 <- st_buffer(x, dist = 1))
#> user system elapsed
#> 23.26 0.81 24.12
x = st_simplify(x, dTolerance = 1) # 1177 vertices
system.time(t2 <- st_buffer(x, dist = 1))
#> user system elapsed
#> 8.64 0.16 8.83
par(mfrow = c(1, 2))
plot(t1, main = "Original data")
plot(t2, main = "Simplified data")
Thank you, this is good. Using the attached GPS tracklog, I see no improvement in processing time of st_buffer when simplifying with dTolerance = 0 (despite a 33% reduction in vertices). dTolerance = 1 and 5 (in meters), however, bring massive improvements. geos_buffer is also much faster than st_buffer, even without simplification.
linestr <- st_read(dsn = "...",
layer = "gps_tracklog_with_clusters") # 5594 vertices
linestr_simp0 <- st_simplify(linestr, dTolerance = 0) # 3780 vertices
linestr_simp1 <- st_simplify(linestr, dTolerance = 1) # 1623 vertices
linestr_simp5 <- st_simplify(linestr, dTolerance = 5) # 433 vertices
system.time(
st_buffer(linestr, dist = 20)
)
User System elapsed
1201.70 1.00 1204.06
system.time(
st_buffer(linestr_simp0, dist = 20)
)
User System elapsed
1202.75 0.47 1204.00
system.time(
st_buffer(linestr_simp1, dist = 20)
)
User System elapsed
3.97 0.05 4.01
system.time(
st_buffer(linestr_simp5, dist = 20)
)
User System elapsed
0.50 0.06 0.56
system.time(
geos_buffer(linestr, dist = 20)
)
User System elapsed
18.64 0.71 19.36
The results all look very similar, even the buffer around the highly simplified track is close to the others.
So st_simplify + st_buffer and geos_buffer are both viable solutions for buffering heavily clustered linestrings, and much faster than st_buffer on the original data. Thank you very much for your help, problem solved!
system.time(sf::st_buffer(linestr, dist = 20))
#> User System elapsed
#> 1201.70 1.00 1204.06
system.time(geos::geos_buffer(linestr, dist = 20))
#> User System elapsed
#> 18.64 0.71 19.36
Some performance differences between {sf} and {geos} are expected, but it is very strange that the difference is this large (~60 times on this dataset). Still, it is good that you managed to solve this issue.
Are the results identical?
The results are identical, but on my PC (I use Windows too) the timings are similar. Strange.
linestr = sf::read_sf("gps_tracklog_with_clusters.shp")
system.time(t1 <- sf::st_buffer(linestr, dist = 20))
#> user system elapsed
#> 21.00 0.88 21.89
system.time(t2 <- geos::geos_buffer(linestr, distance = 20))
#> user system elapsed
#> 15.73 0.56 16.30
identical(sf::st_as_sfc(t1), sf::st_as_sfc(t2))
#> TRUE
sf::sf_extSoftVersion()
#> GEOS GDAL proj.4 GDAL_with_GEOS USE_PROJ_H PROJ
#> "3.9.3" "3.5.2" "8.2.1" "true" "true" "8.2.1"
A fresh installation of sf fixed it by updating the external libraries from
> sf::sf_extSoftVersion() # sf v1.0-8
GEOS GDAL proj.4 GDAL_with_GEOS USE_PROJ_H PROJ
"3.9.1" "3.4.3" "7.2.1" "true" "true" "7.2.1"
to
> sf::sf_extSoftVersion() # sf v1.0-9
GEOS GDAL proj.4 GDAL_with_GEOS USE_PROJ_H PROJ
"3.9.3" "3.5.2" "8.2.1" "true" "true" "8.2.1"
Now the timing is similar to geos_buffer:
system.time(
st_buffer(linestr, dist = 20)
)
User System elapsed
23.70 1.35 25.08
So it was something in the external libraries. Thank you very much!
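For anyone hitting a similar gap between st_buffer and geos_buffer, a quick check is to compare the GEOS versions the two packages actually use. This is only a sketch, assuming both sf and geos are installed; it reports versions and does not by itself explain a slowdown.

```r
# GEOS version that sf is linked against (one element of the named vector
# shown in the sf_extSoftVersion() output above)
sf_geos <- sf::sf_extSoftVersion()[["GEOS"]]

# GEOS version used by the geos package (bundled through libgeos)
geos_geos <- as.character(geos::geos_version())

# Print both side by side; a mismatch is a hint that the external
# libraries differ, not proof that this causes the timing difference
cat("sf uses GEOS:  ", sf_geos, "\n")
cat("geos uses GEOS:", geos_geos, "\n")
```

If the two differ, reinstalling sf so that it picks up newer external libraries, as described above, is one thing to try.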
Highly complex linestrings do cause the standard buffer algorithm to be slow, due to the number of buffer line segments that are generated and processed internally.
An approach which can be much faster is to split the line into sections (say, 10 vertices long), buffer the sections, and then union the results. This can be tens of times faster.
Ideally this heuristic improvement could be provided automatically by the buffer code, but it's hard to detect when it should be applied. It could easily be supplied as a separate buffer function, for use at the user's discretion (see JTS prototype implementation).
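The split, buffer, and union approach described above can be sketched in R with sf. This is a minimal illustration, not the JTS prototype: `buffer_in_sections` and `chunk_size` are hypothetical names, and the sketch assumes a single 2D LINESTRING in a projected (planar) CRS with the default round caps and joins.

```r
library(sf)

# Hypothetical helper (not part of sf): buffer a single LINESTRING in short
# sections and union the pieces. `chunk_size` = vertices per section.
buffer_in_sections <- function(line, dist, chunk_size = 10) {
  crs <- st_crs(line)
  # Keep only X/Y; assumes `line` holds exactly one LINESTRING
  coords <- st_coordinates(line)[, c("X", "Y"), drop = FALSE]
  n <- nrow(coords)
  stopifnot(n >= 2, chunk_size >= 2)

  # Consecutive sections share one vertex, so the buffered pieces overlap
  # at the joins and the union has no gaps
  starts <- seq(1, n - 1, by = chunk_size - 1)
  sections <- lapply(starts, function(i) {
    st_linestring(coords[i:min(i + chunk_size - 1, n), , drop = FALSE])
  })

  # Buffer every section separately, then merge into one geometry
  pieces <- st_buffer(st_sfc(sections, crs = crs), dist)
  st_union(pieces)
}

# Usage on a small random-walk line; EPSG:32633 is an arbitrary projected CRS
set.seed(1)
xy <- cbind(cumsum(rnorm(200)), cumsum(rnorm(200)))
line <- st_sfc(st_linestring(xy), crs = 32633)
buf_sections <- buffer_in_sections(line, dist = 1)
buf_direct <- st_buffer(line, dist = 1)
```

With round caps the union of the section buffers should cover essentially the same area as the direct buffer, apart from small differences in how arcs are approximated. Whether it is actually faster depends on the data and the GEOS version, so it is worth timing both on your own tracks.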
st_buffer() can be very slow with complex linestrings, e.g. from GPS tracklogs. Especially clusters of points (e.g. when a GPS device didn't move but kept recording points, as shown below) can take very long to process:
In the track shown above st_buffer() took 1250 seconds, but in QGIS gdal:buffervectors took about 50 seconds.
Is there any way to improve the performance of st_buffer() in such clustered points?
As a reproducible example, below is a function for creating random linestrings, either linear or clustered, with adjustable number of points, and some benchmark results to illustrate the differences between linear and clustered linestrings.
Simulate linestrings with varying number of points:
Difference between linear and clustered linestrings (both with their buffer):
st_buffer takes much longer to process the clustered points (~300x longer for 2000 vertices):
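The original simulate_linestring() function is not reproduced in this text. To make the benchmark snippets in the replies runnable, the following is a hypothetical stand-in, not the author's function: it only assumes the call signature `simulate_linestring(n, type)` used in the thread, and the noise levels and CRS are arbitrary choices.

```r
library(sf)

# Stand-in for the missing simulate_linestring(); NOT the original code.
# Returns an sfc with one LINESTRING of `n` vertices.
simulate_linestring <- function(n, type = c("linear", "cluster")) {
  type <- match.arg(type)
  if (type == "linear") {
    # Steady movement along x with small lateral noise
    xy <- cbind(seq_len(n), rnorm(n, sd = 0.1))
  } else {
    # Stationary device: fixes scatter around a single location,
    # so the line doubles back over itself many times
    xy <- cbind(rnorm(n, sd = 1), rnorm(n, sd = 1))
  }
  # EPSG:32633 (UTM 33N) is an assumed projected CRS with units in meters
  st_sfc(st_linestring(xy), crs = 32633)
}

set.seed(123)
x_linear <- simulate_linestring(500, "linear")
x_cluster <- simulate_linestring(500, "cluster")
```

Timings obtained with this stand-in will not match the numbers reported in the thread, since the geometry differs from the original simulation.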