An experimental R package to parallelise some functions from the excellent {sf} and {rmapshaper} packages using the also brilliant {furrr} package. Right now, it’s just parallel versions of st_join and st_filter from {sf} and ms_simplify from {rmapshaper} (although this last one now seems essentially redundant after some major performance improvements in {rmapshaper}). They won’t always help, and may even be slower, but sometimes they might be useful. This is just messing about right now, tbh. The code is deliberately copied from {sf}, {rmapshaper} and {furrr} so that the functions can be used as drop-in replacements.
I’ve added {geoarrow} as a dependency to play with using it to pass data between cores - it seems to be fractionally faster and solves the issue of passing objects larger than the limit allowed by {furrr}… make sure to install {arrow} if you want to try this out.
Make sure to experiment with the number of cores - it will often be much more efficient to use a small number of processes than all of your machine’s available cores, because each process takes time to start up.
# install.packages("remotes")
remotes::install_github('h-a-graham/sfurrr')
So here is some data included in the package - it is the English cycle network from OpenStreetMap (downloaded with {osmextract}) and British counties from Ordnance Survey.
library(sfurrr)
# built-in functions to load the data
cwe <- cycleways_england()
gbc <- gb_counties()
basetheme::basetheme("dark") # makes it pretty
plot(gbc['geometry'], axes = TRUE)
plot(cwe['geometry'], add=TRUE, col='#39C17360')
summary(cwe)
## osm_id highway geometry
## Length:120451 Length:120451 LINESTRING :120451
## Class :character Class :character epsg:27700 : 0
## Mode :character Mode :character +proj=tmer...: 0
summary(gbc)
## Name Area_Description geometry
## Length:91 Length:91 MULTIPOLYGON :91
## Class :character Class :character epsg:27700 : 0
## Mode :character Mode :character +proj=tmer...: 0
Now, let’s say we want to do a spatial join between the cycleways and the counties so we attach the county data to the cycleway network. This might allow us to do some summarised stats on the cycle network of different counties, for example.
So let’s do this with {sf}, which is loaded by default with {sfurrr}. Let’s also get some timings with {tictoc}.
library(tictoc)
tic()
join.sf <- st_join(cwe,
gbc)
toc()
## 6.165 sec elapsed
plan(multisession, workers = 4)
tic()
join.sfurr <- future_st_join(cwe,
gbc)
toc()
## 4.45 sec elapsed
Okay, so {sf} is actually pretty fast! By using a small number of cores (here 4) we can get a slight speed-up - any more cores and it would get increasingly slow. But what about more costly spatial operations? Let’s try now with the option largest=TRUE, which joins based on the largest amount of intersection.
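To make the difference concrete, here is a small sketch on made-up data. The two squares, the single line and the `unit_square()` helper are purely hypothetical and not part of the package; only plain {sf} is used. A line that crosses two polygons should come back once per polygon from a default join, and only once when `largest = TRUE` is set.

```r
library(sf)

# Helper (hypothetical): a 1 x 1 square whose lower-left corner is at (xmin, 0)
unit_square <- function(xmin) {
  st_polygon(list(rbind(
    c(xmin, 0), c(xmin + 1, 0), c(xmin + 1, 1), c(xmin, 1), c(xmin, 0)
  )))
}

# Two adjacent squares standing in for counties
toy_counties <- st_sf(
  Name = c("A", "B"),
  geometry = st_sfc(unit_square(0), unit_square(1))
)

# One line standing in for a cycleway: 0.2 units lie in A, 0.6 units in B
toy_path <- st_sf(
  osm_id = "1",
  geometry = st_sfc(st_linestring(rbind(c(0.8, 0.5), c(1.6, 0.5))))
)

# Default join: one output row for every county the line intersects
join_all <- st_join(toy_path, toy_counties)

# largest = TRUE: one output row per line, matched to the biggest overlap
join_largest <- st_join(toy_path, toy_counties, largest = TRUE)

nrow(join_all)
nrow(join_largest)
join_largest$Name
```

The extra work of measuring every overlap is what makes this option so much slower on the full cycle network.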
# ------------ `st_join` ----------------
tic()
joinL.sf <- st_join(cwe,
gbc, largest=TRUE)
toc()
## 91.203 sec elapsed
# ------------ `future_st_join` ----------------
plan(multisession, workers = 8)
tic()
joinL.sfurr <- future_st_join(cwe,
gbc, largest=TRUE)
toc()
## 25.912 sec elapsed
Okay, so now we see that going parallel does pay off with a costly spatial function: roughly 91 seconds drops to about 26. Here we use 8 processes, and the extra workers are worth it because the computation is expensive.
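As an example of the summarised stats mentioned earlier, the joined network can be turned into a total cycleway length per county. This is only a sketch: it assumes the `Name` column shown in the `summary(gbc)` output, takes lengths to be in metres because the data are in EPSG:27700, and picks 4 workers arbitrarily.

```r
library(sf)
library(future)
library(sfurrr)

cwe <- cycleways_england()
gbc <- gb_counties()

# Attach each cycleway to the single county it overlaps most
plan(multisession, workers = 4)
joined <- future_st_join(cwe, gbc, largest = TRUE)
plan(sequential)  # shut the background workers down again

# Length of every cycleway in km (EPSG:27700 uses metres)
joined$length_km <- as.numeric(st_length(joined)) / 1000

# Total length per county; rows without a matching county (NA Name) are dropped
county_km <- aggregate(
  length_km ~ Name,
  data = st_drop_geometry(joined),
  FUN = sum
)

# Counties with the most cycleway first
head(county_km[order(county_km$length_km, decreasing = TRUE), ])
```

Using `largest = TRUE` here means a cycleway that crosses a county boundary is counted once rather than once per county.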
Once again, here is a comparison of the simplest approach, this time for st_filter with the st_intersects spatial predicate (the default).
# ------------ `st_filter` ----------------
tic()
filt_t1 <- st_filter(cwe['highway'],
gbc[1:50,])
toc()
## 5.472 sec elapsed
# ----------- `future_st_filter` -----------------
plan(multisession, workers = 4)
tic()
filt_t2 <- future_st_filter(cwe['highway'],
gbc[1:50,])
toc()
## 4.28 sec elapsed
Again, with a limited number of cores there is a small speed-up, but not that much… Let’s use the st_within spatial predicate to filter out cycleways that are not located entirely within the county areas… This is kind of pointless for this use case and is just illustrative really…
# ------------ `st_filter` ----------------
tic()
within_filt_t1 <- st_filter(joinL.sfurr,
gbc[1:50,], .predicate = st_within)
toc()
## 68.937 sec elapsed
# ----------- `future_st_filter` -----------------
plan(multisession, workers = 6)
tic()
within_filt_t2 <- future_st_filter(joinL.sfurr,
gbc[1:50,], .predicate = st_within)
toc()
## 23.74 sec elapsed
Cool, so in this case it is faster!
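For anyone unsure what the two predicates actually keep, here is a tiny sketch with plain {sf} on made-up geometries (the square and the two lines are hypothetical, not package data). The default st_intersects keeps anything that touches the polygon, while st_within keeps only features lying entirely inside it.

```r
library(sf)

# One square standing in for a county
toy_county <- st_sf(
  Name = "A",
  geometry = st_sfc(st_polygon(list(rbind(
    c(0, 0), c(1, 0), c(1, 1), c(0, 1), c(0, 0)
  ))))
)

# Two lines: one fully inside the square, one crossing its boundary
toy_paths <- st_sf(
  id = c("inside", "crossing"),
  geometry = st_sfc(
    st_linestring(rbind(c(0.2, 0.5), c(0.8, 0.5))),
    st_linestring(rbind(c(0.5, 0.5), c(1.5, 0.5)))
  )
)

# Default predicate (st_intersects): both lines are kept
kept_intersects <- st_filter(toy_paths, toy_county)

# st_within: the line that leaves the square is dropped
kept_within <- st_filter(toy_paths, toy_county, .predicate = st_within)

kept_intersects$id
kept_within$id
```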
This is not a globally useful idea but in some cases, when using very large spatial datasets, you may get a speed-up by running spatial filters/joins in parallel. Any speed-up will depend on the number of processes you can run; remember it is probably not wise to use all the cores at your disposal - sometimes less is more!
plot(gbc['geometry'], axes = TRUE)
plot(within_filt_t2['Name'], add=TRUE)
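To follow the advice about not using every core, one option is to time the same operation with a few different worker counts and pick the fastest. The sketch below is only an illustration: the candidate counts are arbitrary, `availableCores()` comes from {future}, and a single run per setting is noisy, so treat the numbers as a rough guide.

```r
library(sf)
library(future)
library(sfurrr)

cwe <- cycleways_england()
gbc <- gb_counties()

# Worker counts to try, limited to the cores this machine reports
worker_counts <- c(2, 4, 6, 8)
worker_counts <- worker_counts[worker_counts <= availableCores()]

# Time the same join once per worker count
elapsed_sec <- vapply(worker_counts, function(n) {
  plan(multisession, workers = n)
  system.time(future_st_join(cwe, gbc))[["elapsed"]]
}, numeric(1))

plan(sequential)  # release the background R sessions

worker_timings <- data.frame(workers = worker_counts, elapsed_sec = elapsed_sec)
worker_timings
```

Swapping in the operation you actually care about (for example a join with `largest = TRUE`) gives a more relevant answer, since the best worker count depends on how expensive each chunk of work is.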